The scope of video editing and manipulation techniques has dramatically increased thanks to AI. Video inpainting algorithms, for example, can complete missing regions in video frames with plausible content, with practical applications in corrupted video restoration, unwanted object removal, and video retargeting. However, despite the significant progress in video inpainting afforded by 3D convolutions and recurrent networks, today’s state-of-the-art (SOTA) methods can still struggle with issues such as blurriness and temporal artifacts.
A recent paper by a team of researchers from Sun Yat-sen University, the Key Laboratory of Machine Intelligence and Advanced Computing, and Microsoft Research Asia introduces a joint Spatial-Temporal Transformer Network (STTN) to tackle such video inpainting challenges.

Current SOTA methods apply attention modules to capture long-range correspondences, so that visible content from temporally distant frames can be used to fill missing regions in a target frame. This addressed a problem with earlier methods, which suffered from temporal artifacts because missing regions were filled using information from nearby frames only. These attention-based approaches, however, still face limitations such as inconsistent frame matching caused by complex motion in videos.
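To make the basic idea concrete, here is a minimal sketch (not the authors' code) of how attention can let a masked location in one frame borrow visible content from every frame in a clip. The tensor shapes and the `attention_fill` helper are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def attention_fill(feats, mask):
    """feats: (T, N, C) per-location features for T frames; mask: (T, N) float, 1.0 = missing."""
    T, N, C = feats.shape
    q = feats.reshape(T * N, C)                        # queries: every location in every frame
    k = q                                              # keys/values: the same set of locations
    scores = q @ k.t() / C ** 0.5                      # (T*N, T*N) pairwise similarities
    visible = mask.reshape(1, T * N) < 0.5             # only attend to known (visible) content
    scores = scores.masked_fill(~visible, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    filled = attn @ k                                  # weighted sum of visible features
    missing = mask.reshape(T * N, 1) > 0.5
    # keep known features as they are; replace holes with the attended content
    return torch.where(missing, filled, q).reshape(T, N, C)
```

In a real model the queries and keys would come from learned projections and the features would be decoded back to pixels, but the key point is that every hole can draw on visible content from the entire clip rather than only from adjacent frames.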
To address this, the researchers added a twist: their STTN formulation takes both neighbouring and distant frames as conditions and simultaneously fills missing regions in all input frames. In this way, temporal consistency across neighbouring frames is preserved while the completed frames remain coherent with “the whole story” of the video.
The core design of STTN is a transformer that searches for coherent content from all the frames along both spatial and temporal dimensions using a multi-scale patch-based attention module. The module is responsible for extracting patches of various scales from all video frames in order to cover the different appearance changes caused by complex motions in the video. At the same time, the multi-head transformer calculates similarities on spatial patches across different scales.
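Based on the description above (rather than the authors' released code), a multi-scale patch attention of this kind might look roughly like the following sketch. The patch sizes, the simple averaging of the per-scale results, and the `multi_scale_patch_attention` name are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def multi_scale_patch_attention(feats, patch_sizes=((8, 8), (4, 4), (2, 2))):
    """feats: (T, C, H, W) features of all input frames; assumes H, W divisible by the patch sizes."""
    T, C, H, W = feats.shape
    outputs = []
    for ph, pw in patch_sizes:                          # one scale per "head"
        # split every frame into non-overlapping (ph, pw) patches
        patches = F.unfold(feats, kernel_size=(ph, pw), stride=(ph, pw))   # (T, C*ph*pw, L)
        L = patches.shape[-1]
        tokens = patches.permute(0, 2, 1).reshape(T * L, C * ph * pw)      # all patches from all frames
        # attention among every spatial-temporal patch at this scale
        scores = tokens @ tokens.t() / tokens.shape[-1] ** 0.5
        attn = F.softmax(scores, dim=-1)
        out = (attn @ tokens).reshape(T, L, C * ph * pw).permute(0, 2, 1)
        # fold the attended patches back into the frame layout
        outputs.append(F.fold(out, output_size=(H, W), kernel_size=(ph, pw), stride=(ph, pw)))
    # merge the scales (a simplification; the real model learns how to combine heads)
    return torch.stack(outputs).mean(0)
```

The sketch captures the core idea described in the paper: every patch, at every scale, can be matched against patches from every frame in the clip, so both nearby and distant frames contribute to the completed result.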


In extensive qualitative and quantitative evaluations, the STTN model outperformed state-of-the-art methods, with improvements of 2.4 percent and 19.7 percent on PSNR (Peak Signal-to-Noise Ratio) and VFID (Video-based Fréchet Inception Distance) respectively. PSNR is a popular metric for video quality assessment, while VFID adapts FID, a perceptual metric widely employed by inpainting models, to video. Additionally, the researchers note that the visual results from their model are perceptually pleasing and coherent.
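For reference, PSNR is derived from the mean squared error between a restored frame and its ground truth; a minimal NumPy version (assuming 8-bit frames) looks like this. VFID, by contrast, compares feature statistics extracted by a pretrained video recognition network and is not reproduced here.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two same-sized 8-bit frames; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                 # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```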



The team identifies a number of STTN limitations. When videos contain continuous quick motions, the model may generate blurs within large missing masks (the binary images marking the pixels to be filled, such as those belonging to moving objects in the scene). To enhance short-term coherence, the team plans to extend the transformer by using attention on 3D spatial-temporal patches in the future.
The paper Learning Joint Spatial-Temporal Transformations for Video Inpainting is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen

