A problem with video-to-video synthesis (vid2vid) is the technique's forgetfulness. Take the 2016 "Mannequin Challenge" viral video trend as an example: people remain frozen while a camera moves through the scene to capture them. Viewers would naturally be confused if the camera returned to a previously captured person but their face appeared totally different — as if they were wearing a magically transformative mask.
This phenomenon has plagued vid2vid methods that generate video frames based on only the information available in the immediately preceding frame(s). For AI researchers, achieving vid2vid temporal consistency over the entire rendered 3D world is a challenge. A team of NVIDIA researchers addresses the problem in the paper World-Consistent Video-to-Video Synthesis, which proposes a novel vid2vid framework that utilizes all past generated frames during rendering.
The researchers say that although current vid2vid models have improved generation results in terms of photorealism and short-term temporal stability, methods for generating videos with long-term temporal consistency have lagged. In practical terms, the issue surfaces, for example, when a car dash camera revisits a location in a virtual city. Despite receiving the same semantic inputs, existing vid2vid methods could synthesize a scene that differs from what they generated when the car first visited the location. "This is because they lack knowledge of the 3D world being rendered and generate each frame only based on the past few frames," the researchers explain.
The team introduces the concept of guidance images: "physically-grounded estimates of what the next output frame should look like, based on how the world has been generated so far." Using a series of images captured with known camera positions and parameters, an output image can be generated for a particular viewpoint. This output image is then back-projected into the 3D scene, and from it a guidance image for the following camera position is created to help generate an output that is consistent across views and smooth over time.
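The core geometric step here — re-projecting already-generated, colored 3D points into the next camera view so that previously seen surfaces land at the right pixels — can be sketched as a simple pinhole projection. This is a minimal illustration, not the paper's implementation; the function name `make_guidance_image` and the nearest-pixel splatting are assumptions for clarity.

```python
import numpy as np

def make_guidance_image(points_world, colors, K, R, t, h, w):
    """Project colored 3D world points into a camera view (pinhole model).

    points_world: (N, 3) 3D points recovered from previously generated frames
    colors:       (N, 3) colors those points were given when first synthesized
    K:            (3, 3) camera intrinsics; R, t: world-to-camera rotation/translation
    Returns an (h, w, 3) guidance image and a boolean mask of covered pixels;
    pixels never seen before stay zero and are left for the generator to fill.
    """
    cam = (R @ points_world.T + t[:, None]).T        # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                              # discard points behind the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide
    guidance = np.zeros((h, w, 3))
    mask = np.zeros((h, w), dtype=bool)
    for (u, v), c, ok in zip(uv, colors, in_front):
        x, y = int(round(u)), int(round(v))
        if ok and 0 <= x < w and 0 <= y < h:         # nearest-pixel splat (toy choice)
            guidance[y, x] = c
            mask[y, x] = True
    return guidance, mask
```

In the actual pipeline the 3D points come from multi-view geometry over the input sequence, and the guidance image is fed to the generator alongside the semantic labels rather than used directly as output.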
At the centre of the generator network is the novel multi-SPADE (SPatially-Adaptive (DE)normalization) module, which includes multiple SPADE operations and uses input labels, warped previous frames, and guidance images to modulate the features in each layer of the generator.
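The idea behind a SPADE operation is that the conditioning input is not concatenated with the features but instead predicts spatially varying scale and shift maps that modulate normalized activations; multi-SPADE simply chains one such operation per conditioning input. The following is a toy numpy sketch of that flow — the 1x1-conv weights and the plain per-channel normalization are simplifying assumptions, not the paper's exact architecture.

```python
import numpy as np

def spade(x, cond, w_gamma, w_beta, eps=1e-5):
    """One SPADE op: normalize x per channel, then modulate with spatial
    gamma/beta maps predicted from the conditioning input.

    x:    (C, H, W) feature map;  cond: (C_cond, H, W) conditioning map
    w_gamma, w_beta: (C, C_cond) weights standing in for small conv nets
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)           # parameter-free normalization
    gamma = np.einsum('oc,chw->ohw', w_gamma, cond)  # spatially varying scale
    beta = np.einsum('oc,chw->ohw', w_beta, cond)    # spatially varying shift
    return x_norm * (1 + gamma) + beta

def multi_spade(x, conds, weights):
    """Chain one SPADE op per conditioning input, e.g. the semantic label map,
    the warped previous frame, and the guidance image, in sequence."""
    for cond, (wg, wb) in zip(conds, weights):
        x = spade(x, cond, wg, wb)
    return x
```

With zero modulation weights the module reduces to plain normalization, which makes the design easy to reason about: each conditioning stream can only perturb the normalized features, and the guidance-image stream is what anchors the output to the world generated so far.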
The researchers trained and evaluated their method on three datasets — Cityscapes, MannequinChallenge, and ScanNet. Both quantitative and visual results showed the effectiveness of the new synthesis architecture in achieving world consistency. "The output video is consistent within the entire rendered 3D world," the researchers conclude.
The team says the new approach could be suitable for applications such as rendering scenes in game engines and enabling simultaneous multi-agent world creation and exploration.
The paper World-Consistent Video-to-Video Synthesis is available on GitHub.
Journalist: Fangyu Cai | Editor: Michael Sarazen