The rapid development of AI models such as variational autoencoders (VAE) and generative adversarial networks (GAN) that can generate audio, images and video has opened a Pandora’s box of digital fakery. Today’s models can synthesize highly convincing images and voices and can even swap a person’s face onto a video clip. The techniques lag, however, in natural video generation, where research remains at an early stage and even state-of-the-art results are disappointing.
In a new paper, UK research company DeepMind introduces DVD-GAN (not “digital versatile disc” but “dual video discriminator”) for video generation on large-scale datasets. DVD-GAN can produce videos at resolutions up to 256×256 and lengths up to 48 frames. The technique has achieved a state-of-the-art Fréchet Inception Distance (FID) on the frame-conditional prediction task for Kinetics-600, and a state-of-the-art Inception Score (IS) for video synthesis on the UCF-101 dataset.
Below is a set of four-second synthesized video clips from a model trained on 12-frame clips at 128×128 resolution from Kinetics-600, a large dataset of 10-second, high-resolution YouTube clips originally created for the task of human action recognition.
At first glance the clips seem to present recognizable actions such as dancing, skiing, and jumping. But a closer examination reveals much of the generated video content is blurred, indistinct, or even surreal. Is that man’s head flying off his body? Is that an alien scrubbing a yak?
Below is another random batch of truncated DVD-GAN samples from a model trained on 48-frame clips at 64×64 resolution from Kinetics-600. The synthesized baby faces look fairly realistic — or do they?
Despite the many visual aberrations, the DVD-GAN model has improved video generation performance. Although large-scale high-quality data is the fuel that drives machine learning model performance, researchers had struggled to train previous video-generation models efficiently on large datasets due to high data complexity and computational requirements.
DeepMind has overcome this challenge by extending its home-grown image generation model BigGAN to video and introducing extra techniques to accelerate training, including a dual-discriminator architecture consisting of a spatial discriminator that critiques the content of individual frames and a temporal discriminator that critiques motion across frames, and separable self-attention applied sequentially over the height, width and time axes.
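The separable self-attention idea can be illustrated in a few lines: instead of one joint attention over all T×H×W positions (whose cost grows with the square of that product), attention is applied along one axis at a time. Below is a minimal NumPy sketch of that idea under assumed (T, H, W, C) tensors; the function names and shapes are illustrative, not DeepMind's implementation, and projection weights and multiple heads are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Scaled dot-product self-attention along a single axis of x (T, H, W, C)."""
    x = np.moveaxis(x, axis, -2)               # (..., L, C): other axes act as batch
    scores = x @ np.swapaxes(x, -1, -2)        # (..., L, L) pairwise similarities
    weights = softmax(scores / np.sqrt(x.shape[-1]), axis=-1)
    out = weights @ x                          # (..., L, C) weighted combination
    return np.moveaxis(out, -2, axis)          # restore original axis order

def separable_self_attention(x):
    """Attend over height, width, then time in a row, never jointly."""
    for axis in (1, 2, 0):                     # H, W, T for x of shape (T, H, W, C)
        x = axis_attention(x, axis)
    return x

video = np.random.randn(8, 16, 16, 32)        # toy video: 8 frames, 16x16, 32 channels
out = separable_self_attention(video)
print(out.shape)                               # (8, 16, 16, 32)
```

Each per-axis attention only compares positions along that one axis, so the quadratic blow-up is limited to H², W² or T² rather than (HWT)², which is what makes attention affordable on video-sized tensors.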
Researchers evaluated DVD-GAN on UCF-101, a smaller dataset of 13,320 videos of human actions, and the model produced samples with a state-of-the-art Inception Score of 32.97. In another test, the DVD-GAN model with some modifications eclipsed prior work on frame-conditional prediction for Kinetics-600. DeepMind also established a new benchmark test for generative video modeling on Kinetics-600, with DVD-GAN results as a strong baseline.
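The Inception Score used in that evaluation rewards samples whose per-sample class predictions are confident while the class distribution across samples stays diverse: IS = exp(E[KL(p(y|x) ‖ p(y))]). A minimal NumPy sketch of the metric, assuming class probabilities from a pretrained classifier are already available (the original metric uses the Inception network):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) predicted class probabilities for N samples."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # per-sample KL terms
    return float(np.exp(kl.sum(axis=1).mean()))              # exp of mean KL divergence

# Confident AND diverse predictions score high (max = number of classes)...
confident = np.eye(10)                 # 10 samples, each a one-hot over 10 classes
print(inception_score(confident))      # ≈ 10.0

# ...while uninformative uniform predictions score the minimum of 1.
uniform = np.full((10, 10), 0.1)
print(inception_score(uniform))        # ≈ 1.0
```

So DVD-GAN's 32.97 on the 101-class UCF-101 sits well below the theoretical ceiling of 101, leaving clear room for improvement even at state of the art.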
DeepMind researchers admit in their paper that much work remains to be done before realistic videos can be consistently generated in an unconstrained setting, but they believe DVD-GAN is heading in the right direction.
One takeaway from the model’s imperfect results might be that we should not underestimate the speed with which an AI technique can improve. It took GANs only 4.5 years to progress from monochrome, blurry human face generation to high-fidelity portraits that fool even discerning human viewers. Maybe in a couple of years a wave of similarly realistic and highly convincing video clips will flood the Internet.
Read the paper Efficient Video Generation on Complex Datasets on arXiv.
Journalist: Tony Peng | Editor: Michael Sarazen