Motion is one of the most conspicuous visual cues in the natural world, and humans possess a remarkable sensitivity to it. While humans effortlessly perceive motion, training a model to learn realistic scene motion presents substantial challenges due to the intricacies involved in measuring and capturing physical properties on a large scale.
Fortunately, recent advancements in generative models, particularly conditional diffusion models, have ushered in a new era of modeling highly intricate and diverse distributions of real images based on text input. Moreover, recent research indicates that extending this modeling to other domains, such as videos and 3D geometry, holds significant potential for downstream applications.
In a paper titled “Generative Image Dynamics,” a Google research team introduces an approach to model natural oscillation dynamics from a single static image. The method yields photo-realistic animations from a single image, surpassing previous methods by a substantial margin, and opens the door to further applications such as interactive animations.
The model is trained on motion trajectories automatically extracted from a large collection of real video sequences. Conditioned on an input image, it predicts a neural stochastic motion texture: a set of motion-basis coefficients that characterize each pixel’s trajectory into the future.
This prediction occurs via a diffusion model, which generates coefficients one frequency at a time while coordinating these predictions across frequency bands. The resulting frequency-space textures can then be converted into dense, long-range pixel motion trajectories. Together with an image-based rendering diffusion model, these trajectories can be employed to synthesize future frames, effectively transforming static images into lifelike animations.
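The conversion from frequency-space textures to dense pixel trajectories can be illustrated with a short sketch. This is not the authors’ code: it assumes a Fourier motion basis and toy shapes, with randomly drawn coefficients standing in for the diffusion model’s per-frequency predictions, purely to show how an inverse FFT along the time axis yields a per-pixel displacement trajectory.

```python
import numpy as np

# Toy shapes (illustrative only): a motion texture S holds K complex
# frequency coefficients per pixel, for x- and y-displacement.
H, W, K = 4, 4, 16          # image height/width and number of frequency bands
T = 2 * K                    # number of output video frames

rng = np.random.default_rng(0)
# In the paper these coefficients come from a diffusion model, predicted
# one frequency band at a time; here random values stand in for them.
S = rng.normal(size=(H, W, K, 2)) + 1j * rng.normal(size=(H, W, K, 2))

# Embed the K predicted low-frequency coefficients into a length-T spectrum.
spectrum = np.zeros((H, W, T, 2), dtype=complex)
spectrum[:, :, :K, :] = S

# Inverse FFT along the time axis converts frequency-space coefficients
# into a dense, long-range displacement trajectory for every pixel.
trajectories = np.fft.ifft(spectrum, axis=2).real  # shape (H, W, T, 2)

print(trajectories.shape)  # (4, 4, 32, 2)
```

Each `(x, y)` entry of `trajectories` gives a pixel’s displacement at one future frame; an image-based rendering model then warps the input image along these trajectories to synthesize frames.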
In contrast to prior models relying solely on raw RGB pixel data, this motion-based representation captures a more fundamental, lower-dimensional underlying structure that efficiently accounts for variations in pixel values. Consequently, the proposed approach results in more coherent, long-term generation and affords finer control over animations when compared to earlier methods that perform image animation through raw video synthesis.
In their empirical study, the research team compared their approach against several recent single-image animation and video prediction methods. The results demonstrate that the proposed approach outperforms prior single-image animation baselines in both image and video synthesis quality.
Overall, this work represents a highly promising breakthrough, capable of generating photo-realistic animations from a single static image while significantly surpassing the performance of previous baseline methods.
The paper Generative Image Dynamics is available on arXiv.
Author: Hecate He | Editor: Chain Zhang