Text-to-image diffusion models, powered by personalization techniques such as DreamBooth and LoRA, have made impressive progress in generating high-quality static images guided by text at affordable cost. Given their broad applications, researchers and practitioners have begun to ponder the possibility of extending this capability to the creation of animated images.
Addressing this curiosity, a collaborative research team from Shanghai AI Laboratory, The Chinese University of Hong Kong and Stanford University introduces AnimateDiff in the new paper AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning. AnimateDiff is a general and practical framework capable of generating animations from any personalized text-to-image (T2I) model, without the need for extra training or model-specific tuning.


The goal of this work is to transform a T2I model into an animation generator, preserving its original domain knowledge and quality with little or no extra training cost.

The AnimateDiff pipeline works as follows: given a base T2I model, AnimateDiff first trains a motion modeling module on video datasets. In this stage, only the parameters of the motion module are updated, so the features of the base T2I model are preserved. At inference time, the trained motion module is inserted into any personalized model tuned from the same base T2I model, transforming it into the target animation generator, which then produces diverse and personalized animations.
Notably, because the generalizable motion modeling module is trained separately and all pre-trained weights are preserved, no further model-specific tuning is needed, saving a large amount of tuning cost.
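To make this two-stage idea concrete, below is a minimal training-loop sketch. It assumes a frozen base T2I U-Net, a trainable motion module, and a loader of video latents paired with text embeddings; the function names, call signatures, and noise schedule are illustrative placeholders under those assumptions, not the official AnimateDiff code.

```python
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, num_steps=1000):
    """Toy linear-schedule forward diffusion q(x_t | x_0), for illustration only."""
    alpha_bar = 1.0 - t.float() / num_steps
    alpha_bar = alpha_bar.view(-1, 1, 1, 1, 1)           # broadcast over (B, C, F, H, W)
    return alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise

def train_motion_module(base_unet, motion_module, video_loader, lr=1e-4):
    """Stage 1: fit the motion module on video data while the base T2I weights stay frozen.
    `base_unet`, `motion_module`, and `video_loader` are assumed inputs, not a released API."""
    for p in base_unet.parameters():
        p.requires_grad_(False)                          # preserve the base T2I features
    optim = torch.optim.AdamW(motion_module.parameters(), lr=lr)
    for video_latents, text_emb in video_loader:         # latents shaped (B, C, F, H, W)
        noise = torch.randn_like(video_latents)
        t = torch.randint(0, 1000, (video_latents.shape[0],))
        pred = base_unet(add_noise(video_latents, noise, t), t, text_emb,
                         motion=motion_module)           # hypothetical call signature
        loss = F.mse_loss(pred, noise)                   # standard noise-prediction loss
        optim.zero_grad()
        loss.backward()
        optim.step()
    return motion_module
```

At inference, the same trained motion module would simply be attached to a personalized derivative of the base model (e.g., a DreamBooth or LoRA checkpoint) without any further optimization.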
For the structure of the motion modeling module, the team adopts a vanilla temporal transformer, which better captures the temporal dependencies between features at the same spatial location. They insert the motion module at every resolution level of the U-shaped diffusion network, substantially enlarging its temporal receptive field. Additionally, they add sinusoidal position encoding to the self-attention blocks so the network is aware of the temporal location of the current frame.
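A minimal sketch of what such a temporal self-attention block might look like is shown below, written in PyTorch; the class name, tensor shapes, and hyper-parameters are assumptions for illustration and are not taken from the released implementation.

```python
import math
import torch
import torch.nn as nn

class TemporalTransformerBlock(nn.Module):
    """Illustrative temporal self-attention block: attends across frames at each
    spatial location, with sinusoidal encodings marking each frame's position."""
    def __init__(self, dim, num_heads=8, max_frames=32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Precompute sinusoidal position encodings along the temporal axis.
        pos = torch.arange(max_frames).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_frames, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_enc", pe)

    def forward(self, x):
        # x: (batch, channels, frames, height, width) video feature map
        b, c, f, h, w = x.shape
        # Fold spatial positions into the batch so attention runs only along time.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
        seq = seq + self.pos_enc[:f]                     # frame-index awareness
        out, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        seq = seq + out                                  # residual connection
        return seq.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)
```

Because the spatial dimensions are folded into the batch, attention is computed only between features at the same spatial location across time, and the sinusoidal encodings tell each token which frame it belongs to.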

In their empirical study, the team compared AnimateDiff with the Text2Video-Zero baseline. AnimateDiff generates consistent content while maintaining high image quality.
Overall, this work validates AnimateDiff as a simple yet effective framework for personalized animation, with great potential for broad animation applications.
Code and pre-trained weights will be publicly available on the project page. The paper AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning is on arXiv.
Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.