
Google & HUJI Present Dreamix: The First Diffusion Model for General Video Editing


Diffusion models such as Stable Diffusion, which progressively corrupt data with random noise and then learn to reverse the process to generate new samples, have achieved state-of-the-art performance in realistic text-driven image and video generation. Such models, however, focus on synthesis rather than editing. While a number of intuitive text-based approaches to image editing have been proposed in recent papers, methods for video editing remain relatively underexplored.
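To make the underlying mechanism concrete, the minimal sketch below shows the closed-form forward noising step that such models learn to invert; the schedule value alpha_bar_t is illustrative and not taken from any particular model.

```python
import torch

def diffuse(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I).
    Smaller alpha_bar_t (later timesteps) destroys more of the original signal."""
    eps = torch.randn_like(x0)
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * eps

# Example: a batch of 4 RGB "images" at 64x64, noised at two schedule points.
x0 = torch.rand(4, 3, 64, 64)
x_mid = diffuse(x0, alpha_bar_t=0.5)    # partially corrupted
x_late = diffuse(x0, alpha_bar_t=0.05)  # nearly pure noise
```

The generative model is trained to undo this corruption step by step, which is what lets a text prompt steer what gets synthesized from the noise.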

In the new paper Dreamix: Video Diffusion Models Are General Video Editors, a team from Google Research and the Hebrew University of Jerusalem presents Dreamix, a novel approach that leverages a video diffusion model (VDM) to enable text- and image-based motion and appearance video editing.

The team summarizes their main contributions as follows:

  1. Proposing the first method for general text-based appearance and motion editing of real-world videos.
  2. Proposing a novel mixed finetuning model that significantly improves the quality of motion edits (see the sketch after this list).
  3. Presenting a new framework for text-guided image animation, by applying our video editor method on top of simple image preprocessing operations.
  4. Demonstrating subject-driven video generation from a collection of images, leveraging our novel finetuning method.
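The mixed finetuning in contribution 2 is sketched below under the assumption that it alternates between the ordered input clip and an unordered set of its frames, with temporal attention disabled for the latter so that appearance is preserved while motion stays free to follow the prompt. The batch format and the temporal_attention flag are illustrative, not the authors' API.

```python
import torch

def make_mixed_batch(clip: torch.Tensor, p_frames: float = 0.5):
    """Build one finetuning example under an (assumed) mixed objective.
    clip: (num_frames, C, H, W). With probability p_frames, return the frames
    in random order with temporal attention disabled; otherwise return the
    ordered clip with temporal attention enabled."""
    if torch.rand(()).item() < p_frames:
        shuffled = clip[torch.randperm(clip.shape[0])]
        return shuffled, {"temporal_attention": False}  # unordered frames: appearance only
    return clip, {"temporal_attention": True}           # ordered clip: appearance + motion

# Example with a dummy 16-frame RGB clip; the batch would feed the VDM's
# standard denoising loss during finetuning (model call not shown).
clip = torch.rand(16, 3, 64, 64)
batch, attn_flags = make_mixed_batch(clip)
```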

The proposed Dreamix is a video diffusion approach that enables appearance and motion editing from text prompts while maintaining temporal smoothness. Dreamix first corrupts the input video via downsampling and the addition of Gaussian noise. The VDM then uses the low-resolution details preserved in the degraded input to synthesize new high spatiotemporal-resolution content guided by the text prompt, and upscales the video to its final resolution.
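A minimal sketch of this corruption stage is shown below; the function name, downsampling factor, and noise level are illustrative choices, and the subsequent text-guided denoising by the finetuned VDM is only indicated in a comment.

```python
import torch
import torch.nn.functional as F

def corrupt_video(video: torch.Tensor, scale: int = 4, alpha_bar_t: float = 0.3) -> torch.Tensor:
    """Degrade an input clip as in the pipeline described above: spatially
    downsample, then mix in Gaussian noise at an intermediate diffusion level.
    video: (frames, C, H, W) with values in [0, 1]."""
    low_res = F.interpolate(video, scale_factor=1.0 / scale,
                            mode="bilinear", align_corners=False)
    eps = torch.randn_like(low_res)
    return (alpha_bar_t ** 0.5) * low_res + ((1.0 - alpha_bar_t) ** 0.5) * eps

video = torch.rand(16, 3, 256, 256)
degraded = corrupt_video(video)
# The degraded clip would then be denoised and super-resolved by the finetuned
# VDM under the text prompt, e.g.:
# edited = vdm.sample(init=degraded, prompt="a dancing bear")  # hypothetical API
```

Because only coarse spatiotemporal structure survives the corruption, the model is free to repaint appearance and motion according to the prompt while staying anchored to the original footage.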

The team also applies Dreamix to general, text-conditioned image animation via a novel framework that turns an input image into a video whose motion and appearance are guided by a text prompt.
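As a rough illustration of how a still image could be turned into a coarse clip for the video editor to refine, the sketch below slides a crop window across the image to fake a camera pan. The function name and the pan transform are placeholders; the paper's exact "simple image preprocessing operations" are not reproduced here.

```python
import torch

def image_to_coarse_clip(image: torch.Tensor, num_frames: int = 16,
                         pan_pixels: int = 32) -> torch.Tensor:
    """Turn a single still image into a crude clip by sliding a crop window
    horizontally across it (a stand-in for simple image preprocessing).
    image: (C, H, W); returns (num_frames, C, H, W - pan_pixels)."""
    c, h, w = image.shape
    frames = []
    for i in range(num_frames):
        offset = round(i * pan_pixels / max(num_frames - 1, 1))
        frames.append(image[:, :, offset : offset + (w - pan_pixels)])
    return torch.stack(frames)

# Example: a 256x256 "photo" becomes a 16-frame clip with a slow horizontal pan,
# which the finetuned VDM would then edit under a text prompt.
photo = torch.rand(3, 256, 256)
coarse_clip = image_to_coarse_clip(photo)
```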

In their empirical study, the team compared Dreamix’s text-conditioned editing, image animation and subject-driven video generation performance against state-of-the-art Text-to-Video and Plug-and-Play (PnP) baselines on a dataset comprising 29 YouTube videos and 127 varied text prompts. Dreamix outperformed both baselines in the human-rated evaluations, displaying “unprecedented” video editing and image animation abilities.

Demo videos are available on the project’s GitHub. The paper Dreamix: Video Diffusion Models Are General Video Editors is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


