Image diffusion models have emerged as a powerful method for generating high-quality images that users can edit via natural language text prompts. Applying these models directly to video editing, however, produces inconsistent results: they struggle to preserve source video content and temporal coherence across frames.
A research team from Adobe Research and University College London addresses these issues in the new paper Pix2Video: Video Editing Using Image Diffusion, introducing Pix2Video, a novel framework that employs a pretrained image diffusion model to enable faithful and realistic text-guided video editing without additional training.
The Pix2Video process comprises two steps: 1) a pretrained structure-guided image diffusion model performs text-guided edits on an anchor frame; 2) the changes are then progressively propagated to subsequent frames via a self-attention feature injection approach, in which the self-attention layers perform cross-frame attention to enable consistent image generation.
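The cross-frame attention idea can be illustrated with a minimal sketch. The function below is an assumption-laden simplification, not the paper's implementation: it shows single-head attention over flattened feature maps, where the queries come from the current frame while the keys and values are borrowed from the reference frame, so the current frame is generated by attending to the reference frame's content.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(curr_feats, ref_feats, w_q, w_k, w_v):
    """Hypothetical single-head cross-frame attention.

    curr_feats, ref_feats: (batch, tokens, dim) feature maps.
    Queries are projected from the current frame; keys and values
    from the reference frame, so the current frame's output is a
    blend of the reference frame's content.
    """
    q = curr_feats @ w_q                     # (B, N, d)
    k = ref_feats @ w_k                      # (B, N, d)
    v = ref_feats @ w_v                      # (B, N, d)
    scale = q.shape[-1] ** -0.5              # standard 1/sqrt(d) scaling
    attn = softmax(q @ k.swapaxes(-2, -1) * scale)
    return attn @ v                          # (B, N, d)
```

In the actual model these projections live inside the UNet's self-attention layers; the only change needed for cross-frame attention is swapping the key/value source from the current frame to the reference frame, which is why no retraining is required.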
Given a sequence of frames from a video clip, Pix2Video first inverts each frame with a denoising diffusion implicit model (DDIM) and uses the result as the initial noise for the denoising process. A reference frame is then selected, and its self-attention features are injected into the UNet when editing each frame. At each diffusion step, the input features to the self-attention module are projected into queries, keys, and values, enabling Pix2Video to capture global information, and the latent of the current frame is updated under the guidance of the reference frame.
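The DDIM inversion mentioned above can be sketched as a single deterministic step that runs the DDIM update in reverse, mapping a latent toward noise. This is a generic DDIM inversion formula under standard notation (alpha denotes the cumulative noise schedule), not code from the paper; `eps` stands for the noise predicted by the diffusion UNet, which is assumed to be given here.

```python
import numpy as np

def ddim_inversion_step(x_t, eps, alpha_t, alpha_next):
    """One deterministic DDIM inversion step (latent -> more noise).

    x_t:        current latent at timestep t
    eps:        noise predicted by the (frozen) diffusion UNet at t
    alpha_t:    cumulative alpha-bar at timestep t
    alpha_next: cumulative alpha-bar at the next (noisier) timestep
    """
    # Predicted clean latent implied by the current noise estimate.
    x0 = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    # Step deterministically toward noise: the reverse of DDIM sampling.
    # Iterating this over all timesteps yields the initial noise that,
    # when denoised, approximately reconstructs the original frame.
    return np.sqrt(alpha_next) * x0 + np.sqrt(1.0 - alpha_next) * eps
```

Because the step is deterministic, denoising from the inverted latent closely reproduces the source frame, which is what lets the edit preserve the input video's structure.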
In their empirical study, the team compared Pix2Video with state-of-the-art image and video editing approaches such as Text2Live and SDEdit on the DAVIS video object segmentation dataset. In the evaluations, Pix2Video performed on par with or better than the baselines, demonstrating its ability to edit videos with either a single clear foreground object or multiple foreground objects, better preserve the structure of the input video, and maintain a good balance between edit quality and consistency without additional training.
Author: Hecate He | Editor: Michael Sarazen