
Adobe & UCL’s Pix2Video: Text-Guided Video Editing via Image Diffusion Without Preprocessing or Finetuning

In the new paper Pix2Video: Video Editing Using Image Diffusion, an Adobe Research and University College London team presents Pix2Video, a framework for realistic text-guided video editing using a pretrained image diffusion model.

Image diffusion models have emerged as a game-changing method for generating high-quality images that can be edited via natural language text prompts. However, applying these models directly to video editing produces inconsistent results, as they struggle to preserve the source video's content and temporal coherence across frames.

A research team from Adobe Research and University College London addresses these issues in the new paper Pix2Video: Video Editing Using Image Diffusion, introducing Pix2Video, a novel framework that employs a pretrained image diffusion model to enable faithful and realistic text-guided video editing without additional training.

The Pix2Video process comprises two simple steps: 1) A pretrained structure-guided image diffusion model performs text-guided edits on an anchor frame; 2) These edits are then progressively propagated to subsequent frames via self-attention feature injection, in which the self-attention layers perform cross-frame attention to keep the generated images consistent.
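To make the propagation step concrete, here is a minimal PyTorch sketch of cross-frame attention, assuming a standard self-attention block; the function and tensor names are illustrative, not the authors' code. The only change from ordinary self-attention is that the keys and values are computed from features injected from the reference (anchor) frame rather than from the current frame itself.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q_cur, feats_ref, to_k, to_v):
    # q_cur:     queries from the current frame, shape (B, N, C)
    # feats_ref: self-attention input features injected from the
    #            reference (anchor) frame, shape (B, N, C)
    # to_k/to_v: the pretrained key/value projection layers, reused as-is
    k = to_k(feats_ref)
    v = to_v(feats_ref)
    scale = q_cur.shape[-1] ** -0.5
    # The current frame attends to the anchor's features, so the
    # generated content stays consistent with the anchor's edit.
    attn = F.softmax(q_cur @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```

In practice this substitution happens inside the diffusion UNet's self-attention layers at each denoising step, with no retraining of the pretrained weights.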

Given a sequence of frames from a video clip, Pix2Video first inverts each frame with denoising diffusion implicit model (DDIM) inversion and uses the resulting latent as the initial noise for the denoising process. A reference frame is then selected, and its self-attention features are injected into the UNet when editing each frame. At each diffusion step, the input features to the self-attention module are projected into queries, keys, and values, enabling Pix2Video to capture global information, and the latent of the current frame is updated under the guidance of the reference frame.
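Putting the pieces together, the overall procedure can be sketched as below; `pipe`, `ddim_invert`, and `denoise` are hypothetical placeholders for the pretrained structure-guided diffusion model and its inversion and denoising routines, not a real library API.

```python
import torch

@torch.no_grad()
def pix2video_edit(frames, prompt, pipe, steps=50):
    # DDIM-invert every frame; each inverted latent serves as the
    # initial noise for that frame's denoising process.
    latents = [pipe.ddim_invert(f) for f in frames]

    edited, ref_feats = [], None
    for i, z in enumerate(latents):
        # Denoise the current frame's latent while injecting the
        # reference frame's self-attention features at each step
        # (cross-frame attention, as sketched above). The current
        # latent is thereby updated under the reference's guidance.
        image, feats = pipe.denoise(z, prompt, steps, ref_features=ref_feats)
        if i == 0:
            ref_feats = feats  # the first (anchor) frame is the reference
        edited.append(image)
    return edited
```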

In their empirical study, the team compared Pix2Video with state-of-the-art image and video editing approaches such as Text2LIVE and SDEdit on the DAVIS video object segmentation dataset. In the evaluations, Pix2Video performed on par with or better than the baselines, demonstrating its ability to edit videos containing either a single clear foreground object or multiple foreground objects, to better preserve the structure of the input video, and to maintain a good balance between edit quality and temporal consistency, all without additional training.

Additional model demos can be found on the project’s GitHub. The paper Pix2Video: Video Editing Using Image Diffusion is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
