
Adobe & UCL’s Pix2Video: Text-Guided Video Editing via Image Diffusion Without Preprocessing or Finetuning

In the new paper Pix2Video: Video Editing Using Image Diffusion, an Adobe Research and University College London team presents Pix2Video, a framework for realistic text-guided video editing using a pretrained image diffusion model.

Image diffusion models have emerged as a game-changing method for generating high-quality images that can be edited via users’ natural language text prompts. However, applying image diffusion models to video editing produces inconsistent results, as they struggle to preserve source video content and maintain temporal coherence across frames.

A research team from Adobe Research and University College London addresses these issues in the new paper Pix2Video: Video Editing Using Image Diffusion, introducing Pix2Video, a novel framework that employs a pretrained image diffusion model to enable faithful and realistic text-guided video editing without additional training.

The Pix2Video process comprises two simple steps: 1) A pretrained structure-guided image diffusion model performs text-guided edits on an anchor frame; 2) The changes are then progressively propagated to subsequent frames via feature injection, in which the model’s self-attention layers attend across frames (cross-frame attention) to enable consistent image generation.
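The cross-frame attention idea in step 2 can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): the queries are computed from the current frame's features, while the keys and values are computed from the anchor frame's features, so the attention output for each frame is assembled from the edited anchor's content. The projection matrices and feature shapes here are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feat_current, feat_anchor, w_q, w_k, w_v):
    """Toy cross-frame self-attention.

    Queries come from the current frame, but keys and values come
    from the anchor frame, so edits applied to the anchor propagate
    to later frames. Shapes: features are (tokens, dim); the w_*
    projection matrices are (dim, dim).
    """
    q = feat_current @ w_q        # queries from the frame being edited
    k = feat_anchor @ w_k         # keys from the (already edited) anchor
    v = feat_anchor @ w_v         # values from the anchor
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

In ordinary self-attention, `feat_current` would supply the keys and values as well; substituting the anchor frame's features is what ties each generated frame back to a single shared appearance.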

Given a sequence of frames from a video clip, Pix2Video first inverts each frame using denoising diffusion implicit model (DDIM) inversion and uses the inverted latent as the initial noise for the denoising process. A reference frame is then selected and its self-attention features are injected into the UNet to edit each frame. At each diffusion step, the input features to the self-attention module are projected into queries, keys, and values, enabling Pix2Video to capture global information, and the latent of the current frame is updated under the guidance of the reference frame.
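The DDIM inversion mentioned above can be illustrated with the standard deterministic DDIM update, written here as a single generic step (a sketch, not the paper's code; the noise prediction `eps` would normally come from the pretrained diffusion UNet, and the `alpha` values from its noise schedule). Because the update is deterministic, the same function maps a latent in either direction between two noise levels, and an inversion step followed by the reverse step recovers the original latent exactly when the same `eps` is reused:

```python
import numpy as np

def ddim_step(x_t, eps, alpha_from, alpha_to):
    """One deterministic DDIM step between noise levels.

    x_t        : latent at cumulative signal level alpha_from
    eps        : the model's noise prediction for x_t (placeholder here)
    alpha_from : cumulative alpha at the current timestep
    alpha_to   : cumulative alpha at the target timestep

    With alpha_to < alpha_from this is an inversion step (adds noise);
    with alpha_to > alpha_from it is a denoising step.
    """
    # Predicted clean latent implied by x_t and eps.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise the prediction to the target level.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps
```

Iterating the inversion direction over the full schedule yields the "initial noise" latent from which denoising, now guided by the edit prompt and the reference frame's injected features, reconstructs an edited version of the frame.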

In their empirical study, the team compared Pix2Video with state-of-the-art image and video editing approaches such as Text2Live and SDEdit on the DAVIS video object segmentation dataset. In the evaluations, Pix2Video performed on par with or better than the baselines, demonstrating its ability to edit videos with either a single clear foreground object or multiple foreground objects, better preserve the structure of the input video, and maintain a good balance between edit quality and temporal consistency without additional training.

Additional model demos can be found on the project’s GitHub. The paper Pix2Video: Video Editing Using Image Diffusion is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

