There are now dozens of AI-powered text-to-image models on the Internet, with DALL·E Mini alone generating more than 50,000 images daily from users’ natural language prompts. The next challenging step for such generative AI models is text-to-video, which brings the potential for creating animated scenes based on users’ storytelling inputs. While current text-to-video approaches guided by OpenAI’s CLIP network can translate text into highly realistic imagery, they are slow, requiring anywhere from 17 seconds to five minutes to generate a single frame of video.
In the new paper Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization, researchers from Carnegie Mellon University leverage CLIP-guided, pixel-level optimization to generate videos at 720p resolution from natural language descriptions at a rate of one to two frames per second, taking a big step toward a real-time text-to-video system.

Existing CLIP-guided text-to-video approaches generate their highly realistic imagery by optimizing through large pretrained image generators such as diffusion models, a process that is both time-consuming and computationally heavy.
The CMU team instead employs a novel two-step strategy to approach real-time text-to-video generation: 1) generating noisy semantic content at high speed; and 2) refining the generated image textures in a post-processing step.

The proposed approach generates each frame sequentially while iterating through the input language to guide the content. CLIP-guided optimization compares each frame against its language description and evolves the frame’s pixels toward consistency with that content. A trained CycleGAN model then smooths and denoises the generated images.
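As an illustration of the general technique named in the paper title (not the authors’ implementation), the minimal PyTorch sketch below treats a frame’s raw pixels as learnable parameters and optimizes them so their CLIP image embedding matches the CLIP text embedding of the prompt. It assumes OpenAI’s clip package; the prompt, optimizer settings, and step budget are illustrative, and the paper’s fast initialization and CycleGAN post-processing are omitted.

```python
# Minimal sketch of CLIP-guided, pixel-level optimization for a single frame.
# Assumptions: PyTorch + OpenAI's "clip" package; hyperparameters are illustrative.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for a simple optimization loop
model.eval()

prompt = "a red sailboat drifting on a calm lake at sunset"  # example prompt
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimize raw pixels directly (pixel-level), e.g. from noise or the previous frame.
frame = torch.rand(1, 3, 720, 1280, device=device, requires_grad=True)
optimizer = torch.optim.Adam([frame], lr=0.05)

# CLIP ViT-B/32 input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(100):  # small per-frame optimization budget
    optimizer.zero_grad()
    img = frame.clamp(0, 1)
    img = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
    img = (img - mean) / std
    img_feat = model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # maximize cosine similarity with the text
    loss.backward()
    optimizer.step()

noisy_frame = frame.detach().clamp(0, 1)
# In the paper's pipeline, a trained CycleGAN generator would now smooth and
# denoise `noisy_frame` as the post-processing step.
```

Optimizing pixels directly, rather than optimizing through a large pretrained image generator, is what keeps each frame fast; the trade-off is noisier textures, which the post-processing step is meant to clean up.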


In their empirical study, the team demonstrated their approach’s ability to generate realistic videos at up to 720p resolution, 20 to 300 times faster than existing methods.
The code and sample videos are available on the team’s website. In future work, the researchers plan to add priors to enable smoother motion in the videos and improve user control over their style and appearance.
The paper Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
