The realistic, high-resolution human portraits generated by AI systems in recent years have wowed the Internet. Speech-driven portrait animation techniques designed to bring these faces to life have emerged as a new focus in the research community, but existing approaches still struggle with natural pose variation, emotion control, and fine-grained manipulation of facial details such as landmarks.
In the new paper SPACEx: Speech-driven Portrait Animation with Controllable Expression, an NVIDIA research team introduces SPACEx, a speech-driven portrait animation framework with controllable emotional expression. The novel approach generates high-resolution, expressive videos with controllable subject pose, emotion, and expression intensity, and achieves state-of-the-art performance for speech-driven portrait image animation.
The team summarizes their main contributions as follows:
- We achieve state-of-the-art quality for speech-driven portrait image animation. SPACEx provides better quality in terms of FID and landmark distances compared to previous methods while also generating higher-resolution output videos.
- Our method can produce realistic head poses while also being able to transfer poses from another video. It also offers increased controllability by utilizing facial landmarks as an intermediate stage, allowing manipulations such as blinking, eye gaze control, etc.
- For the same set of inputs, our method allows for the manipulation of the emotion labels and their corresponding intensities in the output video.
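One reason landmarks make a convenient intermediate representation is that edits like blinking or gaze changes reduce to moving a few points before synthesis. The sketch below assumes the common iBUG 68-point landmark layout (eyes at indices 36-47); whether SPACEx uses this exact layout is an assumption, not something stated in the article.

```python
import numpy as np

# iBUG 68-point layout: left eye = points 36-41, right eye = 42-47.
# Upper-lid points are 37,38 (left) and 43,44 (right); the lower-lid
# points directly beneath them are 41,40 and 47,46 respectively.
UPPER = [37, 38, 43, 44]
LOWER = [41, 40, 47, 46]

def blink(landmarks, amount=1.0):
    """Move upper-eyelid points toward the lower lid; amount=1.0 closes the eye."""
    out = landmarks.copy()
    out[UPPER, 1] += amount * (landmarks[LOWER, 1] - landmarks[UPPER, 1])
    return out

face = np.zeros((68, 2))
face[UPPER, 1] = 0.40   # upper lid sits above...
face[LOWER, 1] = 0.45   # ...the lower lid (y grows downward in image coords)
closed = blink(face, amount=1.0)
```

An animation pipeline could apply such an edit on a chosen range of frames to insert a blink without retraining anything downstream.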
SPACEx takes a speech clip and a single face image as input, and users can further condition the output video with an emotion label (happy, sad, surprise, fear). The framework decomposes the task into three stages to improve interpretability and fine-grained controllability. In the first stage, Speech2Landmarks, the model predicts facial landmark motions in a normalized space based on the input image, speech, and emotion label. In the second stage, Landmarks2Latents, the per-frame posed facial landmarks are translated into latent keypoints. In the final synthesis stage, given the input image and the per-frame latent keypoints, a pretrained image-based facial animation model, face-vid2vid, outputs an animated 512×512 resolution video.
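The three-stage decomposition can be sketched as a chain of functions. The code below is a minimal NumPy illustration of the data flow only: all shapes, feature encodings, and the stand-in computations inside each stage are assumptions for the sketch, not the paper's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_LMK, N_KP = 8, 68, 20          # frames, 2-D landmarks, latent keypoints

def speech2landmarks(speech, init_landmarks, emotion):
    """Stage 1 (stand-in): per-frame normalized facial landmarks from speech,
    the initial landmarks, and an emotion label with intensity."""
    offsets = 0.01 * speech[:, None, None] * (1.0 + emotion["intensity"])
    return init_landmarks[None] + offsets            # (T, 68, 2)

def landmarks2latents(landmarks, image_feat):
    """Stage 2 (stand-in): posed landmarks -> per-frame latent keypoints."""
    W = rng.standard_normal((N_LMK * 2, N_KP)) / N_LMK
    return landmarks.reshape(len(landmarks), -1) @ W + image_feat

def synthesize(latents):
    """Stage 3 (stand-in): the pretrained face-vid2vid generator would render
    one 512x512 frame per set of latent keypoints."""
    return np.zeros((len(latents), 512, 512, 3))

speech = rng.standard_normal(T)                      # one scalar feature per frame
init_lmk = rng.standard_normal((N_LMK, 2))
emotion = {"label": "happy", "intensity": 0.5}

lmk = speech2landmarks(speech, init_lmk, emotion)
lat = landmarks2latents(lmk, rng.standard_normal(N_KP))
video = synthesize(lat)
```

Splitting the problem this way is what gives the method its controllability: the intermediate landmarks from stage 1 can be edited or replaced (e.g. with poses transferred from another video) before the later stages ever run.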
The team introduces a novel emotion conditioning technique that uses feature-wise linear modulation layers (FiLM, Perez et al., 2017) to enable controllable emotional expressions and intensities in the generated video. In the Speech2Landmarks network, FiLM modulates the audio features and the initial landmark input; in the Landmarks2Latents network, FiLM is applied to the audio, landmark, and initial latent keypoint inputs.
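The FiLM mechanism itself is simple: a conditioning vector predicts a per-feature scale and shift that are applied to the features being modulated. The NumPy sketch below follows that general recipe; the dimensions and the intensity-scaled one-hot emotion encoding are illustrative assumptions, not SPACEx's architecture.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation (Perez et al.): y = gamma(c) * x + beta(c),
    where scale gamma and shift beta are predicted from a conditioning vector c."""
    def __init__(self, cond_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # linear maps from the conditioning vector to per-feature scale/shift
        self.W_gamma = 0.1 * rng.standard_normal((cond_dim, feat_dim))
        self.W_beta = 0.1 * rng.standard_normal((cond_dim, feat_dim))

    def __call__(self, x, cond):
        gamma = 1.0 + cond @ self.W_gamma   # scale, centered on identity
        beta = cond @ self.W_beta           # shift
        return gamma * x + beta

# Hypothetical usage: an emotion one-hot scaled by an intensity value
# modulates per-frame audio features.
audio_feats = np.ones((16, 64))                   # (frames, feature dim)
emotion = np.array([1.0, 0.0, 0.0, 0.0]) * 0.7    # "happy" at intensity 0.7
film = FiLM(cond_dim=4, feat_dim=64)
modulated = film(audio_feats, emotion)
```

Centering the scale on identity means a zero conditioning vector (intensity 0) leaves the features untouched, which is one natural way an intensity knob can fade an emotion in and out.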
In their empirical study, the team compared SPACEx to Wav2Lip, MakeItTalk, and Talking Face-PC-AVS (PC-AVS) baselines on various talking-head video generation tasks. In the evaluations, SPACEx achieved the best performance in terms of FID and landmark distances while generating higher resolution output videos with superior facial expression controllability.
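Landmark distance, one of the reported metrics, is typically a mean Euclidean distance between predicted and ground-truth landmark positions, averaged over points and frames. The sketch below computes that common form; the paper's exact normalization is an assumption here.

```python
import numpy as np

def landmark_distance(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth landmarks,
    averaged over all points and frames (a common form of this metric)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

gt = np.zeros((10, 68, 2))               # (frames, points, xy)
pred = gt + np.array([3.0, 4.0])         # every point off by a 3-4-5 offset
dist = landmark_distance(pred, gt)
```

Because the error here is the same 3-4-5 offset at every point, the metric evaluates to exactly 5.0, which makes the sketch easy to sanity-check.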
The team believes SPACEx’s capabilities could open promising new avenues in video conferencing, gaming, and media synthesis.
Sample videos and additional information are available on the project website: https://deepimagination.cc/SPACEx/. The paper SPACEx: Speech-driven Portrait Animation with Controllable Expression is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang