NLPR, SenseTime & NTU Accelerate Automatic Video Portrait Editing

Synced

6 years ago

Video portrait editing techniques are already finding applications in TV, video and filmmaking, and are expected to play a key role in evolving telepresence scenarios. State-of-the-art methods can already realistically synthesize video from audio when both come from the same source. Now, researchers from Beijing’s National Laboratory of Pattern Recognition (NLPR), SenseTime Research, and Nanyang Technological University have taken the technology one step further with a new framework that enables arbitrary audio-to-video translation.

In developing the project, researchers faced a number of challenges:

How to perform direct mapping from audio to video without source video
How to generalize facial expressions among different speakers on the same audio clip
How to maintain video background integrity and clarity against occlusions etc. caused by speakers’ head movement

To increase the realism of their synthesized videos, the researchers combined a number of different models and networks. On the video side, they applied a parametric 3D face model to extract face geometry, pose, and expression parameters from each portrait frame. On the audio side, they used an audio-to-expression translation network to identify specific audio features and match them with facial expressions.

The researchers also designed an audio ID-removing network to reduce speaker-specific differences in the audio, so that the same audio can drive different portraits. The source and target parameters were then combined to reconstruct 3D facial meshes, creating a masked portrait. Lastly, the researchers applied a neural video rendering network to produce clear and uninterrupted background scenes.
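To make the parameter-swapping idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes a generic linear parametric face model (mesh = mean shape + geometry offsets + expression offsets), reduces pose to a simple 3D translation, and uses a hand-written list in place of the output of the audio-to-expression network. All names (`FaceParams`, `retarget`, `reconstruct_mesh`) are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class FaceParams:
    geometry: Vector    # identity/shape coefficients, kept from the target frame
    expression: Vector  # expression coefficients, replaced by audio-driven ones
    pose: Vector        # simplified here to an (x, y, z) translation


def linear_combination(basis: List[Vector], coeffs: Vector, size: int) -> Vector:
    """Return sum_i coeffs[i] * basis[i] as a flat vector of length `size`."""
    out = [0.0] * size
    for coeff, direction in zip(coeffs, basis):
        for j in range(size):
            out[j] += coeff * direction[j]
    return out


def reconstruct_mesh(params: FaceParams, mean_shape: Vector,
                     geometry_basis: List[Vector],
                     expression_basis: List[Vector]) -> Vector:
    """Rebuild flat [x0, y0, z0, x1, ...] vertex coordinates from parameters."""
    size = len(mean_shape)
    geo = linear_combination(geometry_basis, params.geometry, size)
    exp = linear_combination(expression_basis, params.expression, size)
    shape = [m + g + e for m, g, e in zip(mean_shape, geo, exp)]
    # Apply the (simplified) pose: translate every vertex by the same offset.
    return [v + params.pose[i % 3] for i, v in enumerate(shape)]


def retarget(target: FaceParams, audio_expression: Vector) -> FaceParams:
    """Keep the target's geometry and pose; swap in audio-driven expression."""
    return replace(target, expression=list(audio_expression))


# Toy model with a single vertex and one basis vector per parameter space.
mean_shape = [0.0, 0.0, 0.0]
geometry_basis = [[1.0, 0.0, 0.0]]
expression_basis = [[0.0, 1.0, 0.0]]

target = FaceParams(geometry=[2.0], expression=[0.5], pose=[0.0, 0.0, 10.0])
edited = retarget(target, audio_expression=[3.0])  # stand-in for network output
mesh = reconstruct_mesh(edited, mean_shape, geometry_basis, expression_basis)
print(mesh)
```

In this toy setup only the expression coefficient changes, so the edited mesh keeps the target's shape and position while its expression follows the new input. A real system would use a full rotation and projection for pose, and a learned renderer to turn the mesh into video frames.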

*Audio-to-expression network architecture*

In one-to-many and many-to-one translation tests, the proposed system generalized well, producing significantly more natural appearance and movements than state-of-the-art methods.

*Comparison with four major state-of-the-art methods.*

The first author of this paper is Linsen Song, a former SenseTime intern and a graduate student supervised by NLPR researcher Ran He. A video demonstration and interpretation of the synthesized results can be viewed on the project page.

The associated paper, *Everybody’s Talkin’: Let Me Talk as You Want*, is on arXiv.

Author: Reina Qi Wan | Editor: Michael Sarazen
