Video portrait editing techniques are already finding applications in TV, video and filmmaking, and are expected to play a key role in evolving telepresence scenarios. State-of-the-art methods can already realistically synthesize a portrait video from audio spoken by the same person who appears in the footage. Now, researchers from Beijing's National Laboratory of Pattern Recognition (NLPR), SenseTime Research, and Nanyang Technological University have taken the tech one step further with a new framework that enables fully arbitrary audio-to-video translation: any audio clip can drive any target portrait video.
In developing the project, researchers faced a number of challenges:
- How to map directly from audio to video without a source video
- How to generalize facial expressions across different speakers given the same audio clip
- How to maintain video background integrity and clarity despite occlusions caused by the speakers' head movements

To increase the realism of their synthesized videos, the researchers combined several models and networks. On the video side, they applied a parametric 3D face model to extract face geometry, pose, and expression parameters from each portrait frame. On the audio side, they used an audio-to-expression translation network to extract audio features and map them to facial expression parameters.
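The paper itself includes no code, but the audio-to-expression step can be illustrated with a small sketch. Below is a minimal PyTorch example that regresses expression coefficients from a window of audio features; the class name, MFCC input, dimensions, and LSTM architecture are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of an audio-to-expression translation network.
# Assumptions (not from the paper): MFCC inputs, a 2-layer LSTM encoder,
# and a 64-dimensional 3DMM expression coefficient vector.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    def __init__(self, n_mfcc=28, hidden=256, n_exp=64):
        super().__init__()
        # Recurrent encoder over a short window of audio frames.
        self.encoder = nn.LSTM(n_mfcc, hidden, num_layers=2, batch_first=True)
        # Regression head mapping the final hidden state to expression coefficients.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_exp),
        )

    def forward(self, mfcc):            # mfcc: (batch, frames, n_mfcc)
        out, _ = self.encoder(mfcc)
        return self.head(out[:, -1])    # (batch, n_exp) expression coefficients

# Example: predict expression coefficients for one 16-frame audio window.
model = AudioToExpression()
window = torch.randn(1, 16, 28)         # dummy MFCC features
expression = model(window)              # shape: (1, 64)
```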
The researchers also designed an audio ID-removing network that strips speaker-specific identity information from the audio features, so the translation generalizes across different speakers. The predicted expression parameters are then combined with the target portrait's geometry and pose parameters to reconstruct a 3D facial mesh, which is rendered as a masked portrait. Lastly, the researchers applied a neural video rendering network to produce the final frames with clear and uninterrupted background scenes.
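To make the parameter-swapping step concrete, here is a minimal NumPy sketch of a linear 3D morphable face model (3DMM), in which vertices are a mean shape plus identity and expression blendshape offsets. The basis sizes are illustrative assumptions, and the random arrays stand in for a real model's learned bases.

```python
# Sketch of rebuilding 3DMM face vertices with audio-driven expression
# coefficients swapped in. V = mean + B_id @ alpha + B_exp @ beta is the
# standard linear blendshape formulation; dimensions below are assumed.
import numpy as np

N_VERTS, N_ID, N_EXP = 35709, 80, 64    # assumed model dimensions

rng = np.random.default_rng(0)
mean_shape = rng.standard_normal(N_VERTS * 3)
basis_id = rng.standard_normal((N_VERTS * 3, N_ID))    # identity basis
basis_exp = rng.standard_normal((N_VERTS * 3, N_EXP))  # expression basis

def reconstruct_mesh(alpha, beta):
    """Linear 3DMM: identity coefficients alpha, expression coefficients beta."""
    verts = mean_shape + basis_id @ alpha + basis_exp @ beta
    return verts.reshape(-1, 3)

# Keep the target portrait's identity (alpha), but replace its expression
# with the coefficients predicted from the driving audio.
alpha_target = rng.standard_normal(N_ID)
beta_audio = rng.standard_normal(N_EXP)
mesh = reconstruct_mesh(alpha_target, beta_audio)      # (N_VERTS, 3) vertices
```

Keeping the target portrait's identity and pose while swapping in audio-driven expression coefficients is what allows the same audio clip to animate any portrait.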

In one-to-many and many-to-one translation tests, the proposed system demonstrated strong generalization, producing significantly more natural appearances and movements than state-of-the-art methods.

The first author of the paper is Linsen Song, a graduate student under the guidance of NLPR researcher Ran He, and a former SenseTime intern. A video demonstration and interpretation of the synthesized results can be viewed on the project page.
The associated paper, Everybody's Talkin': Let Me Talk as You Want, is on arXiv.
Author: Reina Qi Wan | Editor: Michael Sarazen