NLPR, SenseTime & NTU Accelerate Automatic Video Portrait Editing

Video portrait editing techniques are already finding applications in TV, video and filmmaking, and are expected to play a key role in evolving telepresence scenarios. State-of-the-art methods can already synthesize realistic video from audio when the audio and video come from the same source. Now, researchers from Beijing’s National Laboratory of Pattern Recognition (NLPR), SenseTime Research, and Nanyang Technological University have taken the technology one step further with a new framework that enables translation from arbitrary audio to arbitrary video.

In developing the project, researchers faced a number of challenges:

  • How to perform direct mapping from audio to video without source video
  • How to generalize facial expressions among different speakers on the same audio clip
  • How to maintain video background integrity and clarity against occlusions and similar artifacts caused by speakers’ head movement

Figure: System architecture overview

To increase the realism of their synthesized videos, the researchers combined several models and networks. On the video side, they applied a parametric 3D face model to extract face geometry, pose, and expression parameters from each portrait frame. On the audio side, they used an audio-to-expression translation network to identify specific audio features and match them with facial expressions.
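
To make the idea of a parametric 3D face model concrete, here is a minimal sketch. It assumes a linear model in which a mesh is a mean shape plus weighted geometry and expression offsets, followed by a rigid head pose. This is a common formulation, but the article does not say which model the researchers used. The names `FaceParams`, `add_weighted`, and `reconstruct_mesh`, and the two-vertex toy data, are hypothetical and for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mesh = List[Vec3]


@dataclass
class FaceParams:
    """Per-frame parameters (hypothetical names, for illustration only)."""
    geometry: Sequence[float]    # identity/shape coefficients
    expression: Sequence[float]  # expression coefficients
    yaw: float                   # head rotation about the vertical axis, in radians
    translation: Vec3            # head position offset


def add_weighted(mesh: Mesh, weights: Sequence[float], bases: Sequence[Mesh]) -> Mesh:
    """Add each basis (one offset per vertex) to the mesh, scaled by its weight."""
    out = [list(v) for v in mesh]
    for w, basis in zip(weights, bases):
        for vertex, offset in zip(out, basis):
            for axis in range(3):
                vertex[axis] += w * offset[axis]
    return [(v[0], v[1], v[2]) for v in out]


def reconstruct_mesh(mean: Mesh, geo_bases: Sequence[Mesh],
                     exp_bases: Sequence[Mesh], p: FaceParams) -> Mesh:
    """Assumed linear model: mean + geometry + expression offsets, then pose."""
    shaped = add_weighted(mean, p.geometry, geo_bases)
    expressive = add_weighted(shaped, p.expression, exp_bases)
    c, s = math.cos(p.yaw), math.sin(p.yaw)
    tx, ty, tz = p.translation
    # Rotate about the y axis, then translate.
    return [(c * x + s * z + tx, y + ty, -s * x + c * z + tz)
            for x, y, z in expressive]


# Toy two-vertex "face": vertex 0 is the nose tip, vertex 1 is the chin.
MEAN = [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
GEO_BASES = [[(0.0, 0.0, 0.0), (0.0, -0.2, 0.0)]]  # a longer face
EXP_BASES = [[(0.0, 0.0, 0.0), (0.0, -0.5, 0.0)]]  # jaw opening

params = FaceParams(geometry=[1.0], expression=[0.5], yaw=0.0,
                    translation=(0.0, 0.0, 0.0))
mesh = reconstruct_mesh(MEAN, GEO_BASES, EXP_BASES, params)
print(mesh[1])  # chin position after applying shape and expression
```

Because geometry, expression, and pose enter as separate inputs, any one of them can be changed while the others stay fixed. That separation is what makes it possible to edit only the expression in a portrait frame.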

The researchers also designed an audio ID-removing network to reduce identity-specific variation across different speakers. The source and target parameters were then recombined to reconstruct a 3D facial mesh, from which a masked portrait was created. Lastly, the researchers applied a neural video rendering network to produce clear and uninterrupted background scenes.
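
The recombination of source and target parameters might look like the sketch below. It assumes that each target frame keeps its own geometry and pose and that only the expression is replaced by the audio-derived one. The article does not spell out this step, so treat it as an assumption. `FrameParams` and `retarget_expressions` are hypothetical names, and the numbers are placeholders rather than real network output.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FrameParams:
    """Hypothetical per-frame face parameters; field names are illustrative."""
    geometry: Tuple[float, ...]
    expression: Tuple[float, ...]
    pose: Tuple[float, ...]


def retarget_expressions(target_frames: Sequence[FrameParams],
                         audio_expressions: Sequence[Sequence[float]]) -> List[FrameParams]:
    """Keep each target frame's geometry and pose; swap in the audio-driven expression."""
    if len(target_frames) != len(audio_expressions):
        raise ValueError("need exactly one expression vector per target frame")
    return [replace(frame, expression=tuple(expr))
            for frame, expr in zip(target_frames, audio_expressions)]


# Two target frames of the same person with slightly different head poses.
target = [
    FrameParams(geometry=(0.3, -0.1), expression=(0.0, 0.0), pose=(0.0, 0.1, 0.0)),
    FrameParams(geometry=(0.3, -0.1), expression=(0.1, 0.0), pose=(0.0, 0.2, 0.0)),
]
from_audio = [(0.8, 0.2), (0.4, 0.6)]  # stand-in for the network's output

edited = retarget_expressions(target, from_audio)
for frame in edited:
    print(frame)
```

The edited frames differ from the originals only in their expression values. A mesh rebuilt from them would follow the source audio while keeping the target's face shape and head motion.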

Figure: Audio-to-expression network architecture

In one-to-many and many-to-one translation tests, the proposed system generalized well, producing significantly more natural appearance and movement than state-of-the-art methods.

Figure: Comparison with four major state-of-the-art methods

The paper’s first author is Linsen Song, a former SenseTime intern and a graduate student supervised by NLPR researcher Ran He. A video demonstration and interpretation of the synthesized results can be viewed on the project page.

The associated paper Everybody’s Talkin’: Let Me Talk as You Want is on arXiv.


Author: Reina Qi Wan | Editor: Michael Sarazen
