NVIDIA Neural Talking-Head Synthesis Makes Video Conferencing 10x More Bandwidth Efficient

Whether it’s for a business meeting, online learning, or catching up with the cousins and grandma, the use of video conferencing applications has surged in this year’s COVID-19 environment. A new report from Grand View Research projects the global video conferencing market will top US$6.7 billion by 2025.

To meet the demand for high-quality video conferencing, in October, tech giant Nvidia rolled out its fully accelerated Maxine software development kit for video conferencing services. Maxine is designed to help developers build and deploy AI-powered features in their applications without creating huge corresponding resource requirements. Nvidia boasts that “video conferencing applications based on Maxine can reduce video bandwidth usage down to one-tenth of H.264 using AI video compression, dramatically reducing costs.” (H.264 is the current industry-standard video format for encoding and decoding video signals, as it allows transmission of high-quality video signals without excessive demand on bandwidth.)

Most people who make video calls have experienced occasional break-ups, jitters, and freezes. These frustrating phenomena usually result from the heavy bandwidth demands of the video conferencing app. Users would of course like consistently smooth video calls no matter the state of their Internet connection or whether they're using a powerful desktop computer or a low-end phone or tablet. But how?

In the new paper One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing, Nvidia researchers detail a novel AI-based video compression technology that's earning praise across the ML community. The approach dramatically reduces bandwidth requirements by sending only a keypoint representation [of faces] and reconstructing the source video on the receiver side, using generative adversarial networks (GANs) to synthesize the talking heads.

Current video calling systems typically transmit a compressed video signal comprising massive streams of pixel-packed images via participants’ Internet connections (which often cannot handle the load). The Nvidia approach restricts the transmitted data to only a few keypoint locations around the caller’s eyes, nose, and mouth.
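To give a rough sense of why sending keypoints is so much lighter than sending pixels, here is a minimal sketch that serializes one frame's worth of keypoints and compares the size with an uncompressed RGB frame. The keypoint count, the float32 encoding, the 640x480 resolution, and the function names are all illustrative assumptions, not details taken from the article. The comparison is against raw pixels, not against an H.264 stream, so it does not reproduce Nvidia's one-tenth figure.

```python
import struct

# Hypothetical number of 3D keypoints sent per frame (an assumption).
NUM_KEYPOINTS = 20


def pack_keypoints(keypoints):
    """Serialize (x, y, z) keypoints as little-endian float32 values."""
    flat = [coord for kp in keypoints for coord in kp]
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_keypoints(payload):
    """Recover the list of (x, y, z) tuples on the receiver side."""
    count = len(payload) // 4  # 4 bytes per float32
    flat = struct.unpack(f"<{count}f", payload)
    return [tuple(flat[i:i + 3]) for i in range(0, count, 3)]


def raw_frame_bytes(width, height, channels=3):
    """Size of one uncompressed frame at 8 bits per channel."""
    return width * height * channels


# Made-up keypoint coordinates, used only to measure the payload size.
keypoints = [(0.01 * i, -0.02 * i, 0.5) for i in range(NUM_KEYPOINTS)]
payload = pack_keypoints(keypoints)

print(len(payload))               # 240 bytes of keypoint data
print(raw_frame_bytes(640, 480))  # 921600 bytes of raw pixels
```

Under these assumptions the per-frame keypoint payload is a few hundred bytes. The receiver still needs a source image of the caller, sent once, to synthesize each frame from the keypoints.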

The proposed system first extracts appearance features and 3D canonical keypoints from the source image. These are used to compute the source keypoints and to generate keypoints for the synthesized video. The system decomposes the keypoint representation into person-specific canonical keypoints and motion-related transformations, using the 3D keypoints to model both facial expressions and the person's geometric signature, so that the synthesized talking-head video carries expression and head pose information. The rendering technique can also synthesize accessories that appear in the source video, such as eyeglasses, hats, and scarves.
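The decomposition described above can be illustrated with a small sketch. It assumes that the motion-related transformation consists of a head rotation, a translation, and a per-keypoint expression offset applied to the person-specific canonical keypoints. That form, along with the function names and sample values, is an assumption made for illustration and may differ from the paper's exact formulation.

```python
import math


def rotation_about_y(yaw_radians):
    """3x3 rotation matrix for turning the head about the vertical axis."""
    c, s = math.cos(yaw_radians), math.sin(yaw_radians)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def transform_keypoints(canonical, rotation, translation, offsets):
    """Apply rotation, translation, and expression offsets to each keypoint."""
    result = []
    for kp, delta in zip(canonical, offsets):
        # Rotate the canonical keypoint (matrix-vector product).
        rotated = [sum(rotation[r][c] * kp[c] for c in range(3))
                   for r in range(3)]
        # Shift by the head translation and add the expression offset.
        result.append(tuple(rotated[i] + translation[i] + delta[i]
                            for i in range(3)))
    return result


# Two made-up canonical keypoints for one person.
canonical = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Hypothetical motion for one frame: 90-degree yaw, shift along z,
# and a small expression offset on the second keypoint.
rotation = rotation_about_y(math.pi / 2)
translation = (0.0, 0.0, 2.0)
offsets = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0)]

driving = transform_keypoints(canonical, rotation, translation, offsets)
print(driving)  # approximately [(0, 0, 1), (0, 1.1, 2)]
```

In a scheme like this, the canonical keypoints are computed once from the source image, and only the per-frame motion parameters change. Choosing a different rotation on the receiver side would also change the rendered head pose, which is one way to read the "free-view" in the paper's title.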

Of course, nobody stays still in a video call. Can users nod, rotate, or otherwise move their heads naturally without compromising synthesis results? Yes. The researchers included a pretrained face recognition network and a pretrained head pose estimator to ensure that head poses and angles in the generated images are accurate and visually acceptable.

The team examined the proposed method on talking-head synthesis tasks such as video reconstruction, motion transfer, and face redirection, where it outperformed methods such as FOMM, few-shot vid2vid (fs-vid2vid), and bi-layer neural avatars (bilayer) on benchmark datasets.

Reaction from the AI community has also been very positive. Ian Goodfellow — the renowned research scientist who pioneered generative adversarial networks (GANs) — sent kudos to the team on their success: “This is really cool. Some of my PhD labmates worked on ML for compression back in the pretraining era, and I remember it being really hard to get a compression advantage.”

The paper One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing is on arXiv.


Reporter: Fangyu Cai | Editor: Michael Sarazen

