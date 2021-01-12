Even in a noisy crowd, the human perceptual system can effectively reduce auditory ambiguities to identify and isolate an active speaker — an action performed in large part by leveraging visual information. Recent AI research on speech separation has explored ways to associate lip motions in videos with audio, but this approach suffers when speakers’ lips are occluded, which they often are in busy multi-speaker environments.

Inspired by work in the cognitive sciences, a team from the University of Texas at Austin and Facebook AI Research has introduced an approach that takes as its input video of a target speaker in an environment with overlapping voices or sounds and generates an isolated soundtrack of the speaker. VisualVoice is a novel multi-task learning framework that jointly learns audio-visual speech separation together with cross-modal speaker embeddings, effectively using a person’s facial appearance to predict their vocal sounds.

The researchers explain that attributes such as gender, age, nationality and body weight, which are visible in the face, can provide a prior for sound qualities such as tone, pitch, timbre and basis of articulation. A model can use this prior to learn what to listen for, and so more accurately identify and separate an individual’s speech in a noisy environment. The network uses facial appearance, lip motion and vocal audio to perform the separation. It augments the conventional “mix-and-separate” paradigm for audio-visual separation, in which clean recordings are mixed synthetically and the model is trained to recover them, with a cross-modal contrastive loss that requires the separated voice to agree with the face. A cost-reducing feature of the proposed method is that it can be trained and tested using unlabelled video.
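To make the two ideas in the paragraph above concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the function names, the toy two-dimensional embeddings, the hinge (margin) form of the loss and the margin value of 0.5 are all assumptions chosen for illustration. It shows how a synthetic mixture can be built from two clean voices, and how a loss can penalise a separated voice whose embedding sits closer to the wrong face.

```python
import math


def mix(voice_a, voice_b):
    """Mix-and-separate, step one: sum two clean waveforms sample by sample.

    The clean inputs then serve as training targets, so no labels are needed.
    """
    return [a + b for a, b in zip(voice_a, voice_b)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def cross_modal_margin_loss(face_emb, own_voice_emb, other_voice_emb, margin=0.5):
    """Hypothetical hinge-style consistency loss (an assumption, not the paper's exact loss).

    The loss is zero once the face embedding is closer to its own separated
    voice than to the other speaker's voice by at least `margin`.
    """
    d_pos = cosine_distance(face_emb, own_voice_emb)
    d_neg = cosine_distance(face_emb, other_voice_emb)
    return max(0.0, d_pos - d_neg + margin)


# Toy data: two short "waveforms" and hand-picked 2-D embeddings.
mixture = mix([1.0, 2.0], [0.5, -2.0])

face = [1.0, 0.0]
matching_voice = [0.9, 0.1]
other_voice = [0.0, 1.0]

good_loss = cross_modal_margin_loss(face, matching_voice, other_voice)
bad_loss = cross_modal_margin_loss(face, other_voice, matching_voice)
```

In this toy setup the loss vanishes when the separated voice matches the face and becomes positive when the two voices are swapped, which is the kind of signal that could push a separator towards output that agrees with the target speaker's appearance.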

The approach was evaluated on five benchmark datasets for audio-visual speech separation, speech enhancement and cross-modal speaker verification. The evaluation used standard metrics such as Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR) and Signal-to-Artifacts Ratio (SAR), along with two speech-specific metrics: Perceptual Evaluation of Speech Quality (PESQ), which measures the overall perceptual quality of the separated speech, and Short-Time Objective Intelligibility (STOI), which is correlated with the intelligibility of the signal.
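As a rough illustration of what a distortion ratio measures, the sketch below computes a simplified, SNR-style ratio in decibels between a clean reference and a separated estimate. This is an assumption made for clarity, not the benchmark procedure: the SDR, SIR and SAR figures reported in separation work typically come from the BSS Eval decomposition, which splits the error into interference and artifact terms and is not reproduced here. The function name `simple_sdr_db` is hypothetical.

```python
import math


def simple_sdr_db(reference, estimate):
    """Simplified distortion ratio in dB: 10 * log10(signal energy / error energy).

    Treats everything that differs from the reference as distortion, without
    separating interference from artifacts as BSS Eval does.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return float("inf")  # a perfect estimate has no distortion
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy example: the estimate is the reference scaled down by 10 percent.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
score = simple_sdr_db(reference, estimate)
```

Higher values mean the estimate is closer to the clean reference. In this toy case the error energy is one hundredth of the signal energy, so the score comes out at about 20 dB.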

VisualVoice excelled in audio-visual speech separation and enhancement on challenging real-world videos, outperforming state-of-the-art (SOTA) methods on all metrics across all datasets. The researchers say the embedding learned by their model also improved on the SOTA for unsupervised cross-modal speaker verification.



Speech separation has practical applications in assistive technology for the hearing impaired, wearable AR devices, speech-to-text in noisy videos and more. In future work, the researchers say they plan to explicitly model the fine-grained cross-modal attributes of faces and voices, and leverage these to further enhance speech separation.



The paper VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency is on arXiv.

Analyst: Reina Qi Wan | Editor: Michael Sarazen; Fangyu Cai

