Like the lipreading spies of yesteryear peering through their binoculars, almost all visual speech recognition (VSR) research these days focuses on mouth and lip motion. But a new study suggests that VSR models could perform even better if they used additional available visual information.
The VSR field typically looks at the mouth region because lip shape and motion are believed to contain almost all the information correlated with speech. As a result, information in other facial regions has been considered weak by default. But a new paper from the Key Laboratory of Intelligent Information Processing of the Chinese Academy of Sciences and the University of Chinese Academy of Sciences proposes that information from extraoral facial regions can consistently benefit state-of-the-art (SOTA) VSR model performance. The research is supported by several psychology studies suggesting that human gaze behaviour during visual and audio-visual speech perception involves repeated transitions between the eyes and the mouth.
To examine the impact of feeding additional facial regions to VSR models, researchers first needed to make fundamental changes to how the input data is prepared. Because almost all current VSR research crops regions-of-interest (RoIs) around the mouth after locating face bounding boxes and landmarks, the team departed from this convention by training baselines on a variety of sub-face RoIs: the whole face, the upper face (including the eyes), the cheeks, and the mouth.
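The idea of training on different sub-face RoIs can be illustrated with a minimal sketch. It assumes an already aligned face crop stored as a NumPy array; the `ROI_BOXES` table, its percentage values, and the `crop_roi` helper are hypothetical and chosen only for illustration. They are not the region definitions used in the paper, which locates regions from face bounding boxes and landmarks.

```python
import numpy as np

# Hypothetical RoI boxes, given as (top, bottom, left, right) percentages of an
# aligned face crop. These values are illustrative assumptions, not the paper's.
ROI_BOXES = {
    "whole_face": (0, 100, 0, 100),
    "upper_face": (0, 50, 0, 100),
    "cheeks": (40, 80, 0, 100),
    "mouth": (60, 100, 25, 75),
}


def crop_roi(face, roi):
    """Return the named sub-face region of an aligned face image (H, W[, C])."""
    height, width = face.shape[:2]
    top, bottom, left, right = ROI_BOXES[roi]
    return face[
        top * height // 100 : bottom * height // 100,
        left * width // 100 : right * width // 100,
    ]


# A blank 100x100 array stands in for one aligned face frame.
face = np.zeros((100, 100), dtype=np.uint8)
shapes = {name: crop_roi(face, name).shape for name in ROI_BOXES}
```

In a setup like this, each baseline would be trained on frames cropped with a single RoI name, so that the results can be compared region by region.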
Researchers trained SOTA deep VSR models on large-scale, “in-the-wild” VSR datasets, which are more complex and challenging than scripted speech datasets that are designed and constructed with research questions in mind. In-the-wild data also reflects real-world variation in pose, lighting, scale, makeup and expression. By observing the networks when exposed to such complex data, researchers could determine whether and how SOTA VSR networks might benefit from extraoral region information.
Researchers trained and evaluated models on three VSR benchmarks that cover tonal and non-tonal languages as well as in-the-wild and scripted speech: the Lip Reading in the Wild (LRW) dataset, the recent LRW-1000 dataset, and the GRID audiovisual corpus. The goal was to quantitatively estimate the contribution of different facial regions to a VSR task.
The team proposed Cutout, a popular CNN image regularization technique, as a practical approach to encourage the models to use all facial regions. In this way, researchers could analyze which RoI would be appropriate to draw from, and also increase the model’s robustness against occlusion within the mouth and other facial regions. The paper explains that Cutout augments the dataset with partially occluded versions of the dataset samples, an ‘adversarial erasing’ process designed to help the model pay more attention to less significant motion signals related to speech in extraoral regions.
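For readers unfamiliar with Cutout, the sketch below shows the basic operation: a randomly placed square patch of the input is zeroed out, so the model cannot rely on any single region. This is a minimal illustration and not the authors' implementation. The `cutout` function name, the patch size, the clip dimensions, and the choice to erase the same location in every frame of a clip are all assumptions made for the example.

```python
import numpy as np


def cutout(frames, patch_size, rng):
    """Zero out one square patch at the same location in every frame.

    frames: array of shape (T, H, W) or (T, H, W, C).
    The patch is centred on a random pixel and clipped at the image border,
    so the erased area can be smaller than patch_size x patch_size.
    """
    _, height, width = frames.shape[:3]
    center_y = int(rng.integers(0, height))
    center_x = int(rng.integers(0, width))
    half = patch_size // 2
    top, bottom = max(center_y - half, 0), min(center_y + half, height)
    left, right = max(center_x - half, 0), min(center_x + half, width)
    out = frames.copy()  # leave the original clip untouched
    out[:, top:bottom, left:right] = 0
    return out


# Illustrative values only: an 8-frame clip of 96x96 face crops filled with ones.
rng = np.random.default_rng(0)
clip = np.ones((8, 96, 96), dtype=np.float32)
augmented = cutout(clip, patch_size=48, rng=rng)
```

Applied on the fly during training, an augmentation like this occasionally hides the mouth or other regions, which is the mechanism the paper describes as pushing the model toward extraoral cues.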
Experimental results indicate that simple Cutout augmentation with aligned face inputs can significantly improve speech recognition performance, since the model is forced to learn the weaker extraoral cues from the data. The upper face and cheeks were identified as particularly beneficial regions.
The new approach’s demonstrated improvements over current SOTA methods that use only lip regions as RoIs could inspire researchers to seek ways to expand their data choices. The improved performance can also provide insights for other speech-related vision tasks such as audiovisual speech enhancement and realistic talking face generation. The research team says future work in this area could include tweaks to input resolution and automatic face alignment, as well as expanding contextual cues to boost sentence-level performance.
The paper Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen