Facebook AI researchers and engineers just made live video content more accessible by enabling automatic closed captions for Facebook Live and Workplace Live, the company announced today.



The COVID-19 pandemic has caused a spike in both the supply of and the demand for public health information. Around the world, people have been streaming newscasts and conferences on Facebook Live for longer than usual, and the total number of Facebook Live broadcasts from Pages in June 2020 was double that of the same period last year.



Thanks to Facebook AI’s ongoing advances in automated speech recognition (ASR), Facebook Live automatic captions can help governments efficiently disseminate crucial public health information and ensure that millions of viewers with hearing impairments also get the message. Automatic captioning is also helping employers keep all their staff and customers informed.



Six languages are supported in the new initiative: English, Spanish, Portuguese, Italian, German and French.



Although automated captioning technologies that predict a sequence of words from a raw audio signal have been around since the late 2000s, this remains a difficult task, especially when conversations take place in livestreams, where people don’t always speak clearly or wait their turn to speak. Other issues such as unpredictable background noise and the wide range of accents and tones in human speech make ASR even more challenging.



Conventional ASR systems are generally made up of three components: an acoustic model that predicts phonemes from short segments of audio, a pronunciation lexicon that describes how those phonemes combine to form the words of a given language, and a language model that captures the relationships among those words.
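To make the three-part structure concrete, here is a minimal Python sketch of such a pipeline. It is a toy illustration and not Facebook's system: the phoneme symbols, the lexicon entries, the bigram scores and every function name (`toy_frames`, `acoustic_model`, `word_sequences`, `lm_score`, `decode`) are assumptions invented for this example. A real acoustic model would be a trained network, and a real decoder would weigh acoustic and language model scores together instead of running the stages one after another.

```python
# Toy lexicon (assumed data): pronunciation -> words that share it.
LEXICON = {
    ("AY",): ["I", "eye"],
    ("S", "IY"): ["see", "sea"],
    ("DH", "AH"): ["the"],
}

# Toy bigram language model (assumed log-probabilities).
BIGRAM_LOGPROB = {
    ("<s>", "I"): -0.5,
    ("I", "see"): -0.7,
    ("see", "the"): -0.9,
    ("the", "sea"): -1.2,
    ("the", "eye"): -1.5,
}
UNSEEN_LOGPROB = -10.0  # penalty for word pairs the toy model has never seen


def toy_frames(phonemes, frames_per_phoneme=2):
    """Stand-in for real audio: per-frame phoneme scores for a known utterance."""
    return [{p: 0.8, "SIL": 0.2} for p in phonemes for _ in range(frames_per_phoneme)]


def acoustic_model(frames):
    """Pick the best-scoring phoneme for each frame, merging consecutive repeats."""
    phonemes = []
    for scores in frames:
        best = max(scores, key=scores.get)
        if not phonemes or phonemes[-1] != best:
            phonemes.append(best)
    return phonemes


def word_sequences(phonemes):
    """Use the lexicon to enumerate every word sequence that spells the phonemes."""
    if not phonemes:
        yield []
        return
    for end in range(1, len(phonemes) + 1):
        for word in LEXICON.get(tuple(phonemes[:end]), []):
            for rest in word_sequences(phonemes[end:]):
                yield [word] + rest


def lm_score(words):
    """Sum bigram log-probabilities, starting from the sentence-start token."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += BIGRAM_LOGPROB.get((prev, word), UNSEEN_LOGPROB)
        prev = word
    return total


def decode(frames):
    """Acoustic model -> lexicon -> language model, applied in sequence."""
    return max(word_sequences(acoustic_model(frames)), key=lm_score)


if __name__ == "__main__":
    frames = toy_frames(["AY", "S", "IY", "DH", "AH", "S", "IY"])
    print(" ".join(decode(frames)))
```

In this sketch the lexicon alone cannot tell "see" from "sea" or "I" from "eye", since each pair shares a pronunciation. The language model settles the choice by favoring word orders it has seen before.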



Facebook engineers have deployed several variations of their model, along with a number of infrastructure optimizations, to handle the additional livestream traffic while reducing the compute required.



Although the system was trained on many different types of speech, it's still far from perfect, particularly when it comes to accents. Because it is difficult to collect sufficient training data for every type of accent, Facebook researchers are now exploring ways to improve their models by having them also learn from the vast amounts of unlabelled audio available online.

Reporter: Yuan Yuan | Editor: Michael Sarazen

