Speech-to-text applications have never been so plentiful, popular or powerful, with researchers’ pursuit of ever-better automatic speech recognition (ASR) system performance bearing fruit thanks to huge advances in machine learning technologies and the increasing availability of large speech datasets.
Current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. However, a lack of transcribed audio for the less widely spoken of the world’s 7,000 languages and dialects makes it difficult to train robust speech recognition systems for these languages.
To help ASR development for such low-resource languages and dialects, Facebook AI researchers have open-sourced wav2vec 2.0, a new algorithm for self-supervised learning of speech representations.

The paper Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations claims to “show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.” A Facebook AI tweet says the new algorithm can enable automatic speech recognition models with just 10 minutes of transcribed speech data.
In tests, wav2vec 2.0 outperformed the current SOTA speech recognition method Noisy Student on a 100-hour subset of LibriSpeech, a large-scale corpus of read English speech, even when the amount of labelled data was reduced to one hour.
What makes wav2vec 2.0 so powerful?
Facebook AI researchers believe learning good representations of speech is the key to success. “Learning purely from labeled examples does not resemble language acquisition in humans: infants learn language by listening to adults around them – a process that requires learning good representations of speech.” To this end, the researchers designed a framework for self-supervised learning of representations from raw audio data. By encoding speech audio with a multi-layer convolutional neural network and then masking spans of the resulting latent speech representations, the researchers can feed the latent representations to a Transformer network to build representations that capture information from the entire sequence.
In this way, the model is trained to predict the correct speech unit for the masked parts of the audio while simultaneously learning what those speech units should be. This design allows the model to build contextualized representations over continuous speech and to capture dependencies across the entire sequence of latent representations.
The framework essentially leads to more robust training for the model to better understand the raw waveforms associated with speech.
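The span-masking and masked-prediction scheme described above can be illustrated with a toy NumPy sketch. This is not the fairseq implementation: the function names, array shapes, and hyperparameter values are illustrative, but the fixed-length span masking and the cosine-similarity contrastive objective (scoring the true target for a masked frame against negatives drawn from other masked frames) follow the paper’s description.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(num_frames, mask_prob=0.065, span_len=10):
    """Pick random starting frames and mask a fixed-length span from
    each, mimicking wav2vec 2.0-style span masking of the latents."""
    mask = np.zeros(num_frames, dtype=bool)
    num_starts = max(1, int(mask_prob * num_frames))
    starts = rng.choice(num_frames, size=num_starts, replace=False)
    for s in starts:
        mask[s:s + span_len] = True
    return mask

def contrastive_loss(context, targets, mask, num_negatives=5, temperature=0.1):
    """For each masked frame, score the true target (standing in for the
    quantized latent) against negatives sampled from other masked frames,
    using cosine similarity, and average the cross-entropy losses."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    masked_idx = np.flatnonzero(mask)
    losses = []
    for t in masked_idx:
        pos = cos(context[t], targets[t])
        others = masked_idx[masked_idx != t]
        negs = rng.choice(others, size=min(num_negatives, len(others)),
                          replace=False)
        logits = np.array([pos] + [cos(context[t], targets[n]) for n in negs])
        logits = logits / temperature
        # cross-entropy with the true target at index 0
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))
```

During pretraining, a Transformer producing context vectors well aligned with the true targets at masked positions would drive this loss down, which is what pushes the network to infer the masked speech units from the surrounding context.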


Speech recognition models enabled by wav2vec 2.0 achieved SOTA performance with a word error rate (WER) of 8.6 percent on noisy speech and 5.2 percent on clean speech on the standard LibriSpeech benchmark, using just 10 minutes of transcribed speech (i.e. labelled data) for fine-tuning after pretraining on 53k hours of unlabelled data.
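Word error rate, the metric behind these results, is simply the word-level edit distance between a system’s transcript and the reference, divided by the number of reference words. A minimal illustration, not tied to any particular toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between the
    hypothesis and the reference, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)
```

For example, transcribing a six-word reference with one substituted word gives a WER of 1/6, roughly 16.7 percent; the 5.2 percent figure above means about one word in twenty was wrong on clean speech.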

Facebook AI believes the new wav2vec 2.0 self-supervised algorithm can enable speech recognition models to be built with very small amounts of annotated data and still perform with excellent accuracy. This would benefit many low-resource and under-represented languages and dialects on speech recognition tasks and enable a wide range of associated applications. Facebook AI is currently adapting the wav2vec 2.0 implementation to run on Cloud TPUs.
The paper Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations is on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen
