Google’s Universal Speech Model Scales Automatic Speech Recognition to 100+ Languages

In the new paper Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages, Google introduces the Universal Speech Model (USM), a scalable self-supervised training framework that extends automatic speech recognition to more than 100 languages.

Google’s error-marred unveiling of its Bard chatbot in Paris last month was disappointing, to say the least — but don’t count the tech giant out of the AI language model race just yet. Google bounced back this week, taking a big step forward on a project it launched last November: the 1,000 Languages Initiative, which aims to build a universal model that supports the world’s 1,000 most-spoken languages.

In the new paper Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages, a Google team “explores the frontiers of language expansion,” proposing a scalable self-supervised training framework for multilingual ASR (automatic speech recognition) that extends to hundreds of languages. Their resulting Universal Speech Models (USM) achieve state-of-the-art performance on multilingual ASR and speech-to-text translation tasks.

The team summarizes their main contributions as follows:

  1. We demonstrate that USMs pretrained on 300 languages can successfully adapt to both ASR and AST (automatic speech translation) tasks in new languages with a small amount of supervised data.
  2. We build a generic ASR model on 73 languages by fine-tuning pretrained models on 90k hours of supervised data. We show that the generic ASR models can carry out inference efficiently on TPUs and can reliably transcribe hours-long audio on YouTube Caption ASR benchmarks.
  3. We conduct a systematic study on the effects of pretraining, noisy student training, text injection, and model size for multilingual ASR.

The team uses the Conformer, a convolution-augmented transformer that Google introduced in 2020, as the backbone model. USM training draws on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages, in a three-step pipeline: 1) the Conformer encoder is pretrained on the large unlabelled multilingual speech dataset YT-NTL-U using BERT-based speech pretraining with a random-projection quantizer (BEST-RQ); 2) multi-objective supervised pretraining, with an RNN-T decoder, optimizes several objectives at once to bring in knowledge from unlabelled text; and 3) the pretrained encoder is fine-tuned for downstream ASR and AST tasks.
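To make the first step more concrete, here is a minimal sketch of the idea behind a random-projection quantizer: a frozen random matrix projects each speech feature frame, the nearest entry in a frozen random codebook gives a discrete label, and the encoder is then trained to predict those labels at masked positions. This is an illustration only, not the USM implementation. The class and function names, the dimensions, the per-frame masking (rather than span masking), and the noise standard deviation of 0.1 are all assumptions chosen for the example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Hypothetical helper: frozen random projection plus frozen random codebook."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Random projection matrix with code_dim rows and input_dim columns.
        # It is initialized once and never updated.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        # Random codebook of unit-length vectors, also never updated.
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def quantize(self, frame):
        """Map one feature frame to the index of its nearest codebook entry."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        # For unit vectors, the nearest entry is the one with the largest dot product.
        scores = [
            sum(c * p for c, p in zip(code, projected)) for code in self.codebook
        ]
        return max(range(len(scores)), key=scores.__getitem__)


def make_masked_example(frames, quantizer, mask_prob=0.3, seed=0):
    """Return (masked_frames, targets); targets are None at unmasked positions."""
    rng = random.Random(seed)
    masked, targets = [], []
    for frame in frames:
        if rng.random() < mask_prob:
            # Assumed masking scheme: replace the frame with small Gaussian noise
            # and keep the code of the ORIGINAL frame as the prediction target.
            masked.append([rng.gauss(0.0, 0.1) for _ in frame])
            targets.append(quantizer.quantize(frame))
        else:
            masked.append(list(frame))
            targets.append(None)
    return masked, targets
```

In a real pretraining setup the masked frames would be fed to the encoder, and a softmax layer on top would be trained with cross-entropy against the non-`None` targets. Because neither the projection nor the codebook is learned, the quantizer adds no trainable parameters, which is part of what makes the approach easy to scale.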

The team evaluated USM on ASR and AST tasks in their empirical study. The models achieved state-of-the-art ASR results on the FLEURS benchmark across 102 languages and state-of-the-art AST results on the CoVoST-2 speech translation corpus, which covers 21 languages. The researchers note that the USM training process can effectively adapt to new languages and data, and they regard USM development as an essential step toward realizing “Google’s mission to organize the world’s information and make it universally accessible.”

The paper Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

