Google’s Universal Speech Model Scales Automatic Speech Recognition to 100+ Languages

Google’s error-marred unveiling of its Bard chatbot in Paris last month was disappointing, to say the least — but don’t count the tech giant out of the AI language model race just yet. Google bounced back this week, taking a big step forward on a project it launched last November: the 1,000 Languages Initiative, which aims to build a universal model that supports the world’s 1,000 most-spoken languages.

In the new paper Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages, a Google team “explores the frontiers of language expansion,” proposing a scalable self-supervised training framework for multilingual ASR (automatic speech recognition) that extends to hundreds of languages. Their resulting Universal Speech Models (USM) achieve state-of-the-art performance on multilingual ASR and speech-to-text translation tasks.

The team summarizes their main contributions as follows:

We demonstrate that USMs pretrained on 300 languages can successfully adapt to both ASR and AST (automatic speech translation) tasks in new languages with a small amount of supervised data.
We build a generic ASR model on 73 languages by fine-tuning pretrained models on 90k hours of supervised data. We show that the generic ASR models can carry out inference efficiently on TPUs and can reliably transcribe hours-long audio on YouTube Caption ASR benchmarks.
We conduct a systematic study on the effects of pretraining, noisy student training, text injection, and model size for multilingual ASR.

The team uses the Conformer, a convolution-augmented transformer that Google introduced in 2020, as their backbone model. The USM training process uses 12 million hours of speech and 28 billion sentences of text spanning 300+ languages in a three-step pipeline:

1) The Conformer is pretrained on the large unlabelled multilingual speech dataset YT-NTL-U using BERT-based speech pretraining with a random-projection quantizer (BEST-RQ).
2) Multi-objective supervised pretraining is applied to optimize multiple objectives with an RNN-T decoder on unlabelled text.
3) The pretrained encoder is fine-tuned for downstream ASR and AST tasks.
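To make the first step more concrete, here is a minimal sketch of the idea behind a random-projection quantizer, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy dimensions, the Gaussian initialization of the projection, and the per-frame masking are all hypothetical choices made for readability. The point it shows is that the projection matrix and the codebook are random and never trained; they only turn speech frames into discrete labels that the encoder then learns to predict at masked positions.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_quantizer(feat_dim, code_dim, codebook_size, seed=0):
    """Build a frozen random projection and a frozen random codebook.

    Assumption: both are drawn from a standard normal here for simplicity.
    Neither is ever updated during training.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    codebook = [normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(codebook_size)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each speech frame to the index of its nearest codebook entry."""
    code_dim = len(projection[0])
    labels = []
    for frame in frames:
        projected = normalize([
            sum(value * projection[i][j] for i, value in enumerate(frame))
            for j in range(code_dim)
        ])
        distances = [
            sum((p - c) ** 2 for p, c in zip(projected, code))
            for code in codebook
        ]
        labels.append(distances.index(min(distances)))
    return labels


def mask_frames(frames, mask_prob, rng):
    """Replace a random subset of frames with small Gaussian noise.

    Assumption: frames are masked independently; real systems mask spans.
    """
    masked, positions = [], []
    for t, frame in enumerate(frames):
        if rng.random() < mask_prob:
            masked.append([rng.gauss(0.0, 0.1) for _ in frame])
            positions.append(t)
        else:
            masked.append(list(frame))
    return masked, positions


# Toy input: 20 frames of 8-dimensional "speech features".
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]

projection, codebook = make_quantizer(feat_dim=8, code_dim=4, codebook_size=16)
targets = quantize(frames, projection, codebook)       # labels from clean audio
masked_frames, positions = mask_frames(frames, 0.3, rng)

# The encoder would see masked_frames and be trained to predict these labels.
training_targets = {t: targets[t] for t in positions}
```

In a real setup, the encoder (here, the Conformer) would receive `masked_frames` and be trained with a classification loss to predict `training_targets`. Because the quantizer needs no training of its own, the labels come for free from unlabelled audio.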

The team evaluated USM performance on ASR and AST tasks in their empirical study. USMs achieved state-of-the-art ASR results on the FLEURS benchmark across 102 languages and state-of-the-art AST results on the CoVoST-2 speech translation corpus of 21 languages. The researchers note that the USM training process can effectively adapt to new languages and data, and regard USM development as an essential step toward realizing “Google’s mission to organize the world’s information and make it universally accessible.”

The paper Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages is on arXiv.

Author: Hecate He | Editor: Michael Sarazen


