AI-powered speech recognition systems have made remarkable progress in recent years, with speech-to-text processing now so reliable that occasional errors are little more than curious exceptions. Most contemporary models, however, require massive labelled training data, which is easy enough to source for English, Chinese, and other popular languages but challenging to obtain for the low-resource tongues that make up the majority of the world’s 8,000 languages.
To address this issue, a Carnegie Mellon University research team has developed a speech recognition pipeline that can recognize 1909 languages without any audio for the target language. The pipeline, dubbed ASR2K, is introduced in the paper ASR2K: Speech Recognition for Around 2000 Languages Without Audio and achieves 45 percent CER and 69 percent WER on the CMU Wilderness dataset when using 10,000 raw text utterances.
The proposed pipeline comprises separate acoustic, pronunciation, and language models. The acoustic model recognizes the phonemes of the target language, including languages unseen during training. The pronunciation model is a grapheme-to-phoneme (G2P) model that predicts a word’s phoneme sequence from its grapheme sequence. Both the acoustic and pronunciation models are first trained on supervised datasets from high-resource languages and then transfer their learned linguistic knowledge to low-resource languages without supervision.
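To make the G2P idea concrete, here is a deliberately simplified sketch of grapheme-to-phoneme conversion. The rule table and example words are invented, Spanish-like illustrations, not the paper’s learned neural G2P model, which generalizes to unseen languages rather than relying on hand-written rules:

```python
# Toy grapheme-to-phoneme (G2P) lookup table; rules are invented examples.
G2P_RULES = {
    "ch": "tʃ",  # digraph: must be matched before single letters
    "qu": "k",
    "a": "a", "b": "b", "c": "k", "e": "e",
    "h": "",     # silent grapheme
    "i": "i", "o": "o", "s": "s", "t": "t", "u": "u",
}

def g2p(word: str) -> list[str]:
    """Greedy longest-match conversion of a grapheme string to phonemes."""
    phones, i = [], 0
    while i < len(word):
        for length in (2, 1):                      # try digraphs first
            chunk = word[i:i + length]
            if chunk in G2P_RULES:
                if G2P_RULES[chunk]:               # skip silent graphemes
                    phones.append(G2P_RULES[chunk])
                i += length
                break
        else:
            phones.append(word[i])                 # pass unknowns through
            i += 1
    return phones

print(g2p("chato"))  # ['tʃ', 'a', 't', 'o']
```

A real G2P model replaces the rule table with a trained sequence model, which is what lets ASR2K produce pronunciations for languages it has never seen audio for.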
The team builds the ASR2K language model from raw text datasets or n-gram statistics. Each word’s pronunciation is approximated using the pronunciation model, and this information is encoded into a lexicon graph. A raw text dataset also lets the pipeline estimate a classical n-gram language model by counting n-gram statistics. This language model is then combined with the lexicon graph to build a weighted finite-state transducer (WFST) decoder.
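The n-gram counting step is the classical one: count word sequences in the raw text and turn the counts into conditional probabilities. A minimal bigram sketch (the toy utterances and the smoothing-free maximum-likelihood estimate are illustrative assumptions, not the paper’s exact setup):

```python
from collections import Counter

# Toy stand-in for a target language's raw text utterances.
utterances = ["the cat sat", "the cat ran", "the dog sat"]

unigrams, bigrams = Counter(), Counter()
for utt in utterances:
    tokens = ["<s>"] + utt.split() + ["</s>"]     # sentence boundary markers
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))       # count adjacent word pairs

def bigram_prob(w1: str, w2: str) -> float:
    """Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2/3, since "the" is followed by "cat" twice
```

In the full pipeline these probabilities become arc weights on the grammar side of the WFST, composed with the lexicon graph so the decoder can score phoneme sequences as word sequences.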
In their empirical study, the team applied the proposed method to 1909 languages on the Crúbadán large endangered languages n-gram database and tested it on 34 languages from the Common Voice dataset and 95 languages from the CMU Wilderness Multilingual Speech dataset.
In the evaluations, the proposed ASR2K pipeline achieved 50 percent CER (character error rate) and 74 percent WER (word error rate) scores using Crúbadán’s statistics only; and reached 45 percent CER and 69 percent WER when using 10,000 raw text utterances.
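Both metrics are normalized edit distances: WER counts word-level substitutions, insertions, and deletions against the reference, while CER does the same at the character level. A minimal sketch of how such scores are computed:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1/3: one word wrong out of three
```

A 69 percent WER thus means roughly seven in ten words need correction, which is why the authors position ASR2K as a starting point for truly zero-audio languages rather than a production-ready system.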
The researchers believe theirs is the first attempt to build a speech recognition pipeline for thousands of languages without audio. The paper and associated code will be published at the 23rd INTERSPEECH Conference, which runs from September 18 to 22 in Incheon, South Korea.
Author: Hecate He | Editor: Michael Sarazen