AI Machine Learning & Data Science Research

CMU & Google Extend Pretrained Models to Thousands of Underrepresented Languages Without Using Monolingual Data

A research team from Carnegie Mellon University and Google systematically explores strategies for leveraging the relatively under-studied resource of bilingual lexicons to adapt pretrained multilingual models to low-resource languages. Their resulting Lexicon-based Adaptation approach produces consistent performance improvements without requiring additional monolingual text.

While pretrained large language models continue to make dramatic progress on natural language processing (NLP) tasks, these models tend to focus on English and other popular international languages due to their widespread usage and the massive amounts of monolingual or parallel text available in datasets and on the Internet. Unfortunately, most of the world's languages do not fit these criteria, as little or even no textual data is available for them.

To address this issue, a research team from Carnegie Mellon University and Google explores strategies for leveraging the under-studied resource of bilingual lexicons to adapt pretrained multilingual models to low-resource languages. Their resulting Lexicon-based Adaptation approach uses these lexicons to synthesize textual or labelled data, producing consistent performance improvements without the need for additional monolingual text.

Most existing methods for adapting pretrained multilingual models to low-resource languages rely on training with monolingual text in the target language, which effectively caps final model performance at the amount of publicly available text in that language.

The researchers note that even the state-of-the-art pretrained multilingual BERT model (mBERT; Devlin et al., 2019) covers less than one percent of the world’s estimated 7,000 languages, and that large-scale sources such as Wikipedia and CommonCrawl include textual data from only about four percent of these tongues.

Bilingual lexicons (also known as word lists) are language documentation tools traditionally compiled by historians and linguists, and they cover resource-poor languages far better than the aforementioned sources. The researchers leverage these lexicons to design a novel data augmentation framework.

Given a bilingual lexicon that pairs source- and target-language words, synthetic sentences in the target language can be created from source-language sentences via word-to-word translation. This synthetic data is then used for pseudo masked language model (MLM) adaptation and pseudo trans-train adaptation in both no-text and few-text settings.
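The core synthesis step can be illustrated with a short Python sketch. The toy lexicon, the example sentence, and the fallback of keeping untranslated words are assumptions made for illustration rather than the authors' exact pipeline; the point is simply that each source token is swapped for its lexicon entry, so the resulting pseudo text can feed MLM-style adaptation and token-level labels can be carried over for labelled tasks.

```python
# A toy English -> target-language bilingual lexicon (word list); both the
# entries and the fallback behaviour are illustrative assumptions, not the
# authors' exact pipeline.
lexicon = {
    "the": "le",
    "dog": "chien",
    "runs": "court",
    "fast": "vite",
}

def word_to_word_translate(sentence, lexicon, keep_unknown=True):
    """Synthesize a pseudo target-language sentence by swapping each source
    word for its lexicon entry; words missing from the lexicon are kept
    as-is (one possible fallback) so token-level labels still line up."""
    out = []
    for token in sentence.split():
        translation = lexicon.get(token.lower())
        if translation is not None:
            out.append(translation)
        elif keep_unknown:
            out.append(token)
    return " ".join(out)

source_sentences = ["the dog runs fast"]
pseudo_corpus = [word_to_word_translate(s, lexicon) for s in source_sentences]
print(pseudo_corpus)  # ['le chien court vite'] -> usable for pseudo-MLM adaptation
```

Because the translation is strictly word-to-word, each synthetic token inherits the label of its source counterpart, which is what makes the same synthetic data usable for labelled tasks such as NER and POS tagging.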

The researchers also introduce two strategies for refining their synthetic data: label distillation, which automatically “corrects” the labels of the pseudo data using a teacher model; and induced lexicons, which leverage available parallel data to further improve the quality of the augmented data.
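The label distillation idea can be sketched as follows. Here `teacher_predict` is a hypothetical stand-in for a tagger fine-tuned on labelled high-resource data; the authors' actual model and training setup may differ, but the essence is that the teacher's predictions replace the noisier labels projected through word-to-word translation.

```python
from typing import Callable, List, Tuple

def distill_labels(
    pseudo_data: List[Tuple[List[str], List[str]]],
    teacher_predict: Callable[[List[str]], List[str]],
) -> List[Tuple[List[str], List[str]]]:
    """Replace the labels projected through word-to-word translation with
    the teacher model's predictions, which are assumed to be less noisy."""
    corrected = []
    for tokens, _projected_labels in pseudo_data:
        # The projected labels are discarded in favour of the teacher's output.
        corrected.append((tokens, teacher_predict(tokens)))
    return corrected

# Usage with a trivial stand-in teacher that tags every token as 'O':
dummy_teacher = lambda tokens: ["O"] * len(tokens)
pseudo_data = [(["le", "chien", "court"], ["O", "B-MISC", "O"])]
print(distill_labels(pseudo_data, dummy_teacher))
```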

The team applied their approach to mBERT and evaluated it on three tasks: named entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing (DEP).

The proposed approach boosted performance on the 19 underrepresented languages studied, producing consistent F1 score improvements of up to 5 and 15 points with and without extra monolingual text, respectively. Overall, the study shows that it is possible to make concrete progress toward including underrepresented languages in the development of pretrained language models by utilizing alternative data sources.

The paper Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
