CMU, DeepMind & Google’s XTREME Benchmarks Multilingual Model Generalization Across 40 Languages

XTREME is a multi-task benchmark that evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and nine tasks.

An essential step toward more general-purpose natural language processing (NLP) models is to achieve a certain level of multilingual competence. But because most of the world’s estimated 6,900 languages lack sufficient data to train robust models, existing NLP research and methods still tend to focus on specific English-language tasks.

Junjie Hu is a PhD student at the Language Technologies Institute of Carnegie Mellon University. Hu previously interned at Google, where he observed that cross-lingual models were difficult to train because there were no established, comprehensive tasks and environments for evaluating and comparing model performance on cross-lingual generalization.

Although recent multilingual approaches like mBERT and XLM have shown impressive results in learning general-purpose multilingual representations, a fair comparison between these models remains difficult as most evaluations focus on different sets of tasks designed for similar languages.

Hu and another CMU researcher, together with DeepMind’s Sebastian Ruder and researchers from Google, recently published a study introducing XTREME, a multi-task benchmark that evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and nine tasks. Hu told Synced, “Hopefully XTREME can encourage more research efforts in building multilingual NLP models and effective human curations for multilingual resources.”

Based on the researchers’ analysis, the cross-lingual transfer performance of current models varies significantly across both tasks and languages. To maximize language diversity for the benchmark, the team selected 40 languages, drawn from various language families and written in diverse scripts, out of the 100 languages with the most Wikipedia articles.

“We also made sure to cover languages with low, medium, and high resource, in other words, find a balance between language diversity and resource availability,” Hu said. The research included under-studied languages such as the Dravidian language Tamil, spoken in southern India, Sri Lanka, and Singapore, as well as the Niger-Congo languages Swahili and Yoruba.

Hu says there is an ongoing effort to extend XTREME to cover up to 100 languages.

Each task covers a subset of the 40 languages, so in order for a model to succeed on the XTREME benchmark it needs to learn multilingual representations that summarize linguistic information at different levels and generalize to the diverse set of cross-lingual transfer tasks.

XTREME focuses on the zero-shot cross-lingual transfer scenario, in which annotated training data is provided in English but none is provided in the target language. To be evaluated on XTREME, models must first be pretrained on multilingual text using objectives that encourage cross-lingual learning, then fine-tuned on task-specific English data. XTREME then measures the models’ zero-shot cross-lingual transfer performance, that is, their performance on other languages for which no task-specific training data was provided.
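
The evaluation protocol described above can be illustrated with a minimal Python sketch. This is not XTREME’s actual code: the function names, the majority-label “model” standing in for a fine-tuned multilingual encoder, and the tiny sentiment examples are all hypothetical, chosen only to show the shape of training on English and scoring on other languages.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (text, label)
Model = Callable[[str], str]   # maps a text to a predicted label


def train_majority_baseline(english_train: List[Example]) -> Model:
    """Hypothetical stand-in for fine-tuning: always predict the most common English label."""
    majority_label = Counter(label for _, label in english_train).most_common(1)[0][0]
    return lambda text: majority_label


def zero_shot_eval(
    train_fn: Callable[[List[Example]], Model],
    english_train: List[Example],
    test_sets: Dict[str, List[Example]],
) -> Dict[str, float]:
    """Train on English data only, then report accuracy per target language."""
    model = train_fn(english_train)  # no target-language labels are used here
    scores = {}
    for lang, examples in test_sets.items():
        correct = sum(model(text) == label for text, label in examples)
        scores[lang] = correct / len(examples)
    return scores


# Toy, made-up data: English training set, German and Swahili test sets.
english_train = [("good", "pos"), ("great", "pos"), ("bad", "neg")]
test_sets = {
    "de": [("gut", "pos"), ("schlecht", "neg")],
    "sw": [("nzuri", "pos")],
}
scores = zero_shot_eval(train_majority_baseline, english_train, test_sets)
```

In a real setup, `train_fn` would fine-tune a pretrained multilingual model on the English task data, and each entry in `test_sets` would be a task’s test split for one target language.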

In experiments with state-of-the-art pretrained multilingual models such as mBERT, XLM, XLM-R, and M4, the researchers found that performance differences were most pronounced on syntactic and sentence retrieval tasks. While the multilingual models approached human-level performance on many tasks in English and did reasonably well on languages in the Indo-European family, they struggled with Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.

“Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer,” Ruder and co-author Melvin Johnson wrote in a Google blog post on the paper.
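
One simple way to quantify such a gap is to subtract the average score on the other languages from the English score. The sketch below assumes that definition, which the article itself does not spell out; the function name and the scores are hypothetical and are not results from the paper.

```python
from statistics import mean
from typing import Dict


def transfer_gap(scores: Dict[str, float], source_lang: str = "en") -> float:
    """English (source) score minus the mean score over all other languages."""
    others = [score for lang, score in scores.items() if lang != source_lang]
    if not others:
        raise ValueError("need at least one target language besides the source")
    return scores[source_lang] - mean(others)


# Made-up per-language accuracies for a single task.
example_scores = {"en": 0.80, "de": 0.70, "sw": 0.50}
gap = transfer_gap(example_scores)  # roughly 0.2 for these illustrative numbers
```

A larger value means the model loses more when moving from English to the other languages; a value near zero would indicate that transfer is close to lossless on that task.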

Advanced techniques developed for English-language applications have dominated most of the recent and impressive NLP breakthroughs. Building on cross-lingual deep contextual representations, the researchers believe this new work can contribute to improving NLP performance for the 80 percent of humans who speak languages other than English.

The paper XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization is on arXiv, and the code is on GitHub.


Journalist: Yuan Yuan | Editor: Michael Sarazen
