Driven by the development of powerful machine learning models and the availability of large-scale web-mined datasets, the performance of academic and commercial machine translation (MT) systems has improved significantly in recent years. These systems, however, are generally restricted to fewer than 100 mainstream languages, a small fraction of the more than 7,000 languages spoken globally today.
In his influential 2004 Wired article The Long Tail and subsequent book, Chris Anderson argues that the combined appeal of many niche products could eclipse that of the bestselling books and blockbuster movies that dominate the market. The ever-deepening libraries of today’s online booksellers and music and video streaming platforms seem to have confirmed this. Could we see a similar trend emerging in MT?
A Google Research team takes inspiration from the long-tail theory in their new paper Building Machine Translation Systems for the Next Thousand Languages, which proposes a practical MT system that can translate over 1,000 languages.
The team summarizes their study’s aims and contributions as:
- Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pretraining for language identification and developing data-driven filtering techniques.
- Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages.
- Studying the limitations of evaluation metrics for these languages and conducting a qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models.
The team notes that despite high speaker populations, many languages spoken in Africa, South and South-East Asia and indigenous languages of the Americas remain relatively under-served by today’s MT systems, which tend to focus on European tongues. Google Translate, for example, supports Maltese, Icelandic, and Corsican, each with fewer than 1M first-language (L1) speakers, but not Bhojpuri (~51M speakers), Oromo (~24M speakers), or Quechua (~9M speakers).
To address this underrepresentation, the researchers first build monolingual web text corpora for such languages. They scale LangID (language identification) models to 1000+ languages by leveraging traditional n-gram models and semi-supervised learning approaches, then use these LangID models to identify and extract long-tail (aka low-resource) language data from the web.
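To make the n-gram side of this concrete, the following is a minimal sketch of how a character n-gram LangID classifier can work, using naive Bayes over smoothed trigram counts. The class and training data here are illustrative stand-ins, not the paper's actual models, which are trained at web scale and combined with semi-supervised learning:

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padded with spaces at the edges."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLangID:
    """Toy naive-Bayes language identifier over character n-gram counts."""

    def __init__(self, n=3):
        self.n = n
        self.models = {}  # lang -> Counter of n-gram counts
        self.totals = {}  # lang -> total n-gram count

    def train(self, lang, texts):
        counts = self.models.setdefault(lang, Counter())
        for text in texts:
            counts.update(char_ngrams(text, self.n))
        self.totals[lang] = sum(counts.values())

    def score(self, lang, text):
        counts, total = self.models[lang], self.totals[lang]
        vocab = len(counts) + 1
        # Add-one smoothed log-likelihood of the text under this language.
        return sum(
            math.log((counts[g] + 1) / (total + vocab))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        return max(self.models, key=lambda lang: self.score(lang, text))


detector = NgramLangID()
detector.train("en", ["the quick brown fox jumps over the lazy dog",
                      "this is a simple english sentence"])
detector.train("es", ["el rapido zorro salta sobre el perro perezoso",
                      "esta es una frase sencilla en espanol"])
print(detector.predict("this is another english sentence"))
```

Mined web pages whose text scores confidently under one language model can then be routed into that language's monolingual corpus; the paper's filtering techniques address the many ways this simple picture fails at scale (boilerplate, mixed-language pages, look-alike languages).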
With this mined monolingual data at hand, the team then builds general-domain MT models by exploiting the parallel data available for higher resource languages. The team refers to this setup as zero-resource since no direct supervision is available for the long-tail languages. To boost the quality of zero-resource translation for long-tail languages, the researchers leverage recently developed MT techniques such as self-supervised learning from monolingual data, massively multilingual supervised training, large-scale back-translation and self-training, and high capacity models.
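Two of those techniques, back-translation and self-training, share the same basic mechanics: manufacture synthetic parallel pairs from monolingual text. The sketch below shows that data-generation step in isolation; the `reverse_model`, `forward_model`, and `confidence` callables are hypothetical stand-ins for trained MT models, not anything from the paper:

```python
def back_translate(reverse_model, target_monolingual):
    """Create synthetic (source, target) pairs by translating target-side
    monolingual sentences back into the source language. The target side
    is genuine text, so it serves as a clean training signal."""
    return [(reverse_model(t), t) for t in target_monolingual]


def self_train(forward_model, source_monolingual, confidence, threshold=0.5):
    """Create synthetic pairs from source-side monolingual sentences,
    keeping only translations the model itself scores as confident."""
    pairs = []
    for s in source_monolingual:
        t = forward_model(s)
        if confidence(s, t) >= threshold:
            pairs.append((s, t))
    return pairs


# Toy stand-ins: a "reverse model" that reverses word order, a "forward
# model" that uppercases, and a length-based confidence score.
rev = lambda t: " ".join(reversed(t.split()))
fwd = lambda s: s.upper()
conf = lambda s, t: 1.0 if len(s) > 3 else 0.0

synthetic = back_translate(rev, ["hola mundo"])
kept = self_train(fwd, ["hi", "hello"], conf)
print(synthetic, kept)
```

In practice these synthetic pairs are mixed with the supervised high-resource parallel data and the process is iterated, with each round's improved models producing cleaner synthetic data for the next.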
In their empirical study, the researchers used their models to translate English sentences into 38 long-tail languages to evaluate their zero-resource translation capability, measuring performance with the character-level chrF metric (Popović, 2015) and human evaluations. They observed significant quality improvements, confirming the effectiveness of the proposed approach for building practical and effective MT systems for long-tail languages.
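chrF is attractive for long-tail languages because it scores overlap at the character level rather than the word level, which is more forgiving of rich morphology and inconsistent tokenization. Below is a minimal sketch of the metric's core idea: averaged character n-gram precision and recall combined into an F-score with recall weighted by β = 2. Reference implementations (e.g. in sacreBLEU) differ in details such as whitespace handling and word-order components, so treat this as illustrative only:

```python
from collections import Counter


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score (chrF), averaged over n = 1..max_n.
    Whitespace is stripped here for simplicity."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_grams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_grams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        # Counter & Counter keeps the minimum count per n-gram (clipping).
        overlap = sum((hyp_grams & ref_grams).values())
        if hyp_grams:
            precisions.append(overlap / sum(hyp_grams.values()))
        if ref_grams:
            recalls.append(overlap / sum(ref_grams.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf("the cat sat", "the cat sat"))  # identical strings score 1.0
```

For the very lowest-resource languages even chrF can mislead, which is why the paper pairs automatic scores with human evaluation and qualitative error analysis.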
The paper Building Machine Translation Systems for the Next Thousand Languages is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.