It only takes a single click on the “See Translation” button below a Facebook post or comment to view the content in one’s preferred language. The Facebook News Feed alone delivers 20 billion such translations daily thanks to the social media platform’s high-performance multilingual machine translation (MMT) technologies.
However, although there are thousands of languages spoken today, MMT research has so far had to rely exclusively on English-centric datasets and models. This means, for example, that a Chinese-to-French translation model must train on both Chinese-to-English and English-to-French data. To better preserve meaning and improve accuracy, Facebook AI yesterday open-sourced an MMT model that trains directly on Chinese-to-French and other language-pair data.
Facebook is hailing the new M2M-100 (Many-to-Many) model as a “major milestone” in its years of MMT work and the first AI model that can directly translate between any pair of 100 languages without relying on any English data. In quality evaluations using the popular BLEU (Bilingual Evaluation Understudy) metric, M2M-100 achieved a 10 BLEU point improvement over English-centric multilingual models when translating between non-English languages.
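For readers unfamiliar with the metric: BLEU scores a candidate translation against a reference by combining modified n-gram precision with a brevity penalty. Below is a minimal, simplified sentence-level sketch (single reference, no smoothing) to illustrate the idea — production evaluations use tools such as sacreBLEU at corpus level, not this toy function.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (n = 1..4) times a brevity penalty, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # "Modified" precision: clip each candidate n-gram count
        # by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    # Brevity penalty discourages overly short candidates.
    bp = math.exp(min(0.0, 1 - len(ref) / len(cand)))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 1))  # → 100.0
```

Under this scale, a 10-point gain over a baseline is a substantial quality jump.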
“This English-Centric bias in the data and resulting models is not reflective of how people use translation and empirically leads to lower performance for non-English translation directions,” reads the paper Beyond English-Centric Multilingual Machine Translation.
Current advanced MMT systems can process multiple languages at once, but with English data serving as the bridge between source and target languages, accuracy suffers. So how do you develop a model that can translate from 100 languages to 100 languages — 9,900 translation directions in all? You do so, explains Facebook AI, “by building a large-scale Many-to-Many dataset for 100 languages.”
The researchers grouped the 100 languages into 14 language groups based on linguistic classification, geography, and cultural similarities. A small number of bridge languages were also identified to connect different groups of languages. For instance, Hindi, Bengali, and Tamil serve as bridge languages for many lesser-spoken Indo-Aryan languages. Parallel training data was mined for all combinations involving these bridge languages, employing large-scale mining strategies such as CCAligned, CCMatrix, and LASER. Facebook says the result is the “most diverse many-to-many MMT data set to date: 7.5 billion sentence pairs.”
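The savings from the bridge strategy are easy to see with a toy calculation: every pair within a group is mined, but across groups only pairs involving a bridge language are mined. The groups and bridge choices below are illustrative stand-ins, not Facebook's actual 14 groups.

```python
from itertools import combinations

# Toy language groups (illustrative only — not Facebook's actual grouping).
groups = {
    "indo_aryan": ["hi", "bn", "ta", "mr", "ne"],  # Hindi, Bengali, Tamil, Marathi, Nepali
    "romance":    ["fr", "es", "pt", "it", "ro"],
    "sinitic":    ["zh", "yue", "wuu"],
}
# Bridge languages connect their group to all other groups.
bridges = {"hi", "bn", "ta", "fr", "es", "zh"}

all_langs = [lang for langs in groups.values() for lang in langs]

def mined_pairs(groups, bridges):
    """Pairs to mine: every pair inside a group, plus every
    cross-group pair in which at least one side is a bridge."""
    pairs = set()
    for langs in groups.values():
        pairs.update(combinations(sorted(langs), 2))
    for a, b in combinations(sorted(all_langs), 2):
        if a in bridges or b in bridges:
            pairs.add((a, b))
    return pairs

full = len(list(combinations(all_langs, 2)))   # mining every possible pair
bridged = len(mined_pairs(groups, bridges))    # bridge strategy
print(full, bridged)  # → 78 62
```

Even in this 13-language toy, the bridge strategy skips the rarest cross-group pairs; at 100 languages the reduction from all 4,950 unordered pairs is far larger, which is what makes mining 7.5 billion sentence pairs tractable.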
With the Many-to-Many dataset composed, the next challenge was scaling the number of model parameters. The researchers say they leveraged progress in model parallelism to train models over 50 times larger than current bilingual models. They combined dense and sparse language-specific parameters to scale capacity to 15.4 billion parameters while keeping the models efficiently trainable “on hundreds of GPUs.”
Not everyone can afford to train models on hundreds of GPUs, though, so luckily for resource-challenged researchers Facebook is open-sourcing the model and all the training data.
“There’s a lot of exciting work going on in AI these days,” Facebook CTO Mike Schroepfer tweeted, “but I’m particularly excited about this work. We are getting closer and closer to the ‘universal translator’ from Star Trek and most importantly this work has the biggest impact on languages with the least amount of content on the internet. A good example of how technology can fundamentally lower barrier to access.”
Reporter: Fangyu Cai | Editor: Michael Sarazen
This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions from 12 pandemic scenarios.
Click here to find more reports from us.
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.