Google researchers recently introduced mT5, a multilingual variant of the tech giant’s “Text-to-Text Transfer Transformer” (T5), pretrained on a new Common Crawl-based dataset covering 101 languages. As discussed in the Synced article Google T5 Explores the Limits of Transfer Learning, T5 leverages a unified text-to-text format and massive scale to attain state-of-the-art results across a wide variety of English-language NLP tasks.
Current natural language processing (NLP) pipelines usually rely on transfer learning, wherein models are first pretrained on data-rich tasks before being fine-tuned on a downstream task of interest. Pretrained models such as T5 contribute to the success of this paradigm through the release of parameter checkpoints, making it possible for NLP practitioners to quickly attain strong performance on many tasks without needing to perform expensive pretraining themselves. However, most of these language models were pretrained solely on English-language text, which the researchers say limits their use for the roughly 80 percent of the world’s population that does not speak English. The NLP community has responded by developing multilingual models pretrained on a mixture of many languages, such as mBERT and mBART.
Google’s massively multilingual mT5 advances this approach. The goal was to produce a massively multilingual model that deviates as little as possible from the recipe used to create T5. mT5 inherits and benefits from T5’s general-purpose text-to-text format, its design based on insights from large-scale empirical studies, and its scale. mT5 was trained on mC4, a specially built multilingual variant of the C4 dataset covering natural text in 101 languages; C4 itself comprises some 750GB of English-language text sourced from Common Crawl.
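The text-to-text format that mT5 inherits casts every task as mapping an input string to an output string, so one model and one training objective cover translation, summarization, classification and more. A minimal sketch of how such inputs might be serialized (the “translate English to German:” prefix follows the convention from the T5 paper; the other prefixes and the helper function itself are illustrative assumptions, not the authors’ code):

```python
# Sketch: casting diverse NLP tasks into a single text-to-text format,
# as in T5/mT5. Prefix conventions below are illustrative.

def to_text_to_text(task: str, **fields) -> str:
    """Serialize one task instance as a plain-text model input."""
    if task == "translate":
        return (f"translate {fields['src_lang']} to {fields['tgt_lang']}: "
                f"{fields['text']}")
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "nli":  # natural language inference, e.g. XNLI
        return (f"xnli premise: {fields['premise']} "
                f"hypothesis: {fields['hypothesis']}")
    raise ValueError(f"unknown task: {task}")

# Every task now shares one interface: string in, string out.
example = to_text_to_text("translate", src_lang="English",
                          tgt_lang="German",
                          text="The house is wonderful.")
print(example)  # translate English to German: The house is wonderful.
```

Because targets are also plain text (a translation, a summary, a label name), the same encoder-decoder and loss are reused unchanged across tasks and, in mT5’s case, across languages.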
In their evaluations, the researchers compared mT5 with related models such as mBERT, XLM and XLM-R on five tasks from the XTREME multilingual benchmark. The largest of the proposed models, mT5-XXL, reached SOTA performance on all the tasks.
The new work shows that T5’s strengths can carry over to the multilingual setting, achieving SOTA performance on a diverse set of tasks. The results underscore the importance of model capacity in cross-lingual representation learning, suggesting that scaling up a simple pretraining recipe could be a viable alternative to more technically complex current approaches.
The paper mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer is on arXiv. The associated code and model checkpoints are available on the project GitHub.
Analyst: Yuqing Li | Editor: Michael Sarazen
This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions across 12 pandemic scenarios.