Google ‘mT5’ Pretrained Text-to-Text Transformer Achieves SOTA Performance on Multilingual Benchmarks

Google researchers recently introduced mT5, a multilingual variant of the tech giant’s “Text-to-Text Transfer Transformer” (T5), pretrained on a new Common Crawl-based dataset covering 101 languages. As discussed in the Synced article Google T5 Explores the Limits of Transfer Learning, T5 combines a unified text-to-text format with large scale to attain state-of-the-art results across a wide variety of English-language NLP tasks.
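To make the "unified text-to-text format" concrete, here is a minimal Python sketch of how different tasks can be cast as plain string-to-string pairs, so that one model and one training objective can serve all of them. The class name, helper functions, and the exact prefix strings are illustrative assumptions rather than the preprocessing code released with T5 or mT5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One example: the model reads `source` and is trained to emit `target`."""
    source: str
    target: str


def translation_example(src_lang: str, tgt_lang: str, text: str,
                        translation: str) -> TextToTextExample:
    # The task is named in a plain-text prefix (hypothetical wording).
    return TextToTextExample(f"translate {src_lang} to {tgt_lang}: {text}",
                             translation)


def nli_example(premise: str, hypothesis: str, label: str) -> TextToTextExample:
    # A classification label is emitted as a word, not as a class index.
    return TextToTextExample(
        f"nli hypothesis: {hypothesis} premise: {premise}", label)


def summarization_example(document: str, summary: str) -> TextToTextExample:
    return TextToTextExample(f"summarize: {document}", summary)


# Three different tasks, one shared input/output representation.
examples = [
    translation_example("English", "German", "That is good.", "Das ist gut."),
    nli_example("A man is sleeping.", "A person is awake.", "contradiction"),
    summarization_example("mT5 is a multilingual variant of T5.",
                          "mT5 is multilingual T5."),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task reduces to "text in, text out", adding a new task in this sketch only means writing another small formatting function; the model interface itself would not change.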

Current natural language processing (NLP) pipelines usually rely on transfer learning, in which models are first pretrained on data-rich tasks and then fine-tuned on a downstream task of interest. Pretrained models such as T5 contribute to the success of this paradigm through the release of parameter checkpoints, which let NLP practitioners quickly attain strong performance on many tasks without performing expensive pretraining themselves. Most of these language models, however, were pretrained solely on English-language text, which the researchers say limits their use for the roughly 80 percent of the world’s population that does not speak English. The NLP community has responded by developing multilingual models pretrained on a mixture of many languages, such as mBERT and mBART.

Google’s massively multilingual mT5 builds on this approach. The goal was to produce a massively multilingual model that deviates as little as possible from the recipe used to create T5. mT5 inherits and benefits from T5’s general-purpose text-to-text format, its design based on insights from large-scale empirical studies, and its scale. It was trained on mC4, a natural-text dataset covering 101 languages that was purpose-built as a multilingual variant of C4, the dataset of some 750GB of English-language text sourced from Common Crawl.

In their evaluations, the researchers compared mT5 with related models such as mBERT, XLM and XLM-R on five tasks from the XTREME multilingual benchmark. The largest of the proposed models, mT5-XXL, reached SOTA performance on all of them.

The new work shows that T5’s strengths carry over to the multilingual setting, where the model achieves SOTA performance on a diverse set of tasks. The results underscore the importance of model capacity in cross-lingual representation learning, suggesting that scaling up a simple pretraining recipe could serve as a viable alternative to more technically complex current approaches.

The paper mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer is on arXiv. The associated code and model checkpoints are available on the project GitHub.


Analyst: Yuqing Li | Editor: Michael Sarazen

