A Google research team recently published the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, introducing a novel “Text-to-Text Transfer Transformer” (T5) neural network model built on a framework that casts every language problem in a text-to-text format. The T5 model demonstrated state-of-the-art performance on the GLUE, SQuAD, and CNN/Daily Mail datasets, and scored an impressive 88.9 on the SuperGLUE language benchmark, just a fraction short of the 89.8 human baseline.
Paper Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code. (arXiv)
Synced invited Samuel R. Bowman, an Assistant Professor at New York University who works on artificial neural network models for natural language understanding, to share his thoughts on the “Text-to-Text Transfer Transformer” (T5) framework.
How would you describe the “Text-to-Text Transfer Transformer” (T5) framework?
T5 is an extremely large new neural network model that is trained on a mixture of unlabeled text (the authors’ huge new C4 collection of English web text) and labeled data from popular natural language processing tasks, then fine-tuned individually for each of the tasks that the authors aim to solve. It works quite well, setting the state of the art on many of the most prominent text classification tasks for English, and several additional question-answering and summarization tasks as well.
The most obvious new idea behind this work is that it is a text-to-text model: During training, the model is asked to produce new text as an output, even for training tasks that would normally be modeled as classification and regression tasks with much simpler kinds of output. However, this idea seems to have been chosen out of engineering convenience, and there’s no evidence that it’s necessary for the results they’ve seen. Instead, what makes this work successful (and impressive) is that the authors took many of the best ideas from a number of recent works in NLP and did an extremely good job at rigorously testing and refining each idea as they added it. This yielded both a very well-tuned model and a lot of insights into the fine-grained design decisions that go into training a large general-purpose neural network on language.
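To make the text-to-text idea concrete, here is a minimal sketch of how classification, regression, and generation examples could all be turned into plain (input text, target text) pairs. The task names, prefixes, and dictionary fields are assumptions chosen for illustration and are not taken from the T5 code.

```python
# Illustrative sketch only: the prefixes and field names below are
# assumptions made for this example, not a specification of T5.

def to_text_pair(task, example):
    """Return (input_text, target_text) for one labeled example."""
    if task == "sentiment":
        # Classification: the class label becomes a word to generate.
        target = "positive" if example["label"] == 1 else "negative"
        return "sentiment: " + example["sentence"], target
    if task == "similarity":
        # Regression: the numeric score becomes a short digit string.
        text = ("similarity: sentence1: " + example["sentence1"]
                + " sentence2: " + example["sentence2"])
        return text, f"{example['score']:.1f}"
    if task == "summarize":
        # Generation: the target is already text.
        return "summarize: " + example["document"], example["summary"]
    raise ValueError(f"unknown task: {task}")
```

With every task in this shape, a single sequence-to-sequence model and a single training loss can serve all of them, which is the engineering convenience described above.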
Why does this research matter?
T5 represents the latest in a sequence of five or ten papers from the last couple of years on the same basic idea: training large neural networks on large amounts of unlabeled text, then fine-tuning them on labeled text for specific tasks. This series of papers has found that by refining and scaling that idea, we can achieve far better performance on most language understanding tasks than the typical NLP researcher might have thought possible.
T5 claims the state of the art on more than twenty established NLP tasks. It’s extremely rare for a single method to yield consistent progress across so many tasks. That list includes most of the tasks in the GLUE and SuperGLUE benchmarks, which have caught on as one of the main measures of progress for applied language understanding work of this kind (and which my group helped to create). On many of these task datasets, T5 is doing as well as human crowdworkers, which suggests that it may be reaching the upper bound on how well it is possible to do on our metrics.
What impact might this work bring to the field?
I suspect that T5 will have a similar practical impact to many of the previous milestones in this line of work, like ELMo, OpenAI GPT, BERT, and RoBERTa: Many engineers who work on language understanding tasks in English will start to use a pretrained T5 model as a starting point for building systems, and many NLP researchers will work to identify and push past the limitations of the T5 model.
Can you identify any bottlenecks in the research?
From the perspective of researchers who want to use T5, its size is a huge obstacle: The full model is more than thirty times the size of established general-purpose NLP models like BERT, and these earlier models were already big enough to be difficult and expensive to use on commodity GPU hardware. I suspect that the existence of this model will help push more users toward Google’s Cloud TPU product, which was used for the experiments in the paper, and I suspect that there will be a great deal of work over the next year to try to reproduce this level of performance in smaller models. (Google’s ALBERT, which also came out recently, suggests some ways that we might make the target-task training process more efficient, and the NLP community has been increasingly interested in test-time model compression to speed up inference and better enable mobile applications.)
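A rough back-of-the-envelope calculation shows why size is such an obstacle. The parameter counts below are assumptions not stated in the interview (about 11 billion for the largest T5 model and about 340 million for BERT-Large), and the estimate covers only the memory needed to store the weights in 32-bit floats, not activations or optimizer state.

```python
# Assumed figures (not stated in the interview): roughly 11 billion
# parameters for the largest T5 model, 340 million for BERT-Large.
T5_PARAMS = 11_000_000_000
BERT_LARGE_PARAMS = 340_000_000

def weight_memory_gb(num_params, bytes_per_param=4):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

size_ratio = T5_PARAMS / BERT_LARGE_PARAMS
print(f"size ratio: {size_ratio:.1f}x")
print(f"T5 weights (fp32): {weight_memory_gb(T5_PARAMS):.1f} GB")
print(f"BERT-Large weights (fp32): {weight_memory_gb(BERT_LARGE_PARAMS):.2f} GB")
```

Under these assumptions the weights alone come to tens of gigabytes, more than a typical single commodity GPU can hold.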
From my perspective as someone interested in evaluation and fair competitions, the success of T5 presents another puzzle. Anyone who spends time using modern NLP systems, including systems built on other recent state-of-the-art networks like RoBERTa, will recognize that these systems are brittle, and often fail in un-human-like ways. Language understanding isn’t solved, even on well-studied languages like English, and even when we use tremendous resources. However, models like T5 are now showing human-level test set performance on most of the tests that the community has been using to measure understanding. If we want to be able to establish fair benchmarks that encourage future progress toward robust, human-like language understanding, we’ll need to get better at creating clean, challenging, and realistic test datasets.
Finally, and most importantly, models trained on large text datasets reliably learn a variety of human biases around things like race, gender, and nationality. For example, our SuperGLUE diagnostic tool (based on Winogender Schemas from Rachel Rudinger) shows that these models will often correctly classify sentences that involve stereotypical situations, like a conversation between a male doctor and a female nurse, but then fail on nearly identical sentences that break those stereotypes, like one with a female doctor and a male nurse. These biases mean that it’s often illegal or unethical to use models like T5 in products, or at least that doing so requires product engineers to do a great deal of difficult and task-specific debiasing work. Finding reliable and task-independent methods for reducing and controlling these biases is a huge open problem for NLP right now.
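The kind of minimal-pair check described above can be sketched in a few lines. This is a hypothetical illustration, not the scoring used by the SuperGLUE diagnostic: `predict` stands in for any sentence classifier, and the pair format is an assumption made for the example.

```python
# Hypothetical sketch: `predict` is any function mapping a sentence to a
# label, and the (stereotypical, anti-stereotypical, gold) pair format
# is an assumption made for this example.

def stereotype_gap(predict, pairs):
    """Accuracy on stereotypical sentences minus accuracy on their
    stereotype-breaking counterparts.

    Both sentences in a pair share the same gold label, so an unbiased
    classifier should score close to 0.0.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    stereo_hits = sum(predict(s) == gold for s, _, gold in pairs)
    anti_hits = sum(predict(a) == gold for _, a, gold in pairs)
    return (stereo_hits - anti_hits) / len(pairs)
```

A gap well above zero means the classifier is right more often when a sentence matches a stereotype than when a nearly identical sentence breaks it.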
Can you predict any potential future developments related to this research?
This is a strange and exciting time to be working on NLP. The progress that we have made over the last few years has made it seem more and more urgent (and less and less crazy) to start talking about what it would mean to solve language understanding. We clearly aren’t there yet, and it’s not possible to completely solve language understanding without building a complete humanlike AI, but new building blocks like transformer neural networks, huge text datasets, and carefully-designed semi-supervised learning methods are starting to seem close. It doesn’t seem like we’ve yet hit the limits of how well these methods can do.
The paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer is on arXiv. Related dataset, pre-trained models, and code are available here.
About Prof. Samuel R. Bowman
Sam Bowman has been an assistant professor at NYU since 2016, when he completed his PhD with Chris Manning and Chris Potts at Stanford. At NYU, Sam is jointly appointed between the new school-level Center for Data Science, which focuses on machine learning, and the Department of Linguistics. He is also a co-PI of the CILVR machine learning lab and an affiliate member of the Courant Institute’s Department of Computer Science. Sam’s research focuses on data, evaluation techniques, and modeling techniques for sentence and paragraph understanding in natural language processing, and on applications of machine learning to scientific questions in linguistic syntax and semantics. Sam organized a twenty-three-person research team at JSALT 2018 and has received a 2015 EMNLP Best Resource Paper Award, a 2017 Google Faculty Research Award, and a 2019 *SEM Best Paper Award.