Generalization is one of the primary goals in contemporary machine learning research and is regarded as a pathway to artificial general intelligence. Although today’s pretrained large language models (LMs) continue to push the state-of-the-art in natural language processing (NLP), most such models target specific problem classes and suffer significant performance drops when applied to new tasks. Is it possible to pretrain language models that will work well across many diverse tasks?
A Google Research/Brain team addresses this question in the new paper Unifying Language Learning Paradigms, proposing UL2, a framework for pretraining universal language models that are effective across many different tasks. Their 20B-parameter model surpasses the state-of-the-art 175B-parameter GPT-3 on the zero-shot SuperGLUE benchmark and triples the performance of T5-XXL on one-shot summarization tasks.
The UL2 framework aims at building a universally applicable language model that is consistently effective across various types of datasets, tasks, and setups. UL2 is driven by Mixture-of-Denoisers (MoD), a novel pretraining objective that integrates diverse pretraining paradigms to enable a single model to maintain strong performance across different tasks.
MoD employs three main paradigms during pretraining: R-Denoiser, a regular span-corruption denoiser that is better at acquiring knowledge than at learning to generate fluent text; S-Denoiser, a sequential denoiser designed for cases where a strict sequential order can be observed when framing input-to-target tasks; and X-Denoiser, an extreme denoiser adopted when the model must recover a large part of the input given only a small to moderate part of it. A novel mode-switching feature enables dynamic mode switching via discrete prompting, such that the model can switch between the R, S and X denoisers on demand when learning downstream tasks.
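To make the three paradigms concrete, here is a minimal sketch of how MoD-style pretraining examples might be constructed. The sentinel format, span lengths, and corruption rates below are simplified assumptions for illustration, not the authors' exact implementation; only the overall pattern (short-span corruption for R, prefix language modeling for S, long-span/high-rate corruption for X, and a paradigm token prepended for mode switching) follows the paper's description.

```python
import random

SENTINEL = "<extra_id_{}>"  # T5-style sentinel format (assumed for illustration)

def corrupt_spans(tokens, mean_span_len, corruption_rate, rng):
    """Mask random spans; return (corrupted input, denoising target)."""
    budget = max(1, int(len(tokens) * corruption_rate))  # tokens to corrupt
    inp, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        if budget > 0 and rng.random() < corruption_rate:
            span = max(1, int(rng.gauss(mean_span_len, 1)))
            span = min(span, budget, len(tokens) - i)
            s = SENTINEL.format(sid)
            inp.append(s)                      # sentinel replaces the span
            tgt.append(s)
            tgt.extend(tokens[i:i + span])     # target reproduces the span
            i += span
            budget -= span
            sid += 1
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

def mod_example(tokens, mode, rng):
    """Build one pretraining example for denoiser mode R, S or X."""
    if mode == "R":    # regular: short spans, low corruption rate
        inp, tgt = corrupt_spans(tokens, mean_span_len=3,
                                 corruption_rate=0.15, rng=rng)
    elif mode == "X":  # extreme: long spans and/or high corruption rate
        inp, tgt = corrupt_spans(tokens, mean_span_len=12,
                                 corruption_rate=0.5, rng=rng)
    elif mode == "S":  # sequential: prefix language modeling
        split = max(1, len(tokens) // 2)
        inp, tgt = tokens[:split], tokens[split:]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The paradigm token makes mode switching via discrete prompting possible:
    # downstream, the same token selects the desired behavior.
    return [f"[{mode}]"] + inp, tgt
```

For example, `mod_example(text.split(), "S", rng)` yields a prefix of the text as input and the remainder as the generation target, while `"R"` and `"X"` yield sentinel-masked inputs at different corruption severities.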
In their empirical study, the team conducted extensive experiments on diverse tasks ranging from supervised fine-tuning to prompt-based in-context few-shot learning. In these evaluations, the proposed UL2 outperformed a T5 baseline by 43.6 percent and a GPT-like baseline by 76.1 percent. The team also scaled UL2 to 20B parameters and ran the model on 50+ NLP tasks, where it achieved state-of-the-art performance on the vast majority of tasks and setups. In zero/few-shot experiments, UL2 surpassed the 175B-parameter GPT-3 on the zero-shot SuperGLUE benchmark.
Author: Hecate He | Editor: Michael Sarazen