When it comes to large language models, it turns out that even 1.5 billion parameters is not large enough. While that was the size of the GPT-2 transformer-based language model that OpenAI released to much fanfare last year, today the San Francisco-based AI company outdid itself, announcing the upgraded GPT-3 with a whopping 175 billion parameters.
GPT-3 adopts and scales up the GPT-2 model architecture — including modified initialization, pre-normalization, and reversible tokenization — and shows strong performance on many NLP tasks and benchmarks in zero-shot, one-shot, and few-shot settings.
The OpenAI researchers say that GPT-3 in some cases approaches the performance of SOTA fine-tuned systems, can generate high-quality samples, and shows strong qualitative performance on tasks defined on the fly.
Recent research has demonstrated substantial gains on many NLP tasks and benchmarks through an approach that uses pretraining on a large corpus of text followed by fine-tuning on a specific task. But current AI systems still largely struggle to perform a new language task from only a few examples or from simple natural language instructions describing the tasks.
The researchers show through GPT-3 training that scaling up language models can greatly improve task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior SOTA approaches. GPT-3 can be applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
The researchers evaluated GPT-3 on over two dozen NLP datasets and conducted several novel experiments designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. All evaluations were done under three settings: few-shot learning, one-shot learning, and zero-shot learning.
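To make the three settings concrete, here is a minimal Python sketch of how such a prompt could be assembled from plain text. The function name `build_prompt`, the `=>` separator, and the example word pairs are illustrative assumptions, not the researchers' actual evaluation code. The only thing that changes between settings is the number of demonstrations, k, placed ahead of the query.

```python
from typing import List, Tuple


def build_prompt(instruction: str,
                 demos: List[Tuple[str, str]],
                 query: str,
                 k: int) -> str:
    """Assemble a text prompt containing k in-context demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and
    k > 1 a few-shot prompt. The "input => output" layout is an
    assumed format chosen for illustration.
    """
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and len(demos)")
    lines = [instruction]
    # Each demonstration is a solved example shown to the model as text.
    for source, target in demos[:k]:
        lines.append(f"{source} => {target}")
    # The query is left unanswered so the model completes it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Hypothetical demonstrations for an English-to-French translation task.
demos = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]
instruction = "Translate English to French:"

zero_shot = build_prompt(instruction, demos, "plush giraffe", k=0)
one_shot = build_prompt(instruction, demos, "plush giraffe", k=1)
few_shot = build_prompt(instruction, demos, "plush giraffe", k=3)

print(few_shot)
```

In this sketch no model weights are touched at any point: the task is conveyed entirely by the text of the prompt, which is the property the article describes as applying the model without gradient updates or fine-tuning.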
GPT-3 showed strong performance across many NLP datasets on translation, question-answering, and cloze tasks. It also did well on tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. The new model even generated samples of news articles that human evaluators had difficulty distinguishing from human-written texts.
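Tasks such as 3-digit arithmetic and word unscrambling lend themselves to programmatic generation, which is one way to obtain test items unlikely to appear verbatim in a training corpus. The Python sketch below shows how items of this kind might be produced. The helper names, the question wording, and the fixed random seed are assumptions made for illustration and do not reproduce the researchers' datasets.

```python
import random
from typing import Tuple


def make_addition_example(rng: random.Random) -> Tuple[str, str]:
    """Return a (question, answer) pair for 3-digit addition."""
    a = rng.randint(100, 999)  # both operands have exactly three digits
    b = rng.randint(100, 999)
    return f"What is {a} plus {b}?", str(a + b)


def scramble_word(word: str, rng: random.Random) -> str:
    """Return the letters of `word` in a shuffled order.

    Note: the shuffle may occasionally return the original word;
    this sketch does not filter out that case.
    """
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


rng = random.Random(0)  # fixed seed so runs are repeatable
question, answer = make_addition_example(rng)
scrambled = scramble_word("language", rng)

print(f"Q: {question}")
print(f"A: {answer}")
print(f"Unscramble: {scrambled}")
```

Items generated this way could then be formatted as demonstrations and queries in a zero-shot, one-shot, or few-shot prompt.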
The researchers trained a series of smaller models (ranging from 125 million to 13 billion parameters) to compare their performance against GPT-3 in the three settings. For most tasks, they found relatively smooth scaling with model capacity in all three settings. They also noticed that the gap between zero-shot, one-shot, and few-shot performance often grows with model capacity, which they believe suggests that larger models are more proficient meta-learners.
Although the findings show that even at the scale of the full GPT-3, models still struggle to perform few-shot learning on some tasks, the researchers believe very large language models like GPT-3 will become an important ingredient in the development of adaptable, general language systems.
The paper Language Models are Few-Shot Learners is on arXiv, and more details are available on the project GitHub.
Journalist: Yuan Yuan | Editor: Michael Sarazen