
New Multitask Benchmark Suggests Even the Best Language Models Don’t Have a Clue What They’re Doing

Researchers introduce a test covering topics such as elementary mathematics, designed to measure language models' multitask accuracy.

Transformer-based language models have excelled on natural language processing (NLP) benchmarks thanks to their pretraining on massive text corpora, including all of Wikipedia, thousands of books and countless websites. Although the models are exposed to all that information, researchers remain unsure how well they learn and apply knowledge. In other words, how much do these language models actually understand?

Not a lot, as it turns out.

The recently published paper Measuring Massive Multitask Language Understanding introduces a test covering topics including elementary mathematics, US history, computer science and law, designed to measure language models’ multitask accuracy. The authors, from UC Berkeley, Columbia University, UChicago, and UIUC, conclude that even the top-tier 175-billion-parameter OpenAI GPT-3 language model is a bit daft when it comes to language understanding, especially when it encounters topics in greater breadth and depth than previous benchmarks explored.


To evaluate how well language models can extract useful knowledge from massive corpora to solve problems, the researchers compiled a test of 15,908 questions across 57 diverse topics in STEM, the humanities, and social sciences. (Could the choice of 57 classes be an homage to DeepMind’s pioneering Agent57 deep reinforcement learning agent, which bettered human gamers’ scores in the Atari57 Arcade Learning environment?)

Unlike current benchmarks that measure the commonsense or narrow linguistic understanding underlying the language models, the new test seeks to “measure arbitrary real-world text understanding” and “comprehensively evaluate the breadth and depth of a model’s academic and professional understanding.”

The massive multitask test comprises multiple-choice questions at different levels of difficulty from various branches of knowledge, divided into a few-shot development set, a validation set, and a test set. Each of the 57 subjects contains at least 100 test examples to examine the models in zero-shot and few-shot settings.
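To make the zero-shot and few-shot settings concrete, here is a minimal Python sketch of how a multiple-choice question might be turned into a prompt, optionally preceded by worked examples from the development set. The prompt wording, the function names and the sample questions are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical prompt builder for multiple-choice evaluation.
# The wording and helper names are assumptions made for illustration.
CHOICES = ["A", "B", "C", "D"]


def format_question(question, options, answer=None):
    """Render one question; leave the answer blank for the item under test."""
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, dev_examples, test_question, test_options, k=5):
    """Build a k-shot prompt; k=0 gives the zero-shot setting."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    shots = [format_question(q, opts, ans) for q, opts, ans in dev_examples[:k]]
    query = format_question(test_question, test_options)
    return "\n\n".join([header, *shots, query])


# Made-up sample data, not taken from the benchmark.
dev = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt(
    "elementary mathematics", dev, "What is 2 + 3?", ["4", "5", "6", "7"], k=1
)
print(prompt)
```

The model's reply is then compared with the correct letter; with `k=0` the worked examples are dropped and only the header and the unanswered question remain.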


The language models evaluated were UnifiedQA (built on T5) and GPT-3 in variants with 2.7 billion, 6.7 billion, 13 billion and 175 billion parameters. In the experiments, every model fell short of expert-level performance on every task. The largest GPT-3 model performed best, averaging 43.9 percent accuracy, about 20 percentage points above random chance. Its highest accuracy was 69 percent in the US Foreign Policy question class, while its lowest was 26 percent in College Chemistry, about what random responses would return. For all models, the tasks with near-random accuracy (25 percent) included topics related to human values, such as law and morality, but also, perhaps surprisingly, calculation-heavy subjects such as physics and mathematics.
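The comparison with random chance is simple arithmetic. The short Python sketch below assumes four answer options per question, which is what the 25 percent chance figure implies; the function names are hypothetical, and the accuracy figures are the ones reported above.

```python
# Margin over random guessing, assuming four options per question
# (implied by the 25 percent chance level quoted in the article).
def chance_accuracy(num_options):
    """Expected accuracy, in percent, of uniform random guessing."""
    return 100.0 / num_options


def margin_over_chance(accuracy, num_options=4):
    """Percentage points by which an accuracy exceeds random guessing."""
    return accuracy - chance_accuracy(num_options)


# Figures reported for the largest GPT-3 model.
reported = {
    "Average": 43.9,
    "US Foreign Policy": 69.0,
    "College Chemistry": 26.0,
}

for name, accuracy in reported.items():
    margin = round(margin_over_chance(accuracy), 1)
    print(f"{name}: {accuracy}% is {margin} points above chance")
```

The average works out to 18.9 points above chance, which the article rounds to about 20, while College Chemistry sits just one point above guessing.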

The researchers found that GPT-3 performs poorly on highly procedural problems, and they suspect this is because the model obtains declarative knowledge more readily than procedural knowledge. Compared to verbal subjects, calculation-heavy STEM subjects were more likely to stump GPT-3. For example, while it knows that PEMDAS stands for Parentheses, Exponents, Multiplication, Division, Addition, Subtraction, a common mnemonic for the order of mathematical operations within an expression, it failed to apply this knowledge to calculate the answer to (1 + 1) × 2 = ?
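For reference, applying the rule to the expression above is a one-line procedure. The Python snippet below is only a worked illustration of the order of operations (Python follows the same precedence rules); the variable names are arbitrary.

```python
# Parentheses first: (1 + 1) = 2, then multiplication: 2 * 2 = 4.
with_parentheses = (1 + 1) * 2

# Without parentheses, multiplication precedes addition: 1 + (1 * 2) = 3.
without_parentheses = 1 + 1 * 2

print(with_parentheses, without_parentheses)
```

Reciting the mnemonic is declarative knowledge; getting 4 rather than 3 requires carrying out the procedure.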

The team notes, “worryingly,” that “GPT-3 does not have an accurate sense of what it does or does not know since its average confidence can be up to 24% off from its actual accuracy.” That echoes an earlier tweet from New York University Associate Professor and AI researcher Julian Togelius: “GPT-3 often performs like a clever student who hasn’t done their reading trying to bullshit their way through an exam. Some well-known facts, some half-truths, and some straight lies, strung together in what first looks like a smooth narrative.”
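The calibration gap the researchers describe, the distance between a model's average confidence and its actual accuracy, can be sketched in a few lines of Python. This is a simplified illustration, not the paper's evaluation code: the function name is hypothetical and the confidence values and outcomes are made up.

```python
# Simplified calibration check: average confidence versus accuracy.
# A well-calibrated model would show a gap close to zero.
def calibration_gap(confidences, correct):
    """Return average confidence minus accuracy, both as fractions in [0, 1]."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    avg_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return avg_confidence - accuracy


# Made-up data: the probability the model gave its chosen answer,
# and whether that answer was right.
confidences = [0.9, 0.8, 0.7, 0.6]
correct = [True, False, False, True]

gap = calibration_gap(confidences, correct)
print(f"Confidence exceeds accuracy by {gap:.0%}")
```

In this toy case the model is 75 percent confident on average but right only half the time, so it is overconfident by 25 points, a gap of the same order as the 24 percent figure quoted above.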


The paper Measuring Massive Multitask Language Understanding is on arXiv.

Reporter: Fangyu Cai | Editor: Michael Sarazen



