Powered by their ever-increasing scale, today’s large language models have shown breakthrough capabilities beyond natural language processing (NLP), in areas such as writing computer code, diagnosing medical conditions and playing competitive games. As the development and deployment of large-scale language models continues, it is important that the AI community understands their current and near-future capabilities and limitations.
In the new paper Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models, 444 authors from 132 institutions introduce the Beyond the Imitation Game benchmark (BIG-bench), a large-scale, extremely difficult and diverse benchmark comprising 204 tasks, intended to help predict the potentially transformative effects of large language models.

BIG-bench was named in homage to Alan Turing’s imitation game (Turing, 1950) and is designed for analyzing dense and sparse transformer models, such as those from Google and OpenAI, whose scales range from millions to hundreds of billions of parameters. The team summarizes the BIG-bench suite as follows:
- A set of 204 or more language tasks. As reflected in the BIG-bench review criteria, benchmark tasks are novel, cover a diverse range of topics and languages, and are not fully solvable by current models.
- BIG-bench Lite: a small, representative, and canonical subset of tasks that allows for faster evaluation than on the whole benchmark.
- Code that implements the benchmark API, supports task evaluation on publicly available models, and enables lightweight creation of new tasks.
- Detailed evaluation results on dense and sparse language models with sizes that span six orders of magnitude, as well as baseline results established by human evaluators.

BIG-bench supports two types of tasks: JSON (JavaScript Object Notation) and programmatic. A JSON task is defined by a file containing a list of input-target pairs, and performance is evaluated by comparing the model’s outputs against the targets. Programmatic tasks are written in Python and interact with the model directly, evaluating it by generating text continuations for given inputs and by computing the conditional log probabilities of targets given those inputs.
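To make the two formats concrete, below is a minimal, hypothetical sketch in Python: a dictionary mirroring the shape of a JSON task, and a skeletal programmatic task. The field names and the model methods (generate_text, cond_log_prob) are illustrative assumptions based on the description above, not the official BIG-bench schema or API.

```python
# Illustrative sketch only; field and method names are assumptions, not the
# official BIG-bench schema or API.

# 1) JSON-style task: a list of input-target pairs, scored by comparing the
#    model's outputs against the targets.
json_style_task = {
    "name": "simple_arithmetic",                      # hypothetical task name
    "description": "Answer two-digit addition questions.",
    "metrics": ["exact_str_match"],
    "examples": [
        {"input": "Q: What is 17 + 25? A:", "target": "42"},
        {"input": "Q: What is 30 + 12? A:", "target": "42"},
    ],
}

# 2) Programmatic task: Python code that queries the model directly for text
#    continuations and conditional log probabilities of targets given inputs.
class ToyProgrammaticTask:
    """Skeleton only; real tasks implement the benchmark's task API."""

    def evaluate_model(self, model):
        prompt = "Q: What is 17 + 25? A:"
        continuation = model.generate_text(prompt)             # free-form generation
        log_probs = model.cond_log_prob(prompt, ["42", "43"])  # log p(target | input)
        return {
            "exact_str_match": float(continuation.strip() == "42"),
            "cond_log_probs": log_probs,
        }
```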
The BIG-bench task scope ranges from writing code, playing competitive games and common-sense reasoning to social bias, linguistics, software development and beyond. The benchmark is also designed to measure progress well beyond the current state of the art.

The researchers’ experiments with BIG-bench revealed a number of behavioural characteristics of large language models, such as: 1) Aggregate performance improves with model size but remains poor compared with human performance; 2) Model predictions become better calibrated as scale increases; 3) Model classes behave similarly, with benefits from sparsity; 4) Breakthrough behaviour is sensitive to the details of task specification; and 5) Even programmatic measures of model capability can be highly subjective.
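For readers unfamiliar with calibration, the sketch below shows one common way to quantify it, expected calibration error, using made-up confidences and correctness flags; it is illustrative only and is not the specific metric used in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between predicted confidence and observed accuracy,
    weighted by the share of examples falling in each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight the bin's gap by its share of examples
    return ece

# Hypothetical multiple-choice predictions: confidence in the chosen answer
# and whether that answer was correct. A smaller value means better calibration.
print(expected_calibration_error([0.9, 0.6, 0.8, 0.55], [1, 0, 1, 1]))
```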
The team also tackled the thorny topic of social biases in large language models. They observed that biases often increase with scale in settings with broad or ambiguous context, can decrease with scale in settings with narrow, unambiguous context, and can potentially be steered through appropriately chosen prompting.
The team considers BIG-bench a “living benchmark” and will continue to accept new task submissions for peer review on a rolling basis. They hope BIG-bench can help identify additional breakthrough capabilities and enable researchers to better understand the power and potential of current and future large language models.
The BIG-bench project was collaboratively developed on GitHub, where the code is open-sourced. The paper Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
