Today’s large language models (LLMs) have demonstrated game-changing performance across a wide range of tasks and domains, but they have their limits. These weaknesses are surfaced by the Beyond the Imitation Game benchmark (BIG-Bench; Srivastava et al., 2022), which evaluates LLM capabilities on a diverse suite of especially challenging tasks. A 540B-parameter PaLM language model surpasses average human-rater performance on 65 percent of the BIG-Bench tasks. But what about the remainder: are those tasks simply unsolvable by LLMs?
A Google Research and Stanford University team addresses this question in the new paper Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. The team applies chain-of-thought (CoT) prompting — a series of intermediate reasoning steps inspired by the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022b) — to 23 BIG-Bench tasks on which LLMs have failed to match the average human rater. Their strongest resulting model outperforms the human baseline on 17 of the 23 tasks.
The team selected their 23 evaluation tasks, a subset they dub BIG-Bench Hard (BBH), by filtering for BIG-Bench tasks on which state-of-the-art LLMs perform worse than the average human rater, that have not been solved simply by scaling up existing LLMs, and that call for prompting techniques beyond the standard few-shot setup.
In their experiments, the team applied the standard BIG-Bench answer-only prompting setup and the proposed CoT prompting approach on three language model families — Codex, InstructGPT and PaLM — to explore whether and to what extent CoT prompting can improve performance on the 23 BBH tasks.
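To make the contrast between the two setups concrete, here is a minimal sketch of how the prompts differ. The exemplar task and wording below are illustrative, not the paper’s actual prompts: in the answer-only format the few-shot exemplar states the answer directly, while the CoT format inserts intermediate reasoning steps before the answer.

```python
def answer_only_prompt(question: str) -> str:
    """Standard few-shot format: the exemplar gives the answer directly."""
    exemplar = (
        "Q: A coin starts heads up. Alice flips it, then Bob flips it. "
        "Is it still heads up?\n"
        "A: yes\n\n"
    )
    return exemplar + f"Q: {question}\nA:"


def cot_prompt(question: str) -> str:
    """CoT format: the same exemplar, but with reasoning steps spelled out."""
    exemplar = (
        "Q: A coin starts heads up. Alice flips it, then Bob flips it. "
        "Is it still heads up?\n"
        "A: Let's think step by step. The coin starts heads up. "
        "Alice's flip turns it tails up. Bob's flip turns it heads up again. "
        "So the answer is yes.\n\n"
    )
    return exemplar + f"Q: {question}\nA: Let's think step by step."


if __name__ == "__main__":
    q = "A coin starts tails up. Carol flips it. Is it heads up?"
    print(answer_only_prompt(q))
    print(cot_prompt(q))
```

Both prompts would then be sent to the model unchanged; the only difference is whether the exemplars teach the model to emit a bare answer or a worked-out reasoning trace first.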
The results show that conventional answer-only prompting underestimates LLM performance and capabilities on challenging tasks that require multiple reasoning steps: CoT prompting achieves double-digit improvements for all three model families, surpassing the average human-rater score on 10 of the 23 tasks with PaLM, 15 of 23 with InstructGPT, and 17 of 23 with Codex.
The paper also details the effects of CoT prompting on four BBH task categories: algorithmic and multi-step arithmetic reasoning, natural language understanding, use of world knowledge, and multilingual knowledge and reasoning.
The data, prompts, and Codex model outputs are available on the project’s GitHub. The paper Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them is on arXiv.
Author: Hecate He | Editor: Michael Sarazen