Today’s large language models (LLMs) have demonstrated game-changing performance across a wide range of tasks and domains, but they have their limits. These weaknesses are surfaced by the Beyond the Imitation Game benchmark (BIG-Bench; Srivastava et al., 2022), which evaluates LLM capabilities on a diverse suite of especially challenging tasks. A 540B-parameter PaLM language model surpasses average human-rater performance on 65 percent of the BIG-Bench tasks. But what about the remainder: are those tasks simply unsolvable by LLMs?
A Google Research and Stanford University team addresses this question in the new paper Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. The team applies chain-of-thought (CoT) prompting — a series of intermediate reasoning steps inspired by the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022b) — to 23 BIG-Bench tasks on which LLMs have failed to match the average human rater. Their strongest resulting model outperforms the human baseline on 17 of the 23 tasks.
The team selected their 23 evaluation tasks, a subset they dub BIG-Bench Hard (BBH), by filtering for BIG-Bench tasks on which state-of-the-art LLMs perform worse than the average human rater, that have not been solved simply by scaling up existing LLMs, and that call for prompting techniques beyond the standard few-shot setup.
In their experiments, the team applied the standard BIG-Bench answer-only prompting setup and the proposed CoT prompting approach on three language model families — Codex, InstructGPT and PaLM — to explore whether and to what extent CoT prompting can improve performance on the 23 BBH tasks.
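To make the contrast between the two setups concrete, here is a minimal sketch of how the prompts differ. The exemplar task and wording below are illustrative, not the paper’s actual prompts: in the answer-only format the few-shot exemplar states the answer directly, while the CoT format inserts intermediate reasoning steps before the answer.

```python
def answer_only_prompt(question: str) -> str:
    """Standard few-shot format: the exemplar gives the answer directly."""
    exemplar = (
        "Q: A coin starts heads up. Alice flips it, then Bob flips it. "
        "Is it still heads up?\n"
        "A: yes\n\n"
    )
    return exemplar + f"Q: {question}\nA:"


def cot_prompt(question: str) -> str:
    """CoT format: the same exemplar, but with reasoning steps spelled out."""
    exemplar = (
        "Q: A coin starts heads up. Alice flips it, then Bob flips it. "
        "Is it still heads up?\n"
        "A: Let's think step by step. The coin starts heads up. "
        "Alice's flip turns it tails up. Bob's flip turns it heads up again. "
        "So the answer is yes.\n\n"
    )
    return exemplar + f"Q: {question}\nA: Let's think step by step."


if __name__ == "__main__":
    q = "A coin starts tails up. Carol flips it. Is it heads up?"
    print(answer_only_prompt(q))
    print(cot_prompt(q))
```

Both prompts would then be sent to the model unchanged; the only difference is whether the exemplars teach the model to emit a bare answer or a worked-out reasoning trace first.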
The results show that conventional answer-only prompting underestimates LLM performance and capabilities on challenging tasks that require multiple reasoning steps: CoT prompting achieves double-digit improvements for all three model families, surpassing the average human-rater score on 10 of the 23 tasks with PaLM, 15 of 23 with InstructGPT, and 17 of 23 with Codex.
The paper also details the effects of CoT prompting on four BBH task categories: algorithmic and multi-step arithmetic reasoning, natural language understanding, use of world knowledge, and multilingual knowledge and reasoning.
The data, prompts, and Codex model outputs are available on the project’s GitHub. The paper Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them is on arXiv.
Author: Hecate He | Editor: Michael Sarazen