Today’s large language models (LLMs) have moved beyond simple language tasks, demonstrating impressive in-context question-answering capabilities informed by novel user-initiated prompting techniques. However, these models cannot assess the accuracy of their own responses, and, as Synced previously reported, they tend to struggle with math word problems and reasoning tasks.
In the new paper MathPrompter: Mathematical Reasoning Using Large Language Models, a Microsoft Research team presents MathPrompter, a novel approach that leverages chain-of-thought (CoT) prompting techniques to improve LLM performance on mathematical reasoning problems and increase confidence in their predictions.
This work was inspired by how humans address math questions: breaking the problem into multiple steps and employing various methods to validate each step. To have an LLM mimic this process when solving such reasoning tasks, the researchers turned to zero-shot CoT prompting techniques.
The MathPrompter pipeline comprises four steps. Given a question: 1) An algebraic template is generated, replacing the numerical entries with variables; 2) The LLM is fed multiple math prompts that each solve the generated algebraic expression analytically in a different way; 3) The analytical solutions are evaluated by assigning multiple random values to the variables in the algebraic expression; and 4) A statistical significance test is applied to the solutions of the analytical functions to find a “consensus” and derive the final solution.
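The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the team's implementation: the function names (`make_template`, `agree_on_random_inputs`, `solve`), the sample question and the hard-coded candidate solutions are all assumptions made for the example. The lambdas stand in for answers an LLM might return, and a simple majority vote stands in for the statistical significance test.

```python
import random
import re
from collections import Counter


def make_template(question):
    """Step 1: replace each integer in the question with a variable (A, B, C, ...)."""
    values = {}

    def substitute(match):
        name = chr(ord("A") + len(values))
        values[name] = int(match.group())
        return name

    template = re.sub(r"\d+", substitute, question)
    return template, values


def agree_on_random_inputs(candidates, names, trials=5, seed=0):
    """Step 3: evaluate every candidate on random variable values and compare results."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = {name: rng.randint(1, 100) for name in names}
        results = {round(f(**args), 6) for f in candidates}
        if len(results) > 1:
            return False  # the candidates disagree, so this set is not trusted
    return True


def solve(question, candidate_sets):
    """Steps 1-4, with a majority vote as a simplified stand-in for the consensus test."""
    template, values = make_template(question)
    answers = []
    for candidates in candidate_sets:
        if agree_on_random_inputs(candidates, list(values)):
            # Candidates agree, so plug the original numbers back in.
            answers.append(candidates[0](**values))
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical question, used here only for illustration.
QUESTION = (
    "At a restaurant, each adult meal costs $5 and kids eat free. "
    "If a group of 15 people came in and 8 were kids, "
    "how much would it cost for the group to eat?"
)

# Step 2 (assumed): each inner list holds solutions an LLM might return
# for one run of the differently worded math prompts.
CANDIDATE_SETS = [
    [lambda A, B, C: A * (B - C), lambda A, B, C: A * B - A * C],
    [lambda A, B, C: A * B - C, lambda A, B, C: A * (B - C)],  # inconsistent run
    [lambda A, B, C: A * (B - C), lambda A, B, C: A * B - A * C],
]

print(make_template(QUESTION)[0])
print(solve(QUESTION, CANDIDATE_SETS))
```

In this sketch the second run is discarded because its two candidates return different values on random inputs, and the remaining runs agree on the final answer. A real pipeline would obtain the candidates from model outputs rather than from hand-written lambdas.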
MathPrompter thus leverages solution-verification approaches such as those used by humans (compliance with known results, multi-verification, cross-checking and compute verification) to increase confidence in its generated answers.
In their empirical study, the team evaluated MathPrompter on MultiArith, a dataset of math problems designed to test machine learning models’ abilities on complex arithmetic operations and reasoning.
In the experiments, a 175B-parameter MathPrompter model achieved 92.5 percent accuracy, a significant improvement over the 78.7 percent score of a SOTA model of similar size. MathPrompter also achieved performance comparable with SOTA few-shot-CoT models and a larger zero-shot-CoT PaLM model with 540B parameters, demonstrating its ability to significantly improve LLM performance in both zero-shot and few-shot settings. The team’s future plans include incorporating additional prompts into MathPrompter and testing its performance on other datasets.
The paper MathPrompter: Mathematical Reasoning Using Large Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen