The November release of ChatGPT garnered unprecedented public and media attention. OpenAI’s conversational large language model (LLM) was widely applauded for its ability to answer complex queries, generate correct computer code and coherent long-form essays, and even solve math problems. But might that last claim have been premature?
In the new paper Mathematical Capabilities of ChatGPT, a research team from the University of Oxford, TU Wien, University of Cambridge, University of Vienna, and Princeton University tests ChatGPT’s mathematical capabilities on publicly available and hand-crafted datasets and evaluates its suitability as an assistant to professional mathematicians. The team concludes that despite the glowing media reviews, ChatGPT’s mathematical abilities “are significantly below those of an average mathematics graduate student.”
The team summarizes their main contributions as follows:
- We provide insight into ChatGPT’s mathematical use, showing for which types of questions and which domains of mathematics it may be useful and how it could be integrated into a mathematician’s workflow.
- We identify ChatGPT’s failure modes and the limits of its capabilities, which can aid future efforts to develop LLMs that perform better in mathematics.
- We provide benchmarks for testing the mathematical capabilities of future LLMs so that they can be compared to ChatGPT across multiple dimensions of advanced mathematical comprehension.
To effectively evaluate ChatGPT on advanced math problems, the researchers build a new dataset, GHOSTS, comprising a total of 728 prompts in six carefully crafted subdatasets: Grad-Text, Holes-in-Proofs, Olympiad-Problem-Solving, Symbolic-Integration, MATH, and Search-Engine-Aspects. The researchers say the GHOSTS subdatasets surpass publicly available benchmark mathematical datasets in terms of sophistication and reasoning difficulty.
The researchers use LaTeX to encode the mathematical inputs for most of their subdatasets, which span four levels of ascending difficulty: 1) elementary arithmetic problems, 2) symbolic problems, 3) (under)graduate-level exercises from well-known textbooks and questions from math.stackexchange.com, and 4) exercises in the style of Mathematical Olympiad problems.
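As an illustration, a prompt in the style of the (under)graduate-level exercises described above might be encoded in LaTeX along the following lines. This is a hypothetical example constructed for this article, not one drawn from the GHOSTS dataset:

```latex
% Hypothetical LaTeX-encoded prompt in the style of an undergraduate
% analysis exercise; not taken from the GHOSTS dataset.
Let $f \colon [0,1] \to \mathbb{R}$ be continuous with
$\int_0^1 f(x)\,\mathrm{d}x = 0$.
Prove that there exists $c \in (0,1)$ such that $f(c) = 0$.
```

Encoding prompts this way lets the model receive unambiguous mathematical notation, the same format in which working mathematicians typically write.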
The team applied ChatGPT to the GHOSTS dataset and rated its outputs, considering output length, the stability of answers under prompt engineering, and how close they judged ChatGPT’s responses to be to the correct answers.
ChatGPT failed on most of the problems, faring especially poorly on questions requiring deep insights and original solutions such as those found in the Mathematical Olympiads. The paper concludes that while ChatGPT can effectively search for mathematical objects when given information about them, it struggles with advanced mathematics and delivering consistent, high-quality proofs or calculations.
The team hopes their work will inspire other professional mathematicians to contribute to building a more thorough benchmark for assessing and improving LLMs’ mathematical abilities.
Just one day before this paper was published, OpenAI announced it had upgraded ChatGPT with improved mathematical capabilities. It’s unclear how this latest version would perform in the experiments presented here.
The GHOSTS dataset will be released on the project’s GitHub. The paper Mathematical Capabilities of ChatGPT is on arXiv.
Author: Hecate He | Editor: Michael Sarazen