
Do NLP Models Cheat at Math Word Problems? Microsoft Research Says Even SOTA Models Rely on Shallow Heuristics

A Microsoft research team provides concrete evidence that existing NLP models cannot robustly solve even the simplest math word problems (MWPs), suggesting that the hope they might capably handle one-unknown arithmetic MWPs is untenable.

“Yoshua recently turned 57. He is three years younger than Yann. How old is Yann?” Solving such a math word problem (MWP) requires understanding the short natural language narrative describing a state of the world and then reasoning out the underlying answer. A child could likely figure this one out, and recent natural language processing (NLP) models have also shown an ability to achieve reasonably high accuracy on MWPs.

A Microsoft Research team recently took a closer look at just how NLP models do this, with surprising results. Their study provides “concrete evidence” that existing MWP solvers tend to rely on shallow heuristics to achieve their high performance, and questions these models’ capabilities to robustly solve even the simplest of MWPs.


MWP tasks can be challenging as they require the machine to extract relevant information from natural language text and perform mathematical operations or reasoning to find the solution. MWPs come in many varieties, the simplest being the “one-unknown” problems involving arithmetic operators (+, −, ∗, /).

Examples of one-unknown arithmetic word problems
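
To make "one-unknown arithmetic" concrete: the answer to such a problem is the value of a single expression built from the quantities in the text and the four operators. The sketch below evaluates an expression written in prefix notation. The prefix format and the name `eval_prefix` are assumptions chosen for illustration and are not taken from the paper.

```python
import operator

# The four binary operators allowed in one-unknown arithmetic MWPs.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_prefix(tokens):
    """Evaluate a prefix-notation expression given as a list of string tokens."""
    def helper(pos):
        tok = tokens[pos]
        if tok in OPS:
            # An operator is followed by its left and right operands.
            left, pos = helper(pos + 1)
            right, pos = helper(pos)
            return OPS[tok](left, right), pos
        # Anything else is treated as a number.
        return float(tok), pos + 1

    value, end = helper(0)
    if end != len(tokens):
        raise ValueError("trailing tokens in expression")
    return value

# "Yoshua recently turned 57. He is three years younger than Yann."
# Yann's age is therefore 57 + 3.
yann_age = eval_prefix(["+", "57", "3"])
```

A solver's job, under this framing, is to map the narrative to the right expression; evaluating it afterwards is the easy part.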

Researchers have recently begun applying machine learning to more complex MWPs such as multiple-unknown linear word problems and those concerning geometry and probability. This avenue of research is based on the assumption that one-unknown arithmetic word problems are a cinch for machines. The paper Are NLP Models Really Able to Solve Simple Math Word Problems? takes the opposite view and proposes a new challenge dataset to support it.

The paper authors summarize the work’s main contributions as:

  • Show that the majority of problems in benchmark datasets can be solved by shallow heuristics lacking word-order information or lacking question text.
  • Create a challenge set called SVAMP for more robust evaluation of methods developed to solve elementary level math word problems.

The researchers first conducted experiments to show the deficiencies of SOTA MWP solvers. They chose two benchmark datasets, MAWPS and ASDiv-A, and considered three models: Seq2Seq, which consists of a bidirectional LSTM encoder and an LSTM decoder with attention; GTS, which uses an LSTM encoder and a tree-based decoder; and Graph2Tree, which combines a graph-based encoder with a tree-based decoder. The team also removed the questions from the problems, so that each problem in the test set comprised only the scenario’s explanatory text, and reported results for models with RoBERTa pretrained embeddings.

5-fold Cross Validation Accuracies of baseline models on datasets. (R) means that the model is provided with RoBERTa pre-trained embeddings while (S) means that the model is trained from scratch.
5-fold Cross Validation Accuracies of baseline models on Question-removed datasets

Graph2Tree with RoBERTa pretrained embeddings achieved the highest score, 88.7 percent accuracy for MAWPS and 82.2 percent accuracy for ASDiv-A. After removing the questions, the best performing model was still Graph2Tree, which achieved accuracies of 64.4 percent on ASDiv-A and 77.7 percent on MAWPS. The results show these MWP solvers could often answer an MWP correctly without even looking at the question, suggesting they relied on simple heuristics in the text body to predict their answers.
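
The question-removal setup can be sketched roughly as follows. This is a minimal illustration and not the authors' preprocessing code: it assumes that the question is any sentence ending in a question mark, and the function name `remove_question` is hypothetical.

```python
import re

def remove_question(problem: str) -> str:
    """Return the problem body with every sentence ending in '?' dropped."""
    # Split after sentence-ending punctuation that is followed by whitespace,
    # so decimals such as "3.5" stay intact.
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    body = [s for s in sentences if not s.endswith("?")]
    return " ".join(body)

full_problem = (
    "Yoshua recently turned 57. He is three years younger than Yann. "
    "How old is Yann?"
)
body_only = remove_question(full_problem)
```

Here `body_only` keeps the two narrative sentences and drops "How old is Yann?". A model that still predicts the right equation from such input cannot be using the question to do so.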

5-fold Cross Validation Accuracies (↑) of the constrained model on the datasets.

The researchers also conducted constrained model experiments based on the Seq2Seq architecture, removing the LSTM encoder and replacing it with a feed-forward network. The constrained model with non-contextual RoBERTa embeddings achieved accuracies of 51.2 percent on ASDiv-A and an astounding 77.9 percent on MAWPS, indicating that simply associating the occurrence of specific words in the MWPs with their corresponding equations enabled the model to achieve a high score.
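
To illustrate what "lacking word-order information" implies, the sketch below compares two problems under a bag-of-words view. It is not the constrained model itself, only an order-free representation used here as a stand-in, and the second problem is a variation constructed for this illustration rather than an item from any dataset.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, discarding punctuation and word order."""
    return Counter(re.findall(r"\w+", text.lower()))

original = (
    "Yoshua recently turned 57. He is three years younger than Yann. "
    "How old is Yann?"
)
# Same words, different order: Yann is now the younger one.
reordered = (
    "Yoshua recently turned 57. Yann is three years younger than he. "
    "How old is Yann?"
)

same_bag = bag_of_words(original) == bag_of_words(reordered)

# The correct answers differ even though the word counts match.
answer_original = 57 + 3
answer_reordered = 57 - 3
```

Both texts yield identical word counts, yet the first calls for addition and the second for subtraction. A representation that ignores order cannot tell them apart, so high accuracy from such a model suggests the benchmark rarely requires that distinction.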

Based on these identified deficiencies of SOTA MWP solvers, the researchers introduced a novel challenge dataset, SVAMP (Simple Variations on Arithmetic Math word Problems). SVAMP was created by applying carefully chosen variations over examples sampled from existing datasets, and contains one-unknown arithmetic word problems such as those taught at a grade four or lower level. The team tested relative coverage by training a model on one dataset and testing it on the other.

Model results on the SVAMP challenge set
SVAMP accuracies without questions
Constrained model accuracies on SVAMP

The results show that the SVAMP MWPs are less likely to be solved using simple heuristics, and that even with additional training data, current SOTA models remain far from the performance estimates based on prior benchmark datasets.

The work exposes concerning overestimations regarding the ability of NLP models to solve simple one-unknown arithmetic word problems, and shows that building robust methods for solving even elementary MWPs remains an open problem.

The paper Are NLP Models Really Able to Solve Simple Math Word Problems? is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
