# Study Shows Transformers Possess the Compositionality Power for Mathematical Reasoning

A research team from UC Davis, Microsoft Research and Johns Hopkins University extends prior work — which showed that neural networks trained on massive amounts of linguistic data encode grammatical structure in their representations — to the domain of mathematical reasoning, showing that both the standard transformer and the TP-Transformer can compose the meanings of mathematical symbols based on their structured relationships.

From a philosophical perspective, reductionism is the most natural concept in the world: a whole can be understood by understanding its parts. When it comes to cognitive science, this idea can be represented via the principle of compositionality — wherein humans infer structures and relationships based on sensory observations and combine this information with existing knowledge to guide the composition of simpler meanings into complex wholes.

The use of powerful artificial neural networks in natural language processing (NLP) has advanced the study of compositionality to the point where models trained on massive amounts of linguistic data reveal grammatical structures in their learned representations. In a new study, a research team from UC Davis, Microsoft Research and Johns Hopkins University extends this approach to the domain of mathematical reasoning, showing that both the standard transformer architecture and the TP-Transformer can compose the meanings of mathematical symbols based on their structured relationships.

State-of-the-art transformer architectures have achieved impressive results on various NLP tasks. These neural networks learn to encode input sequences as high-dimensional vectors that carry both rich semantic information and additional information about the relevant linguistic subcomponents of the input. Some studies have suggested such models are able to extract and compose the meanings of parts to improve their performance on tasks.

In the paper Compositional Processing Emerges in Neural Networks Solving Math Problems, the researchers apply transformers to the domain of mathematical reasoning to evaluate whether deep neural networks have the ability to derive the meanings of entire arithmetic expressions by obtaining and composing the meanings of their sub-expressions. For instance, the value of 1*9 + 2*3 is obtained by composing the values of sub-expressions 1*9 and 2*3 to get the final result of 15.
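The kind of compositional evaluation described above can be sketched as a recursive walk over an expression tree, where the value of the whole is computed by composing the values of its parts. This is an illustrative sketch of the principle, not code from the paper:

```python
def evaluate(node):
    """Recursively evaluate an expression tree given as a number
    or a nested (op, left, right) tuple."""
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    lval = evaluate(left)    # meaning of the left sub-expression
    rval = evaluate(right)   # meaning of the right sub-expression
    # compose the sub-expression meanings into the meaning of the whole
    return lval + rval if op == "+" else lval * rval

# The paper's example, 1*9 + 2*3, as an expression tree:
expr = ("+", ("*", 1, 9), ("*", 2, 3))
print(evaluate(expr))  # 15
```

The sub-expressions 1\*9 and 2\*3 are evaluated to 9 and 6 before being composed into the final result of 15 — the same derivation a compositional model would need to capture implicitly.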

The study uses standard transformers and TP-Transformers trained on a mathematics dataset containing 112 million mathematical word problems covering arithmetic, algebra, calculus, probability, etc. The standard transformer and TP-Transformer both have an encoder that processes the question and a decoder that generates outputs. Both contain six transformer layers with multi-head attention modules comprising eight heads. Each head in each layer generates a query, key, and value vector for every input. Taking a softmax of the scaled dot product of the queries and keys yields the attention distribution. The final output is a weighted average of the value vectors, with weights given by the attention distribution. The TP-Transformer differs from the standard transformer in that it has additional role vectors designed to explicitly capture structural or relational information in the inputs.
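The attention computation described above can be sketched in a few lines of NumPy. This is a generic illustration of scaled dot-product attention, not the authors' implementation; the shapes (4 positions, dimension 64) are arbitrary choices for the example:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax -> attention distribution
    return weights @ V  # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))  # 4 query positions, d_k = 64
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 64)

# The TP-Transformer additionally binds each head's output with a learned
# role vector; in the original TP-Transformer this binding is an
# elementwise product (shown here with a random stand-in for the roles):
role = rng.normal(size=(4, 64))
bound = out * role
```

Each head performs this computation independently, and the role-binding step is what lets the TP-Transformer explicitly represent the structural relationship a value participates in.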

The researchers performed experiments to evaluate the proposed approach. The results confirmed that neural networks are not only able to infer the structured relationships implicit in their training data, but can also utilize this knowledge to guide the composition of individual meanings into composite wholes.

The paper Compositional Processing Emerges in Neural Networks Solving Math Problems is on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang

