Compositional generalization is the capacity to understand and produce novel combinations of known language components to make “infinite use of finite means.” While this is second nature for humans, classical AI techniques such as grammar or search-based systems have also demonstrated this ability in the field of natural language processing (NLP).
State-of-the-art deep learning architectures such as transformers, however, struggle to capture the compositional structure of natural language and thus fail to generalize compositionally.
In the new paper Making Transformers Solve Compositional Tasks, a Google Research team explores the design space of transformer models in an effort to enable deep learning architectures to solve natural language compositional tasks. The proposed approach provides models with inductive biases via design decisions that significantly impact compositional generalization, and achieves state-of-the-art results on semantic parsing compositional generalization and string edit operation composition benchmarks.
The team summarizes their main contributions as:
- A study of the transformer architecture design space, showing which design choices result in an inductive learning bias that leads to compositional generalization across a variety of tasks.
- State-of-the-art results on datasets such as COGS, where the team reports a classification accuracy of 0.784 using an intermediate representation based on sequence tagging (compared to 0.35 for the best previously reported model; Kim and Linzen, 2020), and on the productivity and systematicity splits of PCFG (Hupkes et al., 2020).
This study focuses on the standard transformer model, which comprises an encoder and a decoder. Given a sequence of token embeddings, the transformer network will output a sequence of tokens generated one at a time by using predictions based on the output distribution generated by the decoder.
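The token-by-token generation described above can be sketched as a greedy decoding loop. This is a minimal illustration, not the paper's implementation: `greedy_decode`, `toy_copy_step`, and the `<s>` / `</s>` markers are hypothetical names, and the toy step function merely stands in for a trained decoder's output distribution.

```python
from typing import Callable, Dict, List

BOS, EOS = "<s>", "</s>"  # assumed start/end markers, for illustration only

StepFn = Callable[[List[str], List[str]], Dict[str, float]]


def greedy_decode(step_fn: StepFn, source: List[str], max_len: int = 20) -> List[str]:
    """Generate tokens one at a time, always taking the most probable one."""
    output = [BOS]
    for _ in range(max_len):
        # The decoder sees the source and everything generated so far.
        distribution = step_fn(source, output)
        token = max(distribution, key=distribution.get)
        if token == EOS:
            break
        output.append(token)
    return output[1:]  # drop the start marker


def toy_copy_step(source: List[str], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a decoder: copies the source, then stops."""
    position = len(prefix) - 1  # number of tokens generated so far
    target = source[position] if position < len(source) else EOS
    return {target: 0.9, "<unk>": 0.1}


decoded = greedy_decode(toy_copy_step, ["jump", "twice"])
truncated = greedy_decode(toy_copy_step, ["a", "b", "c"], max_len=2)
```

A real model would replace `toy_copy_step` with a forward pass through the encoder and decoder; the loop structure is the part that mirrors the description.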
Although compositional generalization seems a difficult task, previous studies have shown it can be treated as a general out-of-distribution generalization problem. Inspired by this idea, the researchers hypothesize that different transformer architecture choices will give models different inductive biases that make them more or less likely to discover symmetries that will better generalize to out-of-distribution samples.
The researchers evaluated the compositional generalization abilities of transformers with different architectural configurations, in particular: (1) the type of position encodings, (2) the use of copy decoders, (3) model size, (4) weight sharing, and (5) the use of intermediate representations for prediction. They used sequence-level accuracy as their evaluation metric.
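Sequence-level accuracy is commonly computed as the fraction of examples whose predicted output matches the target exactly, token for token. The sketch below assumes that exact-match definition; the function name and the sample data are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def sequence_accuracy(predictions: Sequence[List[str]],
                      targets: Sequence[List[str]]) -> float:
    """Fraction of predictions that match their target in every position."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    exact = sum(1 for pred, gold in zip(predictions, targets) if pred == gold)
    return exact / len(targets)


# Illustrative data: the second prediction has one wrong token,
# so the whole sequence counts as incorrect.
preds = [["walk", "walk"], ["jump", "left"], ["run"]]
golds = [["walk", "walk"], ["jump", "right"], ["run"]]
score = sequence_accuracy(preds, golds)  # 2 of 3 sequences match exactly
```

Under this definition a single wrong token makes the whole sequence wrong, which makes the metric considerably stricter than per-token accuracy.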
In the experiments, the baseline transformer achieved an average sequence-level accuracy of only 0.137. Changing the design decisions raised this to as much as 0.527. Moreover, the proposed method achieved state-of-the-art results on the COGS dataset (0.784 accuracy) and on the productivity and systematicity splits of PCFG (0.634 and 0.828, respectively).
Overall, the study shows how different design decisions can provide inductive biases that enable models to generalize to certain symmetries in the input data, significantly improving on previously reported baseline transformer performance for compositional generalization in language and algorithmic tasks.
The paper Making Transformers Solve Compositional Tasks is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang