Systematic generalization is a much-desired property for neural networks, as it enables models to leverage learned knowledge derived from training data in a meaningful way across new tasks and settings.
Many datasets have recently been proposed for testing systematic generalization, but models using the popular baseline transformer architecture have failed miserably in these tasks.
To address this issue, a new paper from a team at the Swiss AI Lab IDSIA proposes that transformers’ systematic generalization performance can be significantly improved via “simple tricks” that revisit basic model configurations such as the scaling of embeddings, early stopping, relative positional embedding, and universal transformer variants.
The main thrust of the study is to demonstrate that the systematic generalization capability of transformers, and in particular of their universal variants, has been largely underestimated. The researchers show that systematic generalization can be improved dramatically via careful model design and training configuration, and they validate this claim empirically.
The team first shows that in order to develop and evaluate methods that improve systematic generalization, good datasets and strong baselines are needed. They consider five datasets: 1) SCAN, which consists of mapping a sentence in natural language into a sequence of commands simulating navigation in a grid world; 2) CFQ, which translates a natural language question to a Freebase SPARQL query; 3) PCFG, comprising list manipulations and operations that should be executed; 4) COGS, which consists of semantic parsing that maps an English sentence to a logical form; and 5) Mathematics Dataset, comprising high school-level textual math questions.
The baselines are two transformer architectures: standard transformers and universal transformers, each with either absolute or relative positional embedding. The universal transformer variants are simply transformers with weights shared between layers, without adaptive computation time or timestep embedding.
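The weight-sharing idea can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the toy “layer” is a plain weight matrix standing in for a full transformer block, and the names `make_layer`, `build_stack` and `count_unique_parameters` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def make_layer(dim, rng):
    """Create a toy layer: a dim x dim weight matrix (stand-in for a transformer block)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Residual update x + tanh(W x); a real block would use attention and a feed-forward net."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]


def build_stack(dim, depth, shared, seed=0):
    """Standard stack: one weight set per layer. Universal-style stack: one set reused."""
    rng = random.Random(seed)
    if shared:
        layer = make_layer(dim, rng)
        return [layer] * depth  # the same weights applied at every depth step
    return [make_layer(dim, rng) for _ in range(depth)]


def count_unique_parameters(stack):
    """Count weights, counting a shared layer only once."""
    unique = {id(layer): layer for layer in stack}
    return sum(len(row) for layer in unique.values() for row in layer)


def forward(stack, x):
    """Apply the layers in order."""
    for layer in stack:
        x = apply_layer(layer, x)
    return x


standard = build_stack(dim=4, depth=6, shared=False)
universal = build_stack(dim=4, depth=6, shared=True)
print(count_unique_parameters(standard))   # 6 layers x 16 weights each
print(count_unique_parameters(universal))  # 16 weights, reused 6 times
```

Both stacks run the same number of computation steps; the shared variant simply has six times fewer distinct weights in this toy setting.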
The paper identifies a number of methods that can improve transformers on systematic generalization tasks. The researchers first address the EOS decision problem with relative positional embedding. The EOS (end-of-sequence) decision determines when to end an output sequence, and the team shows that transformers struggle with it mainly because of absolute positional embedding. Their hypothesis is that the meaning of a given word depends not on its absolute position but on its neighbours, which is the information relative positional embedding captures.
In the EOS decision task experiment, the team tested the performance of their modified model (adjusted layers and hyperparameters) without relative positional embedding (the row labelled Trafo in the paper’s results table), as well as the performance of universal transformer models trained with identical hyperparameters. They observed that both standard and universal transformers with relative positional embedding excel, whereas models with absolute positional embedding show near-zero accuracy for all length cutoffs. This demonstrates the advantage of relative positional embedding and shows that it can largely mitigate the EOS overfitting issue.
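To make the intuition concrete, the short Python sketch below compares what each scheme “sees” when the same phrase appears early or late in a sequence. It only computes position indices and does not implement any particular relative positional embedding from the paper; the function names are hypothetical.

```python
def absolute_positions(length, offset=0):
    """Absolute index of each token when the span starts at `offset`."""
    return [offset + i for i in range(length)]


def relative_offsets(length, offset=0):
    """Matrix of pairwise offsets (key position minus query position)."""
    pos = absolute_positions(length, offset)
    return [[k - q for k in pos] for q in pos]


def unseen_absolute_positions(train_len, test_len):
    """Absolute positions used by a test sequence that never occurred in training."""
    seen = set(absolute_positions(train_len))
    return [p for p in absolute_positions(test_len) if p not in seen]


phrase_len = 3
early = absolute_positions(phrase_len, offset=0)
late = absolute_positions(phrase_len, offset=10)

# The absolute indices differ, but the neighbour-to-neighbour offsets are identical.
same_neighbourhood = relative_offsets(phrase_len, 0) == relative_offsets(phrase_len, 10)
print(early, late, same_neighbourhood)

# A model trained on length-4 sequences meets brand-new absolute indices at length 6.
print(unseen_absolute_positions(train_len=4, test_len=6))
```

The sketch covers local neighbourhoods only: a longer sequence still contains some long-range offsets that were never seen in training, so relative encoding reduces the mismatch rather than removing it entirely.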
The team then demonstrates the importance of careful model selection, in particular with regard to early stopping. To test whether early stopping was holding back generalization, they trained models on the COGS dataset without it. The best model achieved a test accuracy of 81 percent, compared with only 35 percent for the baseline model. They also disabled early stopping in the original codebase, which alone boosted accuracy to 65 percent, without any of the other tricks.
The lack of validation sets for evaluating models’ generalization ability is another factor that might affect performance. The researchers propose that future datasets should include both a validation set and a test set for each of the IID and generalization splits.
The team also notes an intriguing relationship between generalization accuracy and loss: the two do not necessarily move together. From their experiments, they conclude that it would be advantageous to use accuracy instead of loss for early stopping and hyperparameter tuning.
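The practical effect of that recommendation can be sketched in a few lines of Python. The training history below is invented purely for illustration and is not taken from the paper; `select_checkpoint` is a hypothetical helper, not part of the authors’ codebase.

```python
def select_checkpoint(history, metric):
    """Pick the checkpoint to keep, by lowest validation loss or highest validation accuracy."""
    if metric == "loss":
        return min(history, key=lambda record: record["val_loss"])
    if metric == "accuracy":
        return max(history, key=lambda record: record["val_acc"])
    raise ValueError(f"unknown metric: {metric}")


# Made-up numbers: the loss bottoms out early while accuracy keeps rising.
history = [
    {"step": 1000, "val_loss": 0.40, "val_acc": 0.30},
    {"step": 2000, "val_loss": 0.55, "val_acc": 0.50},
    {"step": 3000, "val_loss": 0.70, "val_acc": 0.62},
    {"step": 4000, "val_loss": 0.90, "val_acc": 0.70},
]

by_loss = select_checkpoint(history, "loss")
by_accuracy = select_checkpoint(history, "accuracy")
print(by_loss["step"], by_loss["val_acc"])
print(by_accuracy["step"], by_accuracy["val_acc"])
```

In a situation like this, stopping on loss would keep the earliest checkpoint and discard most of the accuracy gained later, which is the kind of outcome the recommendation is meant to avoid.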
Finally, the team applies the methods across a variety of datasets to show that the proposed model achieves substantially higher performance than the best results reported for previous baseline models across all standard datasets.
Overall, the study demonstrates that by revisiting basic model and training configurations, systematic generalization can be improved on transformer architectures, and that reconsidering early stopping and embedding scaling in particular can greatly improve baseline transformers.
The paper “The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers” is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang