Reliable generalization to out-of-distribution inputs is a crucial feature for developing strong machine learning models. But determining how and why neural networks are able to generalize on algorithmic sequence prediction tasks remains an open question.
In the new paper Neural Networks and the Chomsky Hierarchy, a DeepMind research team conducts an extensive generalization study on neural network architectures that explores whether insights from the theory of computation and the Chomsky hierarchy can predict the practical limits of neural network generalization.
The team summarizes their main contributions as:
- We conduct an extensive generalization study (2200 individual models, 16 tasks) of state-of-the-art neural network architectures (RNN, LSTM, Transformer) and memory-augmented neural networks (Stack-RNN, NDStack-RNN, TapeRNN) on a battery of sequence-prediction tasks spanning all the levels of the Chomsky hierarchy that can be practically tested with finite-time computation.
- We show how increasing amounts of training data do not enable generalization on our tasks higher up in the hierarchy for some architectures (under sufficient capacity to perfectly learn the training data), potentially implying hard limitations for scaling laws.
- We demonstrate how architectures augmented with differentiable structured memory (e.g., with a stack or a tape) can solve tasks higher up the hierarchy.
Many previous works have investigated whether conventional neural network architectures are able to learn a formal language. While these studies have typically focused on single architectures and a limited set of tasks, the DeepMind paper presents an extensive empirical study on a wide range of models with regard to the Chomsky hierarchy.
Named after the influential American linguist and philosopher who developed it, the Chomsky hierarchy is basically a containment hierarchy of formal grammar (unrestricted grammar, context-sensitive grammar, context-free grammar, and regular grammar) that classifies languages based on the type of automaton able to recognize them. By relating different models to the Chomsky hierarchy, it is possible to determine whether they can recognize certain regular languages.
The researchers note that lower-level automata have restrictive memory models and can only solve lower-level problem sets, while atop the hierarchy, Turing machines with infinite memory and unrestricted memory access can solve all computable problems, i.e. are Turing complete.
The paper examines a wide range of neural network architectures and memory-augmented neural networks — transformer, RNN, LSTM, Stack-RNN, NDStack-RNN and Tape-RNN — covering a total of 2200 models applied to 16 sequence-prediction tasks.
The results show that LSTMs and transformers are not Turing complete as they cannot solve simple sequence tasks such as duplicating a string when the sequences are significantly longer than those seen during training. Models interacting with an external memory structures meanwhile can climb the Chomsky hierarchy, indicating this setup as a promising research direction for improving architecture design.
The code is publicly available on the project’s GitHub. The paper Neural Networks and the Chomsky Hierarchy is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.