AI Machine Learning & Data Science Research

Attention Is Not All You Need: Google & EPFL Study Reveals Huge Inductive Biases in Self-Attention Architectures

A research team from Google and EPFL proposes a novel approach that sheds light on the operation and inductive biases of self-attention networks, finding that pure attention loses rank doubly exponentially with depth.

The 2017 paper Attention is All You Need introduced transformer architectures based on attention mechanisms, marking one of the biggest machine learning (ML) breakthroughs ever. A recent study proposes a new way to study self-attention, its biases, and the problem of rank collapse.

Attention-based architectures have proven effective for improving ML applications in natural language processing (NLP), speech recognition, and most recently in computer vision. Research aimed at understanding the inner workings of transformers and attention in general, however, has been limited.

In the paper Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth, a research team from Google and EPFL (École polytechnique fédérale de Lausanne) proposes a novel approach that sheds light on the operation and inductive biases of self-attention networks (SANs) and finds that pure attention loses rank doubly exponentially with depth.


The researchers summarize their work as follows:

  1. Present a systematic study of the transformer's building blocks, revealing the opposing forces at play: self-attention drives rank collapse in transformers, while skip connections and MLPs counteract it.
  2. Propose a new method for analyzing SANs via a path decomposition, revealing SANs as an ensemble of shallow networks.
  3. Verify the theory with experiments on common transformer architectures.

The team began by studying the building-block structure of SANs with skip connections and multi-layer perceptrons (MLPs) disabled. They considered the SAN as a directed acyclic graph, with every node corresponding to a self-attention head and directed edges connecting heads of consecutive layers. Based on this, they built a path decomposition that describes the action of a multi-head SAN as a combination of simpler single-head networks. Since interactions between paths are difficult to characterize, they studied the paths individually and observed that each path converges rapidly to a rank-1 matrix with identical rows. The interesting part is what happens in combination: although the number of paths grows exponentially with depth, each path degenerates doubly exponentially fast, so the network's output still collapses to rank 1.
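This rank-1 collapse is easy to reproduce numerically. Below is a minimal NumPy sketch (not the authors' code; the dimensions, depth, and random initialization are illustrative) that stacks single-head self-attention layers without skip connections or MLPs and tracks how far the output is from the nearest rank-1 matrix with identical rows:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def residual_ratio(X):
    # Distance from the closest rank-1 matrix with identical rows
    # (each row replaced by the mean row), relative to the norm of X.
    R = X - X.mean(axis=0, keepdims=True)
    return np.linalg.norm(R) / np.linalg.norm(X)

rng = np.random.default_rng(0)
n, d, depth = 8, 16, 6                  # tokens, width, number of layers
X = rng.standard_normal((n, d))
ratios = [residual_ratio(X)]
for _ in range(depth):
    # One single-head self-attention layer: no skip connection, no MLP.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # n x n attention matrix
    X = A @ (X @ Wv)
    ratios.append(residual_ratio(X))
# `ratios` shrinks rapidly toward 0: the rows of X become identical.
```

Each attention matrix is row-stochastic, so every layer averages the token representations toward a common row; as the representations homogenize, the attention logits also shrink, which makes the averaging stronger still, hence the accelerating decay.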

Two paths in a deep Self-Attention Network (SAN) with H heads and L layers. At each layer, a path can go through one of the heads or bypass the layer. Adding an MLP block after each attention layer forms the transformer architecture.
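The path decomposition itself is simple combinatorics: at each of the L layers a path either routes through one of the H heads or takes the skip connection, giving (H+1)^L paths in total, of which C(L, k)·H^k use exactly k attention heads. A quick sketch (the dimensions are illustrative, chosen to match a BERT-base-sized model with 12 layers and 12 heads):

```python
from math import comb

def path_counts(L, H):
    # A path either bypasses a layer or routes through one of its H heads;
    # a path of "length" k passes through exactly k attention heads.
    return {k: comb(L, k) * H**k for k in range(L + 1)}

counts = path_counts(L=12, H=12)
total = sum(counts.values())   # equals (H + 1) ** L = 13 ** 12
```

Note that long paths vastly outnumber short ones, which makes the later finding, that short paths carry most of the expressive power, all the more striking.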

The researchers considered the behaviour of each path separately, examining how the residual (the distance of the output from the nearest rank-1 matrix) changes during the forward pass. They discovered that the residual norm converges to zero surprisingly quickly: doubly exponentially, at a cubic rate. Because the rank of the attention matrices in each layer depends in turn on the rank of that layer's input, rank loss compounds from layer to layer; in other words, deeper SANs suffer a cascading effect.

In an effort to obtain a deeper understanding of the structure of SANs, the team expanded their analysis by incorporating the three key transformer components that SANs lack: skip connections, MLPs, and layer normalization. This examination revealed that SANs with skip connections enabled rely heavily on short paths, behaving like ensembles of shallow single-head self-attention networks. The team also discovered that MLPs counteract convergence (the more powerful the MLP, the slower the convergence), and that layer normalization does not mitigate rank collapse.
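The counteracting effect of skip connections can be seen in the same kind of toy setting. The sketch below (again an illustrative NumPy toy, not the paper's experiments) runs identical random attention layers with and without a skip connection and compares the relative residuals:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def residual_ratio(X):
    R = X - X.mean(axis=0, keepdims=True)   # distance from rank-1
    return np.linalg.norm(R) / np.linalg.norm(X)

def run(depth=6, n=8, d=16, skip=False, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    for _ in range(depth):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
        out = A @ (X @ Wv)
        X = X + out if skip else out        # skip connection on/off
    return residual_ratio(X)

pure = run(skip=False)     # pure attention: residual collapses toward 0
skipped = run(skip=True)   # with skips: a substantial residual survives
```

Intuitively, the identity branch of a skip connection carries the input's full-rank component past each attention layer untouched, which is exactly the "short path" the analysis says the network leans on.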

The team conducted the following experiments:

  1. Rank collapse in real architectures, examining the residual of the popular transformer architectures BERT, ALBERT, and XLNet.
  2. Visualizing the biases of different architectures, studying the behaviour of a single-layer transformer when applied recurrently to predict a simple 2D circular sequence.
  3. Testing path effectiveness with respect to length through three tasks: Sequence memorization, Learning to sort, and Convex hull prediction.

Results of experiment 1. Relative norm of the residual along the depth for three models before and after training. Pure attention (SAN) converges rapidly to a rank-1 matrix (i.e., a zero residual). Adding MLP blocks and skip connections gives a transformer; skip connections play a critical role in mitigating rank collapse.
Results of experiment 2. Applying a trained single-layer transformer module recurrently to models of increasing hidden dimension (horizontal direction) and across architectural variants (vertical direction). The two light background paths illustrate the two training trajectories, for which the starting points are (−0.3, 0) and (0.3, 0).
Results of experiment 3, reporting test-set per-token label prediction accuracy as the evaluation metric. To determine how much of the expressive power can be attributed to short vs. long paths, the researchers examined the performance of subsets of paths of different lengths (rather than the entire SAN).

The first experiment confirmed that when skip connections are removed, all networks exhibit a rapid rank collapse, while the second showed that adding MLP or skip connections either stops or drastically slows down rank collapse. The last experiment supported the researchers’ hypothesis that short paths are responsible for the majority of SANs’ expressive power.

The paper Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
