Although encoder-decoder networks with attention have achieved impressive results in many sequence-to-sequence tasks, the mechanisms by which such networks generate appropriate attention matrices remain something of a black box.
In the new paper Understanding How Encoder-Decoder Architectures Attend, researchers from the University of Washington, Google Blueshift Team and Google Brain Team propose a method for decomposing hidden states over sequences into temporal- and input-driven components to reveal how attention matrices are formed in encoder-decoder networks.

The team summarises their study’s contributions as:
- We propose a decomposition of hidden state dynamics into separate pieces, one of which explains the temporal behaviour of the network, another of which describes the input behaviour. We show such a decomposition aids in understanding the behaviour of networks with attention.
- In the tasks studied, we show the temporal (input) components play a larger role in determining the attention matrix as the average attention matrix becomes a better (worse) approximation for a random sample’s attention matrix.
- We discuss the dynamics of architectures with attention and/or recurrence and show how the input/temporal component behaviour differs across said architectures.
- We investigate the detailed temporal and input component dynamics in a synthetic setting to understand the mechanism behind common sequence-to-sequence structures and how they might differ in the presence of recurrence.

The researchers examine three encoder-decoder architectures with varying combinations of recurrence and attention: Vanilla Encoder-Decoder (VED), Encoder-Decoder with Attention (AED), and Attention Only (AO). They show that in these architectures it is useful to express each hidden state in terms of a temporal component and an input component, since every hidden state is associated with both a time step and the input word presented at that step. This decomposition disentangles the temporal and input behaviour from the rest of the network dynamics.
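To make the idea concrete, here is a minimal NumPy sketch of one way such a decomposition could be computed over a dataset of hidden states. The array layout, the function name decompose_hidden_states and the choice of simple averages are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def decompose_hidden_states(H, tokens):
    """Split hidden states into temporal and input-driven components.

    H      : (num_samples, seq_len, hidden_dim) hidden states collected over a
             dataset (hypothetical layout, not taken from the paper's code).
    tokens : (num_samples, seq_len) integer ids of the input word fed to the
             network at each step.
    """
    # Temporal component: the per-time-step average over the dataset.
    # It captures what the hidden state looks like at step t regardless of input.
    temporal = H.mean(axis=0)                      # (seq_len, hidden_dim)

    # Remove the temporal part; what remains depends on the particular inputs.
    residual = H - temporal[None, :, :]            # (num_samples, seq_len, hidden_dim)

    # Input component: average the residual over every position where a given
    # token occurs, so each token id gets one (time-independent) vector.
    vocab_size = int(tokens.max()) + 1
    input_comp = np.zeros((vocab_size, H.shape[-1]))
    for tok in range(vocab_size):
        mask = tokens == tok
        if mask.any():
            input_comp[tok] = residual[mask].mean(axis=0)
    return temporal, input_comp, residual
```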


The team then demonstrates how each of the three architectures learns to solve its tasks and what roles the input and temporal components play. Applying the decomposition to the hidden states and plotting the temporal components of both the encoder and the decoder, they are able to observe how the attention matrices are formed for each task.
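As a rough illustration of this kind of analysis, the sketch below builds an "input-agnostic" attention matrix from the encoder and decoder temporal components alone. It assumes plain dot-product attention without learned projections or scaling, which real models typically add; when this matrix already resembles the per-sample attention matrices, the temporal components are doing most of the work.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the requested axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_from_temporal(enc_temporal, dec_temporal):
    """Attention matrix built from the temporal components alone.

    enc_temporal : (src_len, hidden_dim) encoder temporal component
    dec_temporal : (tgt_len, hidden_dim) decoder temporal component
    """
    scores = dec_temporal @ enc_temporal.T         # (tgt_len, src_len)
    return softmax(scores, axis=-1)                # rows sum to 1 over source positions
```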
The study’s analytical results show that, depending on the task requirements, encoder-decoder networks rely more on either the temporal or the input-driven components, and that this holds across both recurrent and feed-forward architectures, regardless of how they form their temporal components.
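One simple way to quantify this reliance, following the contribution quoted above about the average attention matrix, is to measure how closely each sample's attention matrix matches the dataset average; the Frobenius distance used here is an illustrative choice, not necessarily the paper's metric.

```python
import numpy as np

def average_attention_gap(attn):
    """Mean distance between per-sample attention matrices and their average.

    attn : (num_samples, tgt_len, src_len) attention matrices collected from a model.
    A small value suggests the (input-independent) temporal components dominate;
    a large value suggests the input-driven components do.
    """
    avg_attn = attn.mean(axis=0)                   # (tgt_len, src_len)
    diffs = attn - avg_attn[None]
    return np.linalg.norm(diffs, axis=(1, 2)).mean()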
Overall, the work’s decomposition of hidden states into temporal (input-independent) and input-driven (time-independent) components provides novel and valuable insights into the inner workings of attention-based encoder-decoder networks. The team proposes that future research could investigate how this decomposition behaves on other sequence-to-sequence tasks such as speech-to-text, and to what degree the observed dynamics generalize to more complicated networks, such as those with bidirectional RNNs or multi-head and self-attention mechanisms. They also note that the attention-only architectures examined are akin to transformers, suggesting that similar dynamical behaviour may also hold for that popular non-recurrent architecture.
The paper Understanding How Encoder-Decoder Architectures Attend is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
