In recent years, machine intelligence has made significant progress thanks to the rise of deep learning. Various neural network paradigms have become state-of-the-art methods across multiple fields, such as computer vision and natural language processing. On specific tasks they can even surpass human-level performance, which is a striking success.
In the article “Memory, attention, sequences”, the author predicts that future work on neural networks will emphasize understanding complex spatio-temporal data from the real world, which is highly contextual and noisy. He further argues that the attention mechanism and memory structure, two active topics of current research, are necessary to achieve such a goal.
Beyond the general view that attention and memory will help along the way, we can go one step further and say how they are likely to contribute: either as an integrated feature of existing frameworks (much like dropout against overfitting or batch normalization for stable training), or as a standalone, self-contained module that can be used either on its own or jointly with the common CNN or RNN. In the following, we examine both possibilities and see how they follow naturally from the author’s idea.
Attention is an intrinsic function of the vision system and of language understanding. When you look at an image, you do not go through each pixel or block of pixels sequentially; instead, you subconsciously focus on a few regions of highest information density and filter out the rest. The same thing happens when you read: you tend to remember only the words, phrases or sentences that are crucial for understanding, and cannot recite the entire text unless forced to (say, when preparing for exams). In essence, attention captures contextual information in a hierarchical manner, so that it is sufficient for decision making while reducing overhead.
Attention models are task specific, and there are many variants in terms of implementation. The basic example given in the article uses a context vector c, which is shared across the states y_i to produce individual attentional weights a_i, each representing the relevance of y_i to the context vector c. The output z is then a weighted arithmetic mean of the y_i, with the a_i as weights. This is “soft” attention: because the attended output is a weighted mean of the inputs, the attention process is smoothly differentiable. In contrast, “hard” attention also generates a set of attentional weights, but they are interpreted as probabilities of picking one specific input as the attended output. Such a stochastic operation is not differentiable, which hinders training through normal back-propagation. A typical solution is to maximize an approximate variational lower bound, or equivalently to use REINFORCE; see Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Fig 1. “Soft” Attention
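As a rough illustration of the difference between the two, the sketch below computes soft attention over a handful of states and then picks a single state by hard attention. It is a minimal NumPy sketch under stated assumptions, not code from the article or the cited papers: the tanh scoring function, the parameter names W_c, W_y and w, and the random inputs are all made up for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def soft_attention(c, Y, W_c, W_y, w):
    """Weight each state y_i by its relevance to the context vector c.

    Y has shape (n, d_y); c has shape (d_c,).
    W_c, W_y and w are hypothetical learned parameters.
    """
    m = np.tanh(W_c @ c + Y @ W_y.T)  # (n, d_m): joint view of c and each y_i
    a = softmax(m @ w)                # (n,): attentional weights, sum to 1
    z = a @ Y                         # (d_y,): weighted mean of the states
    return z, a

def hard_attention(Y, a, rng):
    """Treat the weights as probabilities and sample one state."""
    i = rng.choice(len(Y), p=a)       # stochastic, hence not differentiable
    return Y[i], i

rng = np.random.default_rng(0)
n, d_y, d_c, d_m = 5, 4, 3, 8
Y = rng.normal(size=(n, d_y))
c = rng.normal(size=d_c)
W_c = rng.normal(size=(d_m, d_c))
W_y = rng.normal(size=(d_m, d_y))
w = rng.normal(size=d_m)

z, a = soft_attention(c, Y, W_c, W_y, w)
picked, idx = hard_attention(Y, a, rng)
```

The soft output z blends all five states, so gradients can flow to every y_i; the hard output is exactly one row of Y, chosen by sampling.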
The choice of context vector c also varies. One way is to construct c dynamically inside the network, as in Neural Machine Translation By Jointly Learning To Align and Translate, where the context vector is formed from the previous hidden state and the annotation vectors of the bidirectional RNN. Alternatively, c can be a trained vector that remains fixed at test time, as in Hierarchical Attention Networks For Document Classification. However, a pre-trained context vector is likely to be constrained to specific tasks and non-transferable.
Fig 2. Attention with dynamic context vector
Fig 3. Hierarchical Attention with trained context vector
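The two choices of context vector can be contrasted in a short sketch. The first function scores the annotations against a query that changes at every decoding step, in the spirit of the additive scoring used in the translation paper; the second scores them against a single trained vector, in the spirit of the hierarchical attention paper. All parameter names (W_s, U_h, v, W, b, u_w), the shapes, and the random values are assumptions for illustration; neither function is the authors' implementation.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def dynamic_attention(s_prev, H, W_s, U_h, v):
    """Score annotations H (n, d_h) against the previous hidden state s_prev."""
    e = np.tanh(W_s @ s_prev + H @ U_h.T) @ v  # (n,): one score per annotation
    alpha = softmax(e)
    return alpha @ H, alpha                    # weighted sum of annotations

def trained_context_attention(H, W, b, u_w):
    """Score annotations against a trained vector u_w, fixed at test time."""
    u = np.tanh(H @ W.T + b)                   # (n, d_a): hidden view of each h_i
    alpha = softmax(u @ u_w)
    return alpha @ H, alpha

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 4, 5, 7
H = rng.normal(size=(n, d_h))

# Dynamic case: the query is the decoder state, so the weights change per step.
W_s = rng.normal(size=(d_a, d_s))
U_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)
ctx_a, alpha_a = dynamic_attention(rng.normal(size=d_s), H, W_s, U_h, v)
ctx_b, alpha_b = dynamic_attention(rng.normal(size=d_s), H, W_s, U_h, v)

# Trained case: the weights depend only on H and the fixed parameters.
W = rng.normal(size=(d_a, d_h))
b = rng.normal(size=d_a)
u_w = rng.normal(size=d_a)
ctx_fixed, alpha_fixed = trained_context_attention(H, W, b, u_w)
```

In the dynamic case the same annotations receive different weights whenever the decoder state changes, whereas the trained vector always asks the same question of the input, which is one way to read the transferability concern above.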
Figures 2 and 3 above demonstrate how attention is incorporated to augment the base architectures (CNN and RNN). In the paper Attention Is All You Need, by contrast, attention itself serves as the base structure, facilitating both contextual inference and faster computation.
The innovation of the attention model shown below (Fig 6) lies in the absence of any CNN or RNN structure. The whole neural network used for translation is a typical encoder-decoder architecture in which both the encoder and the decoder are built from stacked attention modules. Each module consists of a so-called “multi-head attention” sublayer (Fig 5, essentially a depth-wise attention / attention channels analogous to a convolutional layer), followed by a fully connected sublayer, with residual connections. Additionally, the researchers devise a scaled dot-product (Fig 4) to compute the attentional weights, which is both faster and more space-efficient in practice.
Fig 4. Scaled Dot-Product Attention
Fig 5. Multi-Head Attention (attention layers are parallel)
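The two building blocks in Fig 4 and Fig 5 can be sketched in a few lines. The first function follows the formula given in the paper, softmax(Q K^T / sqrt(d_k)) V; the second splits the model dimension into several heads that attend in parallel and concatenates their outputs. This is a simplified NumPy sketch: masking, biases, dropout, the fully connected sublayer and the residual connections are omitted, and the weight names W_q, W_k, W_v, W_o and the random inputs are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., n_q, n_k)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, num_heads):
    """X_q: (n_q, d_model), X_kv: (n_k, d_model); all W_*: (d_model, d_model)."""
    def split_heads(X):
        n, d_model = X.shape
        # (n, d_model) -> (num_heads, n, d_model // num_heads)
        return X.reshape(n, num_heads, d_model // num_heads).transpose(1, 0, 2)

    Q = split_heads(X_q @ W_q)
    K = split_heads(X_kv @ W_k)
    V = split_heads(X_kv @ W_v)
    heads, _ = scaled_dot_product_attention(Q, K, V)  # heads attend in parallel
    concat = heads.transpose(1, 0, 2).reshape(X_q.shape[0], -1)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

# Self-attention: queries, keys and values all come from the same sequence.
out = multi_head_attention(X, X, W_q, W_k, W_v, W_o, num_heads)
```

Every position is compared with every other position through one matrix product, with no recurrence over time steps, which is where the computational advantage over an RNN comes from.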
A positional encoding mechanism (Fig 6) is also used to take advantage of the sequentially ordered structure of the data, which, according to the author, can be related to a biological phenomenon known as neural oscillation. In the referenced paper Convolutional Sequence to Sequence Learning, adding positional encoding yields only a minor improvement in model performance, since the model already learns relative position information on its own. But because the Transformer has no recurrent or convolutional components from which to learn position, it seems necessary to incorporate explicit location information through positional encoding, which adds little overhead.
This architecture is reported to outperform previous state-of-the-art models on both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, showing promising results for future extensions and confirming that standalone attention-based models are possible.
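For concreteness, here is a small sketch of the sinusoidal encoding described in Attention Is All You Need, where even dimensions use sin(pos / 10000^(2i / d_model)) and odd dimensions use the matching cosine. The function name and the example sizes are assumptions, and the sketch requires an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of position signals (d_model even)."""
    positions = np.arange(max_len)[:, None]      # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)

# The encoding is simply added to the token embeddings (random stand-ins here).
embeddings = np.random.default_rng(0).normal(size=(50, 16))
encoder_input = embeddings + pe
```

The encoding has no trainable parameters and is computed once, so it injects order information at essentially no extra cost.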
Fig 6. Model Architecture – The Transformer
Memory is a key component of human intelligence. In a sense, attention could be seen as a short-term memory devoted to the first few stages of data processing in the human brain. In his article, the author introduces Fast Weights as an intermediate memory mechanism (Fig 7), sitting between the neural weights (long-term, trained parameters) and recurrent weights (very fast, based on inputs).
In the paper, fast weights are designed as a “cache” that stores partial results from the recent past of the information flow, and they are applied in an iterative settling phase between consecutive hidden states h_(t-1) and h_t. It is argued that through this iterative phase, the network can attend to the parts of its recent history that most resemble the current hidden state, and thereby refine its future actions.
This mechanism adds flexibility in how attention can be applied. It is reported to beat common RNNs on a variety of tasks devised by the authors, while remaining efficient to train and being consistent with the biological nature of the human brain.
Fig 7. RNN model with Fast Weights
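The settling phase can be sketched as follows, based on the update rule given in Using Fast Weights to Attend to the Recent Past: the fast weight matrix A decays by a factor lambda and is incremented by eta times the outer product of the current hidden state with itself, and A is then applied in a short inner loop before the next hidden state is committed. The tanh nonlinearity, the simplified layer normalization without gain or bias, the values of lambda and eta, and all sizes below are assumptions for illustration rather than the paper's experimental settings.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer normalization without learned gain and bias.
    return (x - x.mean()) / (x.std() + eps)

def fast_weights_step(h, x, A, W, C, lam, eta, inner_steps=1):
    """One RNN step with a fast-weight memory A of shape (d_h, d_h)."""
    # Decay the old memory and store the current hidden state in it.
    A = lam * A + eta * np.outer(h, h)
    # Contribution of the slow weights; it stays fixed during settling.
    boundary = W @ h + C @ x
    h_next = np.tanh(boundary)
    # Settling phase: A pulls h_next toward similar recent hidden states.
    for _ in range(inner_steps):
        h_next = np.tanh(layer_norm(boundary + A @ h_next))
    return h_next, A

rng = np.random.default_rng(0)
d_h, d_x, steps = 6, 3, 5
LAM, ETA = 0.95, 0.5
W = 0.1 * rng.normal(size=(d_h, d_h))  # slow recurrent weights
C = rng.normal(size=(d_h, d_x))        # slow input weights

h = np.zeros(d_h)
A = np.zeros((d_h, d_h))
history = []                           # hidden states written into A
for x in rng.normal(size=(steps, d_x)):
    history.append(h)
    h, A = fast_weights_step(h, x, A, W, C, LAM, ETA)
```

Because A is a decayed sum of outer products, the term A @ h_next equals a sum of past hidden states, each weighted by its scalar product with h_next and by how recent it is. That is the sense in which fast weights act as attention over the recent past.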
To conclude, attention and memory mechanisms tackle some of the most fundamental problems in simulating the human mind. Despite the significant results already achieved, they still have great potential in the development of more advanced neural networks.
 Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. and Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. [online] Available at: https://arxiv.org/pdf/1502.03044.pdf [Accessed 7 Aug. 2017].
 Bahdanau, D., Cho, K. and Bengio, Y. (2016). Neural Machine Translation By Jointly Learning To Align And Translate. [online] Available at: https://arxiv.org/pdf/1409.0473.pdf [Accessed 7 Aug. 2017].
 Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. and Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. [online] Available at: https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf [Accessed 7 Aug. 2017].
 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I. (2017). Attention Is All You Need. [online] Available at: https://arxiv.org/pdf/1706.03762.pdf [Accessed 7 Aug. 2017].
 Gehring, J., Auli, M., Grangier, D., Yarats, D. and Dauphin, Y. (2017). Convolutional Sequence to Sequence Learning. [online] Available at: https://arxiv.org/pdf/1705.03122.pdf [Accessed 9 Aug. 2017].
 Ba, J., Hinton, G., Mnih, V., Leibo, J. and Ionescu, C. (2016). Using Fast Weights to Attend to the Recent Past. [online] Available at: https://arxiv.org/pdf/1610.06258.pdf [Accessed 7 Aug. 2017].
Blog Author: Eugenio Culurciello
Author: Justin Yuan | Editor: Joni Chung | Localized by Synced Global Team: Xiang Chen