San Francisco research company OpenAI has developed Sparse Transformer, a deep neural network that outperforms current state-of-the-art techniques for predicting long-sequence data in text, images and sound. Sparse Transformer uses an improved algorithm based on the attention mechanism, and can model sequences 30 times longer than was previously possible.
Modeling long-range and subtle interdependencies in complex data such as images, video or sound is one of the most significant research challenges in AI. Models built for these sorts of data were previously tailored to a specific domain and were difficult to scale to sequences with more than a few thousand elements.
Sparse Transformer reduces the computational complexity of the traditional attention mechanism from O(n²) to O(n√n) in the sequence length n, and can be applied directly to different data types. It can be used to model sequences with tens of thousands of elements.
The major contribution of the paper is a modified Transformer architecture with sparse attention. The traditional Transformer architecture with attention is more flexible than models with fixed connectivity patterns, but it consumes a lot of memory when dealing with high-dimensional data like images or raw audio. Recomputing the attention matrix from checkpoints during back-propagation is a well-established technique for reducing memory usage at the cost of extra computation, and it makes Transformer models with increased depth (more hidden layers) trainable.
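To give a rough sense of why dense attention is memory-hungry, the sketch below estimates the size of a single full attention matrix. It is a back-of-the-envelope illustration, not a figure from the paper: the function name `attention_matrix_bytes` is hypothetical, and the 4-byte (float32) value size and the 64×64×3 image flattened to 12,288 elements are assumptions chosen for the example.

```python
def attention_matrix_bytes(seq_len, layers=1, heads=1, bytes_per_value=4):
    """Bytes needed to store dense seq_len x seq_len attention matrices.

    Assumption: one value per (query, key) pair, stored at
    bytes_per_value bytes (4 = float32), for every head in every layer.
    """
    return seq_len * seq_len * layers * heads * bytes_per_value


if __name__ == "__main__":
    # Hypothetical example: a 64x64 RGB image flattened to one sequence.
    seq_len = 64 * 64 * 3
    size = attention_matrix_bytes(seq_len)
    print(f"{seq_len} elements -> {size / 2**30:.4f} GiB per head per layer")
```

Because the cost grows with the square of the sequence length, doubling the sequence quadruples the memory, and the total is multiplied again by the number of heads and layers.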
Even so, the Transformer architecture still requires more memory and computation to perform well on long-sequence inputs, which can become impractical in lab experiments. That is why OpenAI researchers turned to sparse attention. By visualizing attention patterns in deep Transformer architectures, the researchers discovered interpretable and structured sparsity patterns. This inspired them to implement two-dimensional factorizations of the attention matrix: strided attention and fixed attention.
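The two patterns can be illustrated with boolean masks. The following is a minimal sketch based on one reading of the paper's definitions, not OpenAI's implementation: the function names, the 0-based indexing, the choice to merge both attention heads into a single mask, and the parameter `c` (number of "summary" columns per block) are all assumptions made for illustration.

```python
import math


def strided_mask(n, stride):
    """mask[i][j] is True when position i may attend to position j.

    Assumed pattern: each position sees the most recent `stride`
    positions (itself included) plus every stride-th earlier position.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            local = i - j < stride
            strided = (i - j) % stride == 0
            mask[i][j] = local or strided
    return mask


def fixed_mask(n, stride, c=1):
    """Assumed pattern: each position sees its own block of length
    `stride`, plus the last `c` positions of every earlier block."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            same_block = j // stride == i // stride
            summary = j % stride >= stride - c
            mask[i][j] = same_block or summary
    return mask


def count_connections(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 64
    stride = int(math.sqrt(n))  # stride near sqrt(n), as in the paper
    full = n * (n + 1) // 2     # dense causal attention
    print("full causal:", full)
    print("strided:", count_connections(strided_mask(n, stride)))
    print("fixed:", count_connections(fixed_mask(n, stride)))
```

With the stride set close to √n, each position attends to on the order of √n others rather than all n, which is where the reduction in computation comes from.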
Sparse Transformer set new records on the CIFAR-10, Enwik8, and ImageNet 64 benchmarks while achieving lower errors and faster training speeds than a Transformer with full attention. The method can be evaluated qualitatively on image completion tasks, and it can also generate raw audio simply by changing the position embeddings.
To simplify experimentation, OpenAI implemented a set of block-sparse kernels that perform these operations efficiently on GPUs. The kernels are now open-sourced on GitHub, along with examples of sparse attention functions.
The researchers believe that learning sparse patterns will be a significant research direction for the next generation of neural network architectures, as Rewon Child and Scott Gray explain in an OpenAI Blog post: “Even with the improvements we described above, autoregressive sequence generation still seems impractical for very high resolution images or video. The optimized attention operations we have introduced, however, may be useful primitives to combine with other approaches to modeling high dimensional data, like multi-scale approaches.”
More information is available on the original OpenAI blog. The paper Generating Long Sequences with Sparse Transformers is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen