As machine learning models become larger and more powerful, researchers are increasingly seeking ways to curb their huge computational appetites and improve efficiency. Nowhere is this more evident than with transformer architectures, whose superior capabilities in handling long text sequences have brought them to the forefront of natural language processing (NLP) and sequence modelling, but whose quadratic computational complexity has hindered their application and accessibility.
In the new paper Hierarchical Transformers Are More Efficient Language Models, a team from the University of Warsaw, OpenAI and Google Research proposes Hourglass — a novel hierarchical transformer language model that operates on shortened sequences and achieves a new state-of-the-art in image generation on ImageNet32.
The team summarizes their study’s main contributions as:
- We show how hierarchy can improve the efficiency of transformers in a language modelling setup.
- Hourglass significantly outperforms the baseline both in terms of perplexity reached at a given linear computation cost and in empirical metrics such as memory usage.
- Hourglass achieves state-of-the-art results among autoregressive models on the ImageNet32 generation task and competitive results on other image generation and language modelling tasks.
- Hourglass can be used with any attention type, which opens new directions for future research on transformers capable of processing longer sequences and on improving the trade-off between efficiency and accuracy.
Transformers’ high computation costs are primarily due to their self-attention mechanisms: each self-attention layer has complexity quadratic in the length of the context. Previous studies have proposed techniques such as sparse attention that modify the attention mechanism without changing the overall transformer architecture. Most such techniques, however, still force the model to operate on a sequence of the same length as the input, which, as the paper explains, causes both fundamental and practical shortcomings. Fundamentally, while the goal is for models to create high-level representations of words, entities or even whole events, these occur at a very different granularity than the single letters or characters the model receives as input. On the practical side, even layers with linear complexity can be very slow and memory-intensive when processing very long sequences at the wrong granularity.
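To see where the quadratic cost comes from, consider a minimal single-head self-attention in NumPy. This is an illustrative sketch only (no learned projection matrices, which a real layer would include): the (n, n) score matrix is what makes both compute and memory grow quadratically with sequence length n.

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention (illustrative sketch).

    The (n, n) score matrix below is the source of the quadratic cost:
    doubling the sequence length n quadruples its size.
    """
    n, d = x.shape
    # For brevity, queries/keys/values are just x itself;
    # a real layer would apply learned weight matrices first.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)                       # shape (n, n) -> O(n^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v                                  # shape (n, d)

x = np.random.default_rng(0).normal(size=(8, 4))        # 8 tokens, dim 4
out = self_attention(x)
print(out.shape)  # (8, 4)
```

Sparse-attention methods reduce the cost of the score matrix itself; Hourglass instead shrinks n for the inner layers, which also helps every other per-token computation in the stack.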
To address these issues, the researchers modify the transformer architecture to shorten the internal sequence of activations when going deeper in the layer stack and expand it back before generation. Tokens are merged into groups by a shortening operation that reduces the overall sequence length; after the inner layers process this shorter sequence, the activations are up-sampled back to full length and combined with activations from earlier layers.
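The shorten/up-sample cycle can be sketched as follows. This is a hedged toy version, not the paper's exact implementation: the paper studies several pooling and up-sampling variants, whereas here shortening is plain mean pooling over groups of k tokens and up-sampling simply repeats each merged token k times before adding the pre-shortening activations as a skip connection.

```python
import numpy as np

def shorten(x, k):
    """Merge every k consecutive token activations into one by mean pooling.

    (Mean pooling is an illustrative stand-in; the paper compares
    several shortening operations.)
    """
    n, d = x.shape
    assert n % k == 0, "sequence length must be divisible by k"
    return x.reshape(n // k, k, d).mean(axis=1)         # (n/k, d)

def upsample(short, k, residual):
    """Expand the shortened sequence back to full length by repeating
    each merged token k times, then add the pre-shortening activations
    as a skip connection."""
    return np.repeat(short, k, axis=0) + residual       # (n, d)

x = np.random.default_rng(0).normal(size=(12, 4))       # 12 tokens, dim 4
inner = shorten(x, k=3)                                 # 4 merged tokens
# ... the inner transformer layers would run on this short sequence ...
y = upsample(inner, k=3, residual=x)
print(inner.shape, y.shape)  # (4, 4) (12, 4)
```

Because the inner layers see a sequence k times shorter, their self-attention cost drops by roughly a factor of k², which is where the efficiency gain comes from.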
The team compared Hourglass with various baseline transformer models in terms of required running memory, computational cost and perplexity on the popular enwik8, ImageNet and CIFAR benchmarks for both text and image generation tasks.
In the experiments, Hourglass outperformed the transformer baselines in terms of perplexity reached at a given linear computation cost, improved language modelling efficiency on enwik8, and achieved a new state-of-the-art for transformer models on the ImageNet32 generation task. The results indicate that the proposed hierarchical transformer architecture is capable of processing longer sequences and improving the trade-off between efficiency and accuracy. The researchers suggest future work could focus on the shortening mechanism itself and on choosing the best hierarchy for particular tasks.
The paper Hierarchical Transformers Are More Efficient Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen