Transformers have revolutionized a wide array of learning tasks, but their scalability limitations remain a pressing challenge. Exact computation of attention layers incurs runtime and memory costs that grow quadratically with sequence length, hindering transformer models from scaling to longer context lengths.
In a new paper, HyperAttention: Long-context Attention in Near-Linear Time, a research team from Yale University and Google Research presents HyperAttention, an approximate attention mechanism that tackles the computational challenges posed by long contexts in Large Language Models (LLMs). HyperAttention is both practically efficient and backed by a near-linear time guarantee, a notable advance over prior approximate attention methods.
The central problem addressed by this research is approximating dot-product attention, which processes three input matrices: Q (queries), K (keys), and V (values), each with one row per token in the input sequence and one column per dimension of the latent representation. The goal is to efficiently approximate the output matrix, Att, while preserving its spectral properties.
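To make the quadratic bottleneck concrete, here is a minimal sketch of exact dot-product attention in NumPy. This is illustrative only, not the paper's code; the function name and the omission of the usual 1/sqrt(d) scaling are simplifications.

```python
import numpy as np

def exact_attention(Q, K, V):
    """Att = D^{-1} A V, where A = exp(Q K^T) and the diagonal matrix D
    holds A's row sums. Forming A explicitly is the O(n^2) bottleneck."""
    A = np.exp(Q @ K.T)                          # n x n attention matrix
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)   # diagonal scaling (row sums)
    return D_inv * (A @ V)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = exact_attention(Q, K, V)
assert out.shape == (n, d)
```

Each row of D^{-1} A is a softmax distribution over the keys, which is why approximating the diagonal scaling matrix D is central to the method.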
The proposed approach develops an efficient estimator for the diagonal scaling matrix in near-linear time, and swiftly approximates the matrix product of the softmax matrix and the value matrix through subsampling. The researchers streamline the kernel density estimation (KDE) procedure used in the earlier KDEformer algorithm, showing that uniform sampling suffices to achieve the desired spectral guarantee, which eliminates the need for importance sampling based on kernel densities. This simplification yields a practical algorithm with a provable near-linear running time.
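The two subsampling steps described above can be sketched as follows. This toy version uses plain uniform sampling over key indices, as the simplification suggests; it omits the paper's additional machinery for locating large attention entries, and the function name is hypothetical.

```python
import numpy as np

def sampled_attention(Q, K, V, m, seed=0):
    """Estimate the diagonal scaling (row sums of A = exp(Q K^T)) and the
    product A V from m uniformly sampled key indices, rescaling by n/m."""
    n = Q.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.choice(n, size=m, replace=False)           # sampled key indices
    A_S = np.exp(Q @ K[S].T)                           # n x m slice, O(n m) work
    d_est = (n / m) * A_S.sum(axis=1, keepdims=True)   # unbiased row-sum estimate
    av_est = (n / m) * (A_S @ V[S])                    # unbiased estimate of A V
    return av_est / d_est

rng = np.random.default_rng(1)
n, d = 16, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
A = np.exp(Q @ K.T)
full = (A @ V) / A.sum(axis=1, keepdims=True)
# With m = n the estimate is exact; smaller m trades accuracy for speed.
assert np.allclose(sampled_attention(Q, K, V, m=n), full)
```

Choosing m much smaller than n is what brings the cost down from quadratic toward near-linear, at the price of approximation error controlled by the spectral guarantee.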
Notably, the proposed approach does not require bounded entries or bounded stable rank, and the fine-grained parameters for analyzing time complexity remain manageable, even when the entries in the attention matrix or the stable rank are large.
Furthermore, the team has observed that by conducting one-sided sampling from the squared row norms, they can eliminate the need for KDEs while achieving the same spectral norm guarantee in terms of stable rank.
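One-sided sampling from squared row norms can be illustrated with the standard length-squared (importance-weighted) matrix-product estimator; this is a generic sketch of that technique under the stated assumption, not the paper's implementation, and the function name is invented.

```python
import numpy as np

def length_squared_product(A, V, m, seed=0):
    """Unbiased estimate of A @ V from m rows of V sampled with
    probability proportional to their squared row norms."""
    rng = np.random.default_rng(seed)
    p = (V ** 2).sum(axis=1)
    p = p / p.sum()                        # squared-row-norm distribution
    S = rng.choice(len(p), size=m, p=p)    # sample indices with replacement
    w = 1.0 / (m * p[S])                   # importance weights for unbiasedness
    return A[:, S] @ (V[S] * w[:, None])

rng = np.random.default_rng(2)
A = np.exp(rng.standard_normal((6, 6)))
V = rng.standard_normal((6, 3))
# Averaging estimates over many seeds should recover A @ V (unbiasedness).
est = np.mean([length_squared_product(A, V, m=4, seed=s)
               for s in range(2000)], axis=0)
```

Because the sampling distribution depends only on V, no kernel density estimates over the attention matrix are needed, which is the simplification the team highlights.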
In empirical testing, HyperAttention outperforms existing methods, demonstrating substantial speed improvements compared to state-of-the-art solutions such as FlashAttention. For instance, HyperAttention accelerates the inference time of ChatGLM2 by 50% when handling a context length of 32,000 tokens, with only a slight increase in perplexity from 5.6 to 6.3. In scenarios involving larger context lengths, e.g., 131,000 tokens with causal masking, HyperAttention offers a remarkable 5-fold speedup on a single attention layer.
In conclusion, the introduction of HyperAttention marks a significant breakthrough in overcoming the scalability limitations of transformers, making them more adept at handling longer context lengths. This innovation promises to enhance the efficiency and effectiveness of Large Language Models, with notable speed gains in real-world applications.
The paper HyperAttention: Long-context Attention in Near-Linear Time is available on arXiv.
Author: Hecate He | Editor: Chain Zhang