Transformer architectures’ powerful attention mechanisms are capable of pushing SOTA performance across various natural language processing (NLP) tasks. The quadratic complexity of run time and memory usage for such attention mechanisms however has long been a critical bottleneck when processing long sequences.
In the new paper H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences, a Google Research team draws inspiration from two numerical analysis methods — Hierarchical Matrix (H-Matrix) and Multigrid — to address this quadratic complexity problem of attention mechanisms in transformer architectures. The researchers propose a hierarchical attention scheme that has linear complexity in run time and memory and achieves SOTA performance on NLP benchmarks.
In numerical analysis, the H-Matrix is employed as data-sparse approximations of non-sparse matrices, generalizing low-rank approximations via a matrix block hierarchy to substantially increase the compression rate. The Multigrid Method meanwhile is an algorithm for solving large-scale sparse matrices resulting from discretized partial differential equations using a hierarchy. Its advantage over other methods is that it can scale linearly with the number of discrete nodes used.
The team defines Multilevel Methods’ characteristics as performing no approximation for near interactions and applying progressively lower-precision approximations for progressively longer distance interactions. This leads them to hypothesize that the inductive biases embodied by the hierarchical structure of the attention matrix can be used to capture the hierarchical structure in the sequences of NLP tasks. They identify three steps to outline their hierarchical attention mechanism: 1) the attention matrix blocks for nearby, mid-distance and long-distance attention are separated, which is the first step toward distance-dependent attention; 2) a token hierarchy is established; 3) the hierarchical attention is constructed.
In general, the proposed hierarchical matrix structure establishes a hierarchical approach to the matrix-matrix multiplication and the matrix-vector multiplication. The team also shows that the overall run time and memory complexity of the hierarchical attention algorithm is O(dL), and that it only utilizes dense linear algebra operations optimized for GPUs or TPUs.
The team conducted extensive experiments to evaluate the effectiveness of the inductive biases embodied by the proposed hierarchical attention approach for capturing hierarchical structures in sequences, using the Long Range Arena (LRA) and One-Billion Word benchmarks.
The proposed hierarchical attention mechanism achieved the best performance on the LRA benchmark, bettering the previous-best Big Bird algorithm by more than six points; and established a new SOTA on the One-Billion Word benchmark, reducing test perplexity by 1.55 points compared to the previous-best Transformer-XL.
Overall, the proposed hierarchical attention approach demonstrates linear complexity in run time and memory usage and is fully compatible with dense linear algebra libraries on GPUs and TPUs. In the future, the team plans to apply the approach to music and genomics, develop proper inductive biases for cross-attention, and extend the method to 2D images.
The paper H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.