Transformer architectures have demonstrated impressive performance gains since their introduction in 2017 and are now the standard in natural language processing and computer vision research. Their wider real-world application is, however, limited by the massive computation and memory demands of the self-attention mechanism when capturing diverse syntactic and semantic representations from long input sequences.
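To see where those demands come from, consider a single softmax attention head: every token's query is compared against every token's key, producing an n × n score matrix that grows quadratically with sequence length. The sketch below is a minimal NumPy illustration of this standard computation, not code from the paper; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Standard softmax self-attention for one head.
    Q, K, V: (n, d) arrays. The (n, n) score matrix below is the
    quadratic bottleneck for long input sequences."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n) pairwise comparisons
    A = softmax(scores, axis=-1)             # attention weights, rows sum to 1
    return A @ V                             # (n, d) output sequence H

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
H = self_attention(Q, K, V)
```

Doubling n quadruples the size of the score matrix, which is exactly the cost that efficient-attention designs such as MrsFormer aim to reduce.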
In the new paper Transformers with Multiresolution Attention Heads, researchers propose MrsFormer, a novel transformer architecture that uses Multiresolution-head Attention (MrsHA) to approximate output sequences. Compared to a baseline softmax transformer, MrsFormer significantly reduces head redundancy without sacrificing accuracy. The paper is currently under double-blind review for ICLR 2023, and as such, the author and institution names remain masked.
The team summarizes their main contributions as follows:
- We derive the approximation of an attention head at different scales via two steps: i) directly approximating the output sequence H, and ii) approximating the value matrix V, i.e., the dictionary that contains the bases of H.
- We develop MrsHA, a novel MHA whose attention heads approximate the output sequences H_h, h = 1, ..., H, at different scales. We then propose MrsFormer, a new class of transformers that use MrsHA in their attention layers.
- We empirically verify that MrsFormer reduces head redundancy and achieves better efficiency than the baseline softmax transformer while attaining comparable accuracy.
Unlike transformers’ standard self-attention mechanism, which learns long-sequence representations by comparing input sequence tokens and modifying the corresponding output sequence positions, the proposed MrsFormer leverages multiresolution approximation (Mallat, 1999; 1989; Crowley, 1981). The method decomposes multi-head attention into fine-scale and coarse-scale heads and models attention patterns both between individual tokens and between groups of tokens. This yields an approximation of the attention heads at different scales by directly approximating the output sequence and the value matrix.
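The token-vs-token-group idea can be sketched as follows: a coarse-scale head attends to summaries of non-overlapping token groups rather than to every token, shrinking the attention map from n × n to n × (n / group_size). This is a hedged illustration of the general multiresolution principle under simple assumptions (average pooling, evenly divisible groups), not the paper's exact formulation; all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_scale_head(Q, K, V, group_size=4):
    """Coarse-scale attention sketch: average-pool keys and values over
    non-overlapping token groups, so each query attends to g = n // group_size
    group summaries instead of n tokens, approximating the output H at a
    coarser resolution."""
    n, d = K.shape
    g = n // group_size
    K_c = K[:g * group_size].reshape(g, group_size, d).mean(axis=1)  # (g, d)
    V_c = V[:g * group_size].reshape(g, group_size, d).mean(axis=1)  # (g, d)
    scores = Q @ K_c.T / np.sqrt(d)          # (n, g): much smaller score matrix
    return softmax(scores, axis=-1) @ V_c    # (n, d) coarse approximation of H

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
H_coarse = coarse_scale_head(Q, K, V, group_size=8)
```

Mixing heads like this one with ordinary fine-scale heads is what lets a multiresolution design trade redundancy in the attention heads for lower computation and memory cost.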
In their empirical study, the team compared MrsFormer with a baseline softmax transformer on tasks that included image and time-series classification. The results show that computing the attention heads with MrsFormer reduces head redundancy and cuts computation and memory costs while maintaining accuracy comparable to the baseline.
Overall, this work demonstrates the potential of a novel class of efficient transformers for significantly reducing computation and memory costs without sacrificing model performance.
The paper Transformers with Multiresolution Attention Heads is on OpenReview.
Author: Hecate He | Editor: Michael Sarazen