While vision transformers (ViTs) have achieved impressive performance in computer vision and advanced the state-of-the-art on various vision tasks, a bottleneck impeding further progress is the quadratic complexity of their self-attention with respect to input sequence length.
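To see where the quadratic cost comes from, consider a minimal NumPy sketch of standard scaled dot-product self-attention (not any particular ViT implementation): for n tokens, the softmax is applied to an n-by-n score matrix, so compute and memory both grow as O(n²).

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product self-attention.

    The n x n attention matrix makes both memory and compute
    scale quadratically with the number of tokens n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                  # (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                             # (n, d)

# Toy usage with random token features
n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (16, 8)
```

For image transformers, n is the number of patch tokens, so higher-resolution inputs quickly make the n x n matrix the dominant cost.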
In the NeurIPS 2021 spotlight paper SOFT: Softmax-free Transformer with Linear Complexity, researchers from Fudan University, the University of Surrey and Huawei Noah’s Ark Lab identify the root of ViTs’ quadratic complexity as the retention of the softmax operation in self-attention approximations. To alleviate this computational burden, the team proposes the first softmax-free transformer (SOFT), which reduces self-attention computation to linear complexity and achieves a superior trade-off between accuracy and complexity.
The team summarizes their study’s main contributions as:
- We introduce a novel softmax-free transformer with linear space and time complexity.
- Our attention matrix approximation is achieved through a novel matrix decomposition algorithm with a theoretical guarantee.
- To evaluate our method on visual recognition tasks, we design a family of generic backbone architectures with varying capacities, using SOFT as the core self-attention component. Extensive experiments show that, thanks to linear complexity, our SOFT models can take much longer image token sequences as input. As a result, with the same model size, SOFT outperforms state-of-the-art CNNs and ViT variants on ImageNet classification in the accuracy/complexity trade-off.
In traditional ViTs, given a sequence of tokens, each represented by a d-dimensional feature vector, self-attention computes the correlations of all token pairs, producing the problematic quadratic complexity. The proposed SOFT instead employs a softmax-free self-attention function in which the dot product is replaced by a Gaussian kernel. To address the resulting convergence and complexity issues, the researchers leverage a low-rank regularization, which reduces SOFT’s complexity significantly by never computing the full self-attention matrix.
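The low-rank idea can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper’s implementation: here landmark tokens are sampled uniformly at random (a hypothetical choice), the Gaussian kernel uses a fixed bandwidth, and the small landmark matrix is pseudo-inverted with `np.linalg.pinv`, whereas SOFT derives its decomposition with a theoretical guarantee. The point is only that every matrix formed is n-by-m or m-by-m with m fixed, so cost stays linear in n.

```python
import numpy as np

def gaussian_kernel(A, B):
    """exp(-||a - b||^2 / 2) similarity between every row of A and every row of B."""
    sq = (A**2).sum(-1)[:, None] + (B**2).sum(-1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq)

def lowrank_gaussian_attention(X, m=8, seed=0):
    """Sketch of softmax-free attention with a low-rank (landmark-based)
    approximation: only n x m and m x m kernel blocks are formed, so the
    full n x n attention matrix is never materialized."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)  # hypothetical landmark sampling
    Xm = X[idx]
    P = gaussian_kernel(X, Xm)                  # (n, m)
    A = gaussian_kernel(Xm, Xm)                 # (m, m)
    # Pseudo-inverse of the small landmark block stands in for the
    # paper's decomposition; full kernel ~ P @ pinv(A) @ P.T.
    return P @ (np.linalg.pinv(A) @ (P.T @ X))  # (n, d), linear in n

# Toy usage with random token features
n, d = 32, 8
X = np.random.default_rng(1).standard_normal((n, d))
out = lowrank_gaussian_attention(X, m=8)
print(out.shape)  # (32, 8)
```

Because m is a constant, doubling the token count only doubles the work, which is what lets SOFT-style models ingest much longer token sequences at fixed budget.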
The team evaluated the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset, reporting top-1 accuracy for model performance, and model size and floating-point operations (FLOPs) to assess cost-effectiveness.
SOFT achieved the best performance in the experiments, surpassing the recent pure vision-transformer methods ViT and DeiT as well as the state-of-the-art CNN RegNet, and outperforming all variants of its most architecturally similar counterpart, the Pyramid Vision Transformer (PVT).
Overall, the study shows that SOFT’s novel design eliminates the need for softmax normalization and yields a superior trade-off between accuracy and complexity.
The paper SOFT: Softmax-free Transformer with Linear Complexity is on arXiv.
Author: Hecate He | Editor: Michael Sarazen