Following transformers’ SOTA performance on natural language processing tasks, a new breed of vision transformers (ViTs) is emerging as a game-changer in the computer vision field. ViTs, however, have inherited the original transformers’ hearty compute appetites: their high image-processing costs stem largely from the quadratic number of interactions between their tokens.
In the new paper AdaViT: Adaptive Tokens for Efficient Vision Transformer, an Nvidia research team proposes AdaViT, an input-dependent mechanism that adaptively adjusts ViT inference cost by halting the compute of different tokens at different depths to reserve compute only for discriminative tokens.
The researchers summarize their study’s main contributions as:
- We introduce a method for input-dependent inference in vision transformers that allows us to halt the computation for different tokens at different depths.
- We base the learning of adaptive token halting on the existing embedding dimensions of the original architecture, requiring no extra parameters or compute for halting.
- We introduce a distributional prior regularization that guides halting towards a specific distribution and average token depth, stabilizing adaptive computation time (ACT) training.
- We analyze how token depth varies across different images and provide insights into the attention mechanism of vision transformers.
- We empirically show that the proposed method improves throughput by up to 62 percent on hardware with a minor drop in accuracy.
In ViTs, an encoding network tokenizes image patches into position-embedded tokens. The internal computational flow of transformer blocks allows the number of tokens to change from one layer to the next, so the compute burden can be reduced by dropping some tokens via an importance-based halting mechanism. Moreover, because ViTs can learn a global halting mechanism that monitors all layers jointly, designing a token halting mechanism is easier in ViTs than in convolutional neural networks (CNNs), which require explicit handling of varying architectural dimensions.
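The compute saving comes from the fact that attention cost scales with sequence length, so removing tokens between layers directly shrinks subsequent work. A minimal sketch of this idea, with hypothetical token embeddings and importance scores (the real AdaViT scores come from its halting module, not a fixed threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: 8 patch tokens for one image, embedding dim 16.
tokens = rng.standard_normal((8, 16))

# Hypothetical per-token importance scores in [0, 1].
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.3])

# Drop low-importance tokens; the sequence length shrinks, so the next
# layer's (quadratic) attention computation becomes cheaper.
keep = scores > 0.5
active_tokens = tokens[keep]

print(active_tokens.shape)  # only the surviving tokens flow onward
```

Because transformer blocks are agnostic to sequence length, the remaining tokens can be fed to the next layer with no architectural change.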
To halt tokens adaptively, the team introduces an input-dependent halting score for each token as a halting probability (halting module). The halting score of each token is defined in a range of zero to one, and accumulative importance is used to halt tokens as inference progresses into deeper layers. To alleviate any dependency on dynamically halted tokens between adjacent layers, the researchers apply a token masking strategy to keep the computational cost of training iterations similar to the original ViT’s training cost. During the inference process, they simply remove the halted tokens from computation to measure the actual speedup gained by the halting mechanism. Finally, they incorporate this halting module into the existing ViT block by allocating a single neuron in the multilayer perceptron (MLP) layer to perform the task. The proposed halting mechanism thus requires no additional learnable parameters or compute.
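The accumulation-based halting described above can be sketched as follows. The per-layer scores here are hypothetical placeholder values, and `run_with_halting` is an illustrative function name; in AdaViT the scores are read from a single reserved neuron of each block's MLP output rather than supplied externally:

```python
import numpy as np

def run_with_halting(layer_scores, eps=0.01):
    """Accumulate per-layer halting scores per token and record the layer
    at which each token's cumulative score first reaches 1 - eps.

    layer_scores: (num_layers, num_tokens) array of halting probabilities
    in [0, 1] (hypothetical values for illustration).
    """
    num_layers, num_tokens = layer_scores.shape
    cumulative = np.zeros(num_tokens)
    halt_depth = np.full(num_tokens, num_layers)   # default: survives all layers
    for depth, scores in enumerate(layer_scores):
        active = cumulative < 1 - eps              # tokens still being computed
        cumulative[active] += scores[active]       # accumulate only active tokens
        just_halted = active & (cumulative >= 1 - eps)
        halt_depth[just_halted] = depth + 1        # halted at this layer
    return halt_depth

# Three tokens over four layers: the first gets high scores and halts early,
# while the last (a "discriminative" token) survives to the final layer.
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.5, 0.4, 0.1],
    [0.5, 0.4, 0.1],
])
print(run_with_halting(scores))
```

At training time the halted tokens would be masked out (keeping tensor shapes fixed), while at inference they are physically removed from computation, which is what yields the measured speedup.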
The researchers evaluated AdaViT for classification tasks on the large-scale 1000-class ImageNet ILSVRC 2012 dataset at 224 × 224 pixels.
Compared to baseline models, AdaViT reduced FLOPs by 39 percent without extra parameters and with only a minor loss in accuracy. Moreover, AdaViT directly improved the throughputs of the DeiT small and tiny variants by 38 percent and 62 percent, respectively, without any hardware modification and with only a 0.3 percent accuracy drop.
Overall, the results show that AdaViT can improve ViT throughput on hardware without the need for extra parameters or transformer block modifications, outperforming prior dynamic approaches. The team hopes their work can offer insights and inspire future research on improving the efficiency of ViTs.
The paper AdaViT: Adaptive Tokens for Efficient Vision Transformer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen