Building on the epoch-making performance of transformer architectures in natural language processing (NLP), the vision transformer (ViT) has emerged as one of the most advanced architectures for computer vision (CV) tasks, demonstrating excellent capabilities in modelling both short- and long-range information compared to conventional convolutional neural network (CNN) approaches. The main bottleneck limiting further ViT development and deployment is its quadratic computational complexity, which makes the modelling of high-resolution images prohibitively expensive.
In the new paper Global Context Vision Transformers, an NVIDIA research team proposes the Global Context Vision Transformer (GC ViT), a novel yet simple hierarchical ViT architecture comprising a global self-attention and token generation modules that enables the efficient modelling of both short- and long-range dependencies without costly compute operations while achieving SOTA results across various computer vision (CV) tasks.
The team summarizes their main contributions as:
- A novel hierarchical Transformer model called GC ViT that can be employed as a general backbone in various computer vision tasks such as classification, detection and instance segmentation.
- A novel yet simple design comprising global self-attention and token generation modules that allows for modelling long-range dependencies by capturing global contextual information and hence eliminates the need for highly sophisticated or complex operations.
- The proposed GC ViT achieves new SOTA benchmarks on the ImageNet-1K dataset for a variety of model sizes and FLOPs, outperforming both CNN and ViT-based models by a significant margin. Using GC ViT as the backbone yields SOTA or competitive performance for object detection and semantic segmentation on the MS COCO and ADE20K datasets, respectively.
The GC ViT architecture is a hierarchical framework that captures feature representations at multiple resolutions. Given an input image, the model obtains overlapping patches by applying a specified convolutional layer with appropriate padding.
Each GC ViT processing stage employs alternating local and global self-attention modules for spatial feature extraction. The global self-attention accesses global features extracted by a novel Global Token Generator (GTG), and the resulting features are passed through average pooling and linear layers to generate an embedding for downstream tasks.
In their empirical studies, the team evaluated the proposed GC ViT on CV tasks such as image classification, objection detection, instance segmentation and semantic segmentation.
In the evaluations, GC ViT models achieved a new SOTA image classification score of 84.4 percent Top-1 accuracy on the ImageNet-1K dataset; and consistently surpassed both ConvNeXt and Swin Transformer baselines by a significant margin. GC ViT also obtained SOTA or competitive results in object detection and semantic segmentation tasks on the MS COCO and ADE20K datasets.
Overall, this work demonstrates the proposed GC ViT’s ability to effectively capture global context and reach SOTA performance on CV tasks. While GC ViT does not increase the computational cost, the paper notes that — as with any transformer architecture — training remains relatively expensive, and suggests adopting techniques such as limited precision or quantization could enable more efficient GC ViT training.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.