In the ICCV (International Conference on Computer Vision) 2021 paper awards announced last month, a team from Microsoft Research Asia was honoured with the coveted ICCV 2021 Marr Prize for best paper for their submission Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. The authors include Ze Liu from the University of Science and Technology of China, Yutong Lin from Xi’an Jiaotong University, and Yue Cao and Han Hu from Microsoft. In the paper, which was selected from over 1,600 submissions, the researchers proposed Swin Transformer, a new vision transformer (ViT) that can be used as a generic backbone in computer vision.
Compared with previous ViT models, Swin Transformer introduced two improvements: it adopts the hierarchical construction commonly used in CNNs to build a hierarchical transformer, and it incorporates the concept of locality by computing self-attention within non-overlapping windows. Soon after the Swin Transformer paper was released, Microsoft open-sourced the code and pretrained models, which cover image classification, object detection and semantic segmentation tasks.
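The idea of computing self-attention within non-overlapping windows can be illustrated with a minimal NumPy sketch. It is not the released Swin Transformer code: the function names `window_partition` and `windowed_self_attention` are chosen here for illustration, the attention has no learned projections, and the feature map height and width are assumed to be divisible by the window size.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes to the front, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, C)

def windowed_self_attention(x, window_size):
    """Softmax self-attention computed independently inside each window.

    Illustration only: queries, keys and values are the raw features.
    """
    windows = window_partition(x, window_size)               # (nW, N, C)
    scores = windows @ windows.transpose(0, 2, 1)             # (nW, N, N)
    scores = scores / np.sqrt(windows.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows                                  # (nW, N, C)

if __name__ == "__main__":
    feature_map = np.random.default_rng(0).normal(size=(8, 8, 3))
    out = windowed_self_attention(feature_map, window_size=4)
    print(out.shape)  # four windows of 16 tokens each
```

Because each window holds a fixed number of tokens, the cost of attention grows linearly with the number of windows rather than quadratically with the total number of image tokens.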
This week, the same research team introduced Swin Transformer V2, an upgraded version that can scale up to three billion parameters, is capable of training with images of up to 1,536×1,536 resolution, and advances the SOTA on four vision task benchmarks.
Scaling up language models has proven incredibly successful in improving performance on natural language processing (NLP) tasks, but similar scaling up of vision models has remained relatively underexplored. The new paper aims to leverage scaling power for computer vision tasks, and the researchers identify two key issues associated with this challenge: 1) Experiments with large vision models reveal an instability issue in training; and 2) The discrepancy of activation amplitudes across layers becomes significantly greater in large models.
To address these issues, the team proposed post-norm, a normalization configuration that moves the layer normalization (LN) layer from the beginning of each residual unit to the end, producing much milder activation values across the network layers. They also introduced scaled cosine attention, which makes the attention computation independent of the amplitudes of block inputs, so that attention values are less likely to fall into extremes.
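The two changes can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the layer normalization has no learned scale or shift, the temperature `tau` is a fixed constant here (treating it as learnable is left out), and the function names are chosen for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_unit(x, sublayer):
    """Pre-norm residual unit: LN is applied before the sublayer."""
    return x + sublayer(layer_norm(x))

def post_norm_unit(x, sublayer):
    """Post-norm residual unit: LN is applied to the sublayer output,
    so the value added to the residual stream has bounded amplitude."""
    return x + layer_norm(sublayer(x))

def scaled_cosine_attention(q, k, v, tau=0.1, bias=None):
    """Attention whose logits are cosine similarities divided by tau.

    Rescaling q or k leaves the result unchanged, because only their
    directions enter the similarity.
    """
    q_unit = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k_unit = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = (q_unit @ k_unit.T) / tau
    if bias is not None:
        logits = logits + bias                     # optional position bias
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 6, 8))
    same = np.allclose(scaled_cosine_attention(q, k, v),
                       scaled_cosine_attention(100 * q, 100 * k, v))
    print("invariant to input amplitude:", same)
```

With dot-product attention, multiplying the inputs by 100 would push the softmax towards a one-hot distribution; with cosine similarity the logits stay within the range set by `tau`.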
Many downstream vision tasks require high-resolution input images or large attention windows, and window size variations between low-resolution pretraining and high-resolution fine-tuning can be quite large. The team thus proposed Log-CPB, a log-spaced continuous position bias, to generate biases for arbitrary coordinate ranges by using a small meta network. This innovation enables a pretrained model to freely transfer across window sizes by sharing weights of the meta network.
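A rough sketch of how a log-spaced continuous position bias could work is shown below. The log-spaced mapping sign(Δ)·log(1 + |Δ|) follows the formulation given in the paper; everything else is an assumption made for illustration, including the `MetaNetwork` class name, its two-layer ReLU MLP with 16 hidden units, and its random (untrained) weights.

```python
import numpy as np

def log_spaced(delta):
    """Map linear coordinate offsets to log-spaced ones: sign(d) * log(1 + |d|)."""
    return np.sign(delta) * np.log1p(np.abs(delta))

def relative_coords(window_size):
    """Pairwise (row, col) offsets between all tokens in a square window.

    Returns an integer array of shape (N, N, 2) with N = window_size ** 2.
    """
    grid = np.stack(
        np.meshgrid(np.arange(window_size), np.arange(window_size), indexing="ij"),
        axis=-1,
    )
    flat = grid.reshape(-1, 2)
    return flat[:, None, :] - flat[None, :, :]

class MetaNetwork:
    """Hypothetical small MLP mapping a 2-D log-spaced offset to a scalar bias."""

    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(2, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=(hidden, 1))

    def __call__(self, window_size):
        rel = log_spaced(relative_coords(window_size).astype(float))  # (N, N, 2)
        hidden = np.maximum(rel @ self.w1 + self.b1, 0.0)             # ReLU
        return (hidden @ self.w2)[..., 0]                             # (N, N)

if __name__ == "__main__":
    net = MetaNetwork()
    # The same weights produce bias tables for different window sizes.
    print(net(8).shape, net(16).shape)
```

Because the bias is computed from coordinates rather than looked up in a fixed-size table, the same weights serve any window size. The log spacing also compresses large offsets, so a larger fine-tuning window requires less extrapolation beyond the coordinate range seen in pretraining.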
The scaling up of model capacity and resolution can also result in prohibitively high GPU memory consumption. To resolve this issue, the team incorporated techniques such as the Zero Redundancy Optimizer (ZeRO), activation checkpointing, and a novel implementation of sequential self-attention computation. Collectively, these significantly reduce GPU memory consumption with only a marginal effect on training speed.
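One plausible reading of sequential self-attention computation is to process queries in chunks, so that only a slice of the attention matrix is held in memory at any time. The NumPy sketch below illustrates that reading under stated assumptions; it is not the team's implementation, and the function names and the `chunk_size` parameter are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(q, k, v):
    """Batch computation: materializes the whole (N, N) attention matrix."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def sequential_attention(q, k, v, chunk_size):
    """Process queries chunk by chunk.

    Peak memory for attention scores is (chunk_size, N) instead of (N, N);
    the result is identical because each query row is independent.
    """
    out = np.empty((q.shape[0], v.shape[-1]), dtype=v.dtype)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for start in range(0, q.shape[0], chunk_size):
        stop = start + chunk_size
        scores = (q[start:stop] @ k.T) * scale
        out[start:stop] = softmax(scores) @ v
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.normal(size=(3, 10, 4))
    match = np.allclose(full_attention(q, k, v),
                        sequential_attention(q, k, v, chunk_size=3))
    print("chunked result matches full attention:", match)
```

The trade-off is the one the article describes: the loop adds some overhead, but the memory held for attention scores shrinks in proportion to the chunk size.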
Leveraging the abovementioned techniques, the team successfully trained a three-billion-parameter Swin Transformer model and effectively transferred it to various vision tasks with image resolutions as large as 1,536×1,536.
To evaluate the performance of Swin Transformer V2, the researchers conducted experiments on ImageNet-1K image classification, COCO object detection, ADE20K semantic segmentation and Kinetics-400 video action recognition.
The proposed Swin Transformer V2 set new records on four representative vision benchmarks: 84.0 percent top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8 percent top-1 accuracy on Kinetics-400 video action classification.
The team hopes their work will encourage additional research in this direction, and that it will eventually be possible to close the capacity gap between vision and language models and facilitate joint modelling of the two domains.
Author: Hecate He | Editor: Michael Sarazen