Transformer architectures are transforming computer vision. Introduced in 2020, the Vision Transformer (ViT) connects image patches globally through self-attention, an idea that video models extend across the temporal dimension, and transformers are increasingly displacing convolutional neural networks (CNNs) as the modelling choice for researchers in this field.
In the new paper Video Swin Transformer, a research team from Microsoft Research Asia, University of Science and Technology of China, Huazhong University of Science and Technology, and Tsinghua University takes advantage of the inherent spatiotemporal locality of videos to propose a pure-transformer backbone architecture for video recognition that leads to better speed-accuracy trade-offs and achieves state-of-the-art performance on a wide range of video recognition benchmarks.
The great successes of contemporary image transformers have inspired the computer vision community to develop transformer-based architectures for video-based recognition tasks. Examples include this February’s Video Transformer Network (VTN), which added a temporal attention encoder on top of a pretrained ViT to yield better performance, and April’s trained-from-scratch Multiscale Vision Transformer (MViT), which reduced computation by pooling attention for spatiotemporal modelling. While such models are based on global self-attention modules, the Video Swin Transformer researchers say theirs is the first study to investigate spatiotemporal locality biases, and that this approach surpasses the performance of previous models based on global self-attention.
The proposed Video Swin Transformer strictly follows the hierarchical architecture of March’s Swin Transformer for image recognition, which comprises four stages and performs 2× spatial downsampling in the patch merging layer of each stage. The major component in the new architecture is the Video Swin Transformer block, which consists of a 3D shifted-window-based multi-head self-attention (MSA) module followed by a feed-forward network.
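To make the hierarchy concrete, here is a minimal Python sketch of how the token grid might shrink from stage to stage. The function name stage_shapes, the 2×4×4 patch size, the base width of 96 channels, and the doubling of channels at each merge are illustrative assumptions that are not stated in this article. The sketch also assumes that patch merging is applied between consecutive stages and leaves the temporal dimension untouched.

```python
def stage_shapes(frames, height, width, patch=(2, 4, 4), channels=96, stages=4):
    """Return the (T, H, W, C) token-grid shape seen by each stage.

    All default values are assumptions chosen for illustration only.
    """
    pt, ph, pw = patch
    # Patch partition: each pt x ph x pw block of pixels becomes one token.
    t, h, w, c = frames // pt, height // ph, width // pw, channels
    shapes = [(t, h, w, c)]
    for _ in range(stages - 1):
        # Patch merging: 2x spatial downsampling, time unchanged,
        # channel width doubled (assumed, following common practice).
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((t, h, w, c))
    return shapes


if __name__ == "__main__":
    # Hypothetical input clip: 32 frames of 224 x 224 pixels.
    for stage, shape in enumerate(stage_shapes(32, 224, 224), start=1):
        print(f"stage {stage}: T, H, W, C = {shape}")
```

Because only the spatial axes are halved, the number of tokens falls by a factor of four at each merge while the number of frames represented stays fixed in this sketch.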
Videos have a temporal dimension not found in images, and so they require a much higher number of tokens. This leads to huge computation and memory burdens when using a global self-attention module. To reduce these computational costs, the team introduced a locality inductive bias to the self-attention module. They also extended Swin Transformer’s shifted 2D window mechanism to 3D windows to enforce cross-window connections while maintaining the efficient computation of non-overlapping window-based self-attention.
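The effect of local windows and of shifting them can be illustrated with a small, self-contained Python sketch. The grid size, window size, shift, and the helper names window_id and attention_pairs are hypothetical toy values chosen for readability, not the paper's settings. The sketch simply displaces the window grid by half a window along each axis, and it counts query-key pairs as a rough proxy for attention cost; it does not reproduce how the authors implement the shift.

```python
from collections import Counter
from itertools import product
from math import prod

GRID = (4, 8, 8)    # toy token grid (T, H, W); hypothetical size
WINDOW = (2, 4, 4)  # toy 3D window size; hypothetical
SHIFT = (1, 2, 2)   # half a window along each axis


def window_id(pos, window=WINDOW, shift=(0, 0, 0)):
    """Label of the 3D window that contains the token at pos = (t, h, w)."""
    # Floor division also handles the partial windows created at the borders.
    return tuple((p - s) // m for p, s, m in zip(pos, shift, window))


def attention_pairs(grid=GRID, window=WINDOW, shift=(0, 0, 0)):
    """Count query-key pairs when attention is restricted to each window."""
    positions = product(*(range(n) for n in grid))
    sizes = Counter(window_id(pos, window, shift) for pos in positions)
    return sum(n * n for n in sizes.values())


tokens = prod(GRID)
global_pairs = tokens ** 2       # every token attends to every token
local_pairs = attention_pairs()  # attention only inside regular windows

if __name__ == "__main__":
    print("global pairs:", global_pairs)
    print("windowed pairs:", local_pairs)
    a, b = (1, 3, 3), (2, 4, 4)  # neighbouring tokens across a window border
    print("same regular window:", window_id(a) == window_id(b))
    print("same shifted window:",
          window_id(a, shift=SHIFT) == window_id(b, shift=SHIFT))
```

In this toy setting the two neighbouring tokens fall into different windows under the regular partition but share a window once the partition is shifted, which is the kind of cross-window connection the shifted layout is meant to provide.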
The team compared their proposed Video Swin Transformer to various state-of-the-art convolution-based and transformer-based architectural backbones on the Kinetics-400, Kinetics-600 and Something-Something v2 datasets.
Video Swin Transformer achieved 84.9 percent top-1 accuracy on Kinetics-400, 86.1 percent top-1 accuracy on Kinetics-600 with ∼20× less pre-training data and a ∼3× smaller model size, and 69.6 percent top-1 accuracy on Something-Something v2. The results demonstrate the superior performance of the proposed spatiotemporal locality bias approach relative to global self-attention-based methods and other vision transformers on video recognition tasks.
The code is available on the project’s GitHub. The paper Video Swin Transformer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang