In a new paper Image Captioners Are Scalable Vision Learners Too, a DeepMind research team presents CapPa, an image captioning-based pretraining strategy that can compete with CLIP and exhibits favorable model and data scaling properties, verifying that plain image captioning can be a competitive pretraining strategy for vision backbones.
In the new paper SpectFormer: Frequency and Attention Is What You Need in a Vision Transformer, a research team from Microsoft and the University of Bath proposes SpectFormer, a novel transformer architecture that combines spectral and multi-headed attention layers to better capture appropriate feature representations and improve performance.
In the new paper Semi-supervised Vision Transformers at Scale, a research team from AWS AI Labs proposes a semi-supervised learning pipeline for vision transformers that is stable, reduces hyperparameter tuning sensitivity, and outperforms conventional convolutional neural networks.
In the new paper Global Context Vision Transformers, an NVIDIA research team proposes the Global Context Vision Transformer, a novel yet simple hierarchical ViT architecture comprising global self-attention and token generation modules that enables the efficient modelling of both short- and long-range dependencies without costly compute operations while achieving SOTA results across various computer vision tasks.
A Microsoft Research team proposes FocalNet (Focal Modulation Network), a simple and attention-free architecture designed to replace transformers’ self-attention module. FocalNets exhibit significant superiority over self-attention for effective and efficient visual modelling in real-world applications.
A team from Facebook AI Research and UC Berkeley proposes ConvNeXt, a pure ConvNet model that achieves performance comparable with state-of-the-art hierarchical vision transformers on computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets.
A research team from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University and UC Santa Cruz seeks to apply the proven technique of masked language modelling to the training of better vision transformers, presenting iBOT (image BERT pretraining with Online Tokenizer), a self-supervised framework that performs masked prediction with an online tokenizer.
In the new paper Can Vision Transformers Perform Convolution?, a research team from Peking University, UCLA and Microsoft Research proves constructively that a single ViT layer with image patches as the input can perform any convolution operation, and shows that ViT performance in low data regimes can be significantly improved using their proposed ViT training pipeline.