Masked language modelling (MLM) is a pretraining paradigm in which a model learns to predict tokens that have been masked out of its input, relying on a tokenizer that splits text into semantically meaningful pieces. Although MLM is a main contributor to the remarkable performance of transformers on natural language processing tasks, its potential application to the emerging vision transformers (ViTs) that are revolutionizing computer vision research remains relatively underexplored.
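The masking objective can be illustrated with a toy sketch (the function name, mask rate and mask symbol are illustrative choices, not taken from any particular library):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol, BERT-style.

    Returns the corrupted sequence and the (index, original token) pairs
    the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i in range(len(masked)):
        if rng.random() < mask_rate:
            targets.append((i, masked[i]))
            masked[i] = mask_token
    return masked, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split(), mask_rate=0.5)
```

A pretrained model is rewarded for recovering the original tokens at the masked positions from the surrounding context.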
In a new paper, a research team from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University and UC Santa Cruz seeks to apply MLM to the training of better ViTs, presenting iBOT (image BERT pre-training with Online Tokenizer), a self-supervised framework that performs masked prediction with an online tokenizer.
MLM-pretrained transformers have demonstrated their success and scalability across a range of language tasks, and this has led many researchers working in computer vision to wonder whether ViTs might also benefit from some form of MLM.
To find the answer, the researchers explore masked image modelling (MIM) and the advantages and challenges of using a semantically meaningful visual tokenizer. The team first identifies the lingual tokenizer, which transforms language into semantically meaningful tokens, as the most crucial MLM component. They propose that enabling MIM requires designing an analogous component, a visual tokenizer, to transform masked patches into supervisory signals for the target model. This is challenging: unlike lingual semantics, which emerge naturally from the statistical analysis of word frequency, visual semantics cannot be extracted easily because images are continuous-valued.
The researchers created iBOT to perform MIM with a well-designed visual tokenizer. They formulate MIM as a knowledge distillation (KD) problem, in which the target network learns to distill knowledge from the tokenizer, and propose performing self-distillation for MIM with a twin teacher serving as the online tokenizer. In this setup, the target network takes masked images as input while the online tokenizer sees the original images, and the target network is trained to recover each masked patch token to its corresponding tokenizer output.
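The core of this objective can be sketched as a cross-entropy between teacher and student token distributions over the masked patches. The sketch below is a heavily simplified illustration under assumed shapes (per-patch projection-head logits, with illustrative temperatures); it omits the ViT backbones, projection heads and multi-crop views of the actual method:

```python
import numpy as np

def softmax(x, tau):
    """Temperature-scaled softmax over the last axis."""
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mim_self_distillation_loss(student_logits, teacher_logits, mask,
                               tau_s=0.1, tau_t=0.05):
    """Cross-entropy between the teacher's (online tokenizer's) and the
    student's per-patch token distributions, averaged over masked patches.

    student_logits, teacher_logits: (num_patches, vocab)-shaped arrays;
    the student saw the masked image, the teacher saw the full image.
    mask: boolean array marking which patches were masked for the student.
    """
    p_t = softmax(teacher_logits, tau_t)              # teacher targets
    log_p_s = np.log(softmax(student_logits, tau_s))  # student predictions
    ce = -(p_t * log_p_s).sum(axis=-1)                # per-patch cross-entropy
    return ce[mask].mean()
```

Only the masked positions contribute to the loss, so the student must reconstruct the teacher's token assignment for patches it never saw.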
The team identifies two natural advantages of their tokenizer: 1) it captures high-level visual semantics, progressively learned by enforcing similarity between cross-view images on their class tokens; 2) it requires no additional pre-processing training stage, since it is jointly optimized with MIM via a momentum update.
In their empirical study, the team evaluated iBOT on the ImageNet-1K classification benchmark with five protocols: k-NN, linear probing, fine-tuning, semi-supervised learning, and unsupervised learning. They also transferred iBOT to downstream tasks such as object detection and instance segmentation on COCO, and semantic segmentation on ADE20K.
The results show that iBOT advances the ImageNet-1K classification benchmark under the k-NN (77.1%), linear probing (79.5%) and fine-tuning (83.8%) protocols, respectively 1.0%, 1.3% and 0.2% higher than the previous best results. Along with its state-of-the-art image classification performance, iBOT also improved on previous results across all three downstream tasks.
Overall, this work demonstrates the promising potential of BERT-like pretraining for ViTs, showing that an MIM approach can not only achieve high accuracy but also improve robustness against common image corruptions.
The paper iBOT: Image BERT Pre-Training with Online Tokenizer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen