Since their introduction in 2020, vision transformers (ViTs) have achieved impressive performance across various computer vision tasks under supervised learning settings, where models are trained with labelled data. ViTs, however, require more training data and have a weaker inductive bias than their convolutional neural network (CNN) counterparts, and so have struggled under semi-supervised learning (SSL) settings, where only a small fraction of the training data is labelled.
In the new paper Semi-supervised Vision Transformers at Scale, a research team from AWS AI Labs proposes Semi-ViT, an SSL pipeline for ViTs that is stable, reduces hyperparameter tuning sensitivity, and outperforms conventional CNNs.
The proposed Semi-ViT SSL pipeline comprises three stages: 1) Optional self-supervised pretraining on all data without any labels, 2) Standard supervised fine-tuning on all available labelled data, and 3) Semi-supervised fine-tuning on both labelled and unlabelled data.
In the semi-supervised fine-tuning stage, the team adopts the exponential moving average (EMA)-Teacher framework rather than the recently popular FixMatch student-teacher framework. The paper explains that in many cases, FixMatch with ViT does not converge and generally underperforms CNNs. The EMA-Teacher framework, meanwhile, is more stable and achieves better performance with semi-supervised ViTs.
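The core idea behind an EMA teacher can be illustrated with a few lines of Python. This is a minimal sketch of the generic update rule, not the paper's implementation: the function name `ema_update`, the flat list of weights, and the default momentum of 0.999 are all assumptions made for illustration. The teacher's weights are never trained directly; after each student step they move slightly toward the student's current weights.

```python
def ema_update(teacher_weights, student_weights, momentum=0.999):
    """Return new teacher weights as an exponential moving average of the student.

    Hypothetical helper: weights are plain lists of floats, and the
    momentum default is an illustrative value, not taken from the paper.
    """
    if len(teacher_weights) != len(student_weights):
        raise ValueError("teacher and student must have the same number of weights")
    # A momentum close to 1 keeps the teacher changing slowly.
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


# Toy usage with a deliberately low momentum so the change is visible.
teacher = [1.0, 2.0]
student = [3.0, 4.0]
teacher = ema_update(teacher, student, momentum=0.5)
```

In general, a slowly moving teacher produces pseudo labels that change smoothly between training steps, which is one common intuition for why EMA-based setups tend to train stably.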
The team also introduces a “probabilistic pseudo mixup” mechanism to interpolate unlabelled samples and their pseudo labels. This approach can leverage information from all samples, enhance regularization, and alleviate ViTs’ weak inductive bias to realize substantial performance gains.
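To make the interpolation step concrete, here is a minimal sketch of standard mixup applied to two unlabelled samples and their pseudo-label distributions. It is not the paper's probabilistic variant, whose details are not covered here: the mixing ratio is simply drawn from a Beta distribution as in ordinary mixup, and the names `mixup_pair` and `sample_lambda`, the list-based inputs, and the `alpha` value are assumptions for illustration.

```python
import random


def sample_lambda(alpha=0.8, rng=random):
    """Draw a mixing ratio in [0, 1] from Beta(alpha, alpha), as in standard mixup.

    The alpha default is an illustrative assumption.
    """
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two samples and their (pseudo) label distributions.

    x_a, x_b: flattened inputs as lists of floats.
    y_a, y_b: pseudo-label probability vectors as lists of floats.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy usage: two 2-pixel "images" with one-hot pseudo labels over two classes.
mixed_x, mixed_y = mixup_pair([0.0, 2.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0], lam=0.25)
```

Because both the inputs and the pseudo labels are blended with the same ratio, the student is trained on soft targets between the two samples, which is the regularizing effect the article refers to.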
The team evaluated the proposed Semi-ViT SSL pipeline against state-of-the-art SSL models and fully supervised models on the benchmark ImageNet dataset. The results show that Semi-ViT outperforms state-of-the-art SSL approaches such as MPL-RN-50 and CowMix-RN152 without any additional parameters or architectural changes, and achieves performance comparable with fully supervised models. Semi-ViT also scales well, achieving 80 percent top-1 accuracy on ImageNet with only 1 percent of the labels.
Overall, this work shows that semi-supervised ViTs can surpass their CNN counterparts, demonstrating promising potential for advancing semi-supervised learning.
The paper Semi-supervised Vision Transformers at Scale is on arXiv.
Author: Hecate He | Editor: Michael Sarazen