Recent studies have shown that vision transformer (ViT) models can outperform most state-of-the-art convolutional neural networks (CNNs) across various image recognition tasks while using considerably fewer computational resources. This has led some researchers to propose that ViTs could replace CNNs in computer vision.
However, despite their promising performance, ViTs are sensitive to the choice of optimizer, the selection of dataset-specific learning hyperparameters, network depth and other factors. CNNs, in contrast, are far more robust and exceptionally easy to optimize.
These ViT/CNN trade-offs inspired a recent study by a research team from Facebook AI and UC Berkeley that proposes a best-of-both-worlds solution. In the paper Early Convolutions Help Transformers See Better, the team employs a standard, lightweight convolutional stem for ViT models that dramatically increases optimizer stability and improves peak performance without sacrificing computational efficiency.
ViTs offer the computer vision community a novel alternative to CNNs, and a growing body of research seeks to improve ViTs via multi-scale networks, increased depth, locality priors, etc. This paper takes a new tack, focusing instead on ViT models’ unstable optimization. The team conjectures that the optimizability differences between ViTs and CNNs lie primarily in ViTs’ early visual processing, which “patchifies” the input image into non-overlapping patches to form the transformer encoder’s input set. This patchify stem is implemented as a large-stride convolution, whereas studies of typical CNN designs show that best practices converge on a stem built from a small stack of stride-two 3×3 kernels.
Based on these observations, the team restricted convolutions in ViT to early visual processing, replacing the patchify stem with a convolutional counterpart and removing one transformer block to compensate for the convolutional stem’s extra flops.
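The difference between the two stem designs can be made concrete with a small back-of-the-envelope script. The pure-Python sketch below (the channel widths for the convolutional stem are illustrative assumptions, not the paper’s exact configuration) compares a 16×16, stride-16 patchify convolution against a stack of stride-two 3×3 convolutions followed by a 1×1 projection; both produce the same 14×14 grid of tokens from a 224×224 input.

```python
def conv2d_params(c_in, c_out, k):
    """Weight + bias count of a single k x k convolution."""
    return c_out * (c_in * k * k + 1)

def stem_summary(layers, img=224):
    """layers: list of (c_in, c_out, kernel, stride).
    Returns (total parameters, output tokens per image).
    Assumes 'same'-style padding, so stride alone sets the output size."""
    params, size = 0, img
    for c_in, c_out, k, s in layers:
        params += conv2d_params(c_in, c_out, k)
        size = size // s
    return params, size * size

# Patchify stem of a ViT-B-style model: one 16x16, stride-16 convolution
# mapping 3 input channels to a 768-dim embedding.
patchify = [(3, 768, 16, 16)]

# Hypothetical convolutional stem: four stride-2 3x3 convolutions that
# halve the resolution, then a 1x1 projection to the embedding dimension
# (channel widths here are illustrative, not taken from the paper).
conv_stem = [(3, 48, 3, 2), (48, 96, 3, 2), (96, 192, 3, 2),
             (192, 384, 3, 2), (384, 768, 1, 1)]

for name, stem in [("patchify", patchify), ("conv", conv_stem)]:
    p, tokens = stem_summary(stem)
    print(f"{name}: {p:,} params, {tokens} tokens")
```

Because both stems end at the same 14×14 token grid, swapping one for the other leaves the transformer body untouched; only the cost of the stem itself changes, which the paper offsets by dropping one transformer block.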
To compare the stability of ViT models using the original patchify (P) stem (ViTP) against those using the convolutional (C) stem (ViTC), the team conducted experiments on the standard ImageNet-1k dataset and reported top-1 error. They chose RegNetY, a state-of-the-art CNN that is easy to optimize, as their good-stability baseline.
In experiments testing how rapidly networks converge to their asymptotic error, ViTC converged faster than ViTP on all of the 50-, 100-, and 200-epoch schedules. The shortest training schedule (50 epochs) showed the most significant improvement: ViTP-1GF reached roughly ten percent top-1 error, while ViTC-1GF reduced this to about six percent.
The researchers also explored how well AdamW and SGD optimize ViT models with the two stem types. The results indicate that ViTP models suffer a dramatic performance drop when trained with SGD across all settings, while ViTC models exhibit much smaller error gaps between SGD and AdamW across all training schedules and model complexities.
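The mechanics behind the two optimizers help explain why such gaps can appear at all. Below is a minimal pure-Python sketch of the SGD-with-momentum and AdamW update rules applied to a toy 1-D quadratic; it illustrates only the update rules themselves, not the paper’s training setup, and all hyperparameters are illustrative defaults.

```python
import math

def sgd_momentum(grad, w, steps=300, lr=0.1, mu=0.9):
    """Heavy-ball SGD: accumulate a momentum buffer, step against it."""
    buf = 0.0
    for _ in range(steps):
        buf = mu * buf + grad(w)
        w -= lr * buf
    return w

def adamw(grad, w, steps=300, lr=0.1, b1=0.9, b2=0.999,
          eps=1e-8, wd=0.0):
    """AdamW: Adam's bias-corrected moment estimates plus
    decoupled weight decay applied directly to the weights."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first moment (mean)
        v = b2 * v + (1 - b2) * g * g      # second moment (uncentered var)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2(w - 3);
# both optimizers should approach the minimum at w = 3.
grad = lambda w: 2.0 * (w - 3.0)
print(sgd_momentum(grad, 0.0))
print(adamw(grad, 0.0))
```

The key structural difference is that AdamW rescales each step by a running estimate of gradient magnitude, while SGD applies the raw (momentum-smoothed) gradient; models whose gradient scales vary sharply across layers, as the paper suggests happens downstream of the patchify stem, will feel that difference far more than a well-conditioned CNN.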
Further, results from the peak-performance experiments confirmed that ViTC’s convolutional stem improves not only optimization stability but also model accuracy, delivering roughly 1–2 percent top-1 accuracy gains on ImageNet-1k while maintaining flops and runtime.
Overall, the study shows that simply replacing the ViT patchify stem with a standard convolutional stem in early visual processing results in marked improvements in terms of optimizer stability and final model accuracy.
The paper Early Convolutions Help Transformers See Better is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang