Vision transformers (ViTs) have achieved compelling performance across many computer vision tasks, often outperforming classical convolutional architectures. A question arises: Is the impressive performance of ViTs due to their powerful transformer architecture and attention mechanisms, or is there some other factor that gives ViTs their edge?
In the paper Patches Are All You Need, which is currently under double-blind review for the International Conference on Learning Representations (ICLR 2022), a research team proposes ConvMixer, an extremely simple model (about 6 lines of dense PyTorch code) designed to support the hypothesis that ViT performance is mainly attributable to the use of patches as the input representation. The study shows that ConvMixer can outperform ViTs, MLP-Mixers and classical vision models.
Andrej Karpathy, Tesla’s Senior Director of AI, tweeted on October 7, 2021: “Errr ok wow, I am shook by the new ConvMixer architecture… ‘the first model that achieves the elusive dual goals of 80%+ ImageNet top-1 accuracy while also fitting into a tweet.’”
ConvMixer comprises a patch embedding layer followed by repeated applications of a simple fully-convolutional block. The ConvMixer block itself consists of a depthwise convolution (wrapped in a residual connection) followed by a pointwise convolution, with each convolution followed by an activation and post-activation BatchNorm.
As the name suggests, the general idea behind ConvMixer is mixing. The researchers use depthwise convolution to mix spatial locations and pointwise convolution to mix channel locations. They also use convolutions with an unusually large kernel size to mix distant spatial locations, enabling them to observe the effects of the patch representation itself in contrast to the conventional pyramid-shaped design of convolutional networks.
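The design described above can be sketched in a few lines of PyTorch. The sketch below follows the structure in the paper (patch embedding, then depth repetitions of a residual depthwise convolution followed by a pointwise convolution, each with activation and BatchNorm); the GELU activation and the default hyperparameter values are taken from the paper's reference code, while the exact argument names here are illustrative:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module with a skip connection: out = fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: a strided convolution that maps each
        # patch_size x patch_size patch to a dim-dimensional vector.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Depthwise convolution (groups=dim) mixes spatial locations;
            # the large kernel lets distant locations interact.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Pointwise (1x1) convolution mixes channels.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

In this naming scheme, ConvMixer-1536/20 would correspond to `dim=1536, depth=20`. Note that `padding="same"` requires PyTorch 1.9 or later.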
In their empirical study, the team evaluated ConvMixer on ImageNet-1k classification without any pretraining or additional data. They added ConvMixer to the timm framework and used RandAugment, mixup, CutMix, random erasing and gradient norm clipping in addition to default timm augmentation.
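Of the augmentations listed, mixup is simple to state precisely: each training input is replaced by a convex combination of two inputs, and its label by the same combination of their one-hot labels. A minimal NumPy sketch for illustration (the experiments use timm's implementation; the function name and the α default here are assumptions):

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Blend random pairs of examples and their one-hot labels.

    x: inputs of shape (N, ...); y: one-hot labels of shape (N, C).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))      # random pairing within the batch
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed
```

Because each mixed label is a convex combination of one-hot vectors, its entries still sum to one, so it can be used directly with a cross-entropy loss over soft targets.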
In the experiments, a ConvMixer-1536/20 with 52M parameters achieved 81.4 percent top-1 accuracy on ImageNet, and a ConvMixer-768/32 with 21M parameters reached 80.2 percent. Moreover, despite its extreme simplicity, ConvMixer outperformed both “standard” computer vision models such as ResNet and corresponding vision transformer and MLP-Mixer variants.
Overall, the results suggest that patch representation itself may be the component most responsible for the outstanding performance of ViTs. The team believes their work can provide a strong “convolutional-but-patch-based” baseline for comparing future advanced architectures.
Author: Hecate He | Editor: Michael Sarazen