This year marks the 10th anniversary of the epoch-making paper ImageNet Classification with Deep Convolutional Neural Networks and the associated model's convincing victory in the ImageNet Large Scale Visual Recognition Challenge. The computer vision field was revitalized by these powerful convolutional neural networks (ConvNets), which evolved at a rapid pace over the decade.
In recent years, vision transformer (ViT) architectures — particularly hierarchical designs such as Swin Transformers — have successfully challenged ConvNets as researchers' favoured generic backbone for computer vision tasks such as image classification, as they are considered more accurate, efficient, and scalable.
In the new paper A ConvNet for the 2020s, a team from Facebook AI Research and UC Berkeley investigates the architectural distinctions between ConvNets and transformers. They take a standard ResNet model designed for computer vision tasks and gradually "modernize" its architecture toward that of a hierarchical ViT to explore how design decisions in transformers impact ConvNets' performance. The team leverages their findings to propose ConvNeXts, a family of pure ConvNet models that achieves performance comparable with state-of-the-art hierarchical ViTs on various computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets.
The introduction of ViTs in 2020 revolutionized network architecture design. With the help of larger model and dataset sizes, ViTs surpassed standard ResNets by a significant margin. Hierarchical models such as the Swin Transformer were a milestone work in this regard, demonstrating for the first time that transformers can be adopted as a generic vision backbone and achieve state-of-the-art performance across a wide range of computer vision tasks.
The impressive achievements of hierarchical transformers, however, come at a cost: their implementation can be expensive due to their "sliding window" self-attention mechanism (enabling attention within local windows), and although advanced approaches can be used to optimize speed, doing so makes system design much more complex.
The researchers point out that many transformer advancements for computer vision have aimed at bringing back convolutions, and that, ironically, ConvNets already satisfy most of these desired properties. A question naturally arises: Is it possible to retain the simplicity and efficiency of standard ConvNets while competing favourably with state-of-the-art hierarchical ViTs on computer vision benchmarks?
To answer this question, the team explores the architectural distinctions between ConvNets and transformers to determine how design decisions in transformers can impact ConvNet performance. Based on their discoveries of key components that contribute to the performance differences, they propose ConvNeXts, models constructed entirely from standard ConvNet modules.
The team first trains baseline ResNet-50/200 models using a ViT-style training procedure, employing modern training techniques such as the AdamW optimizer, Mixup, and label smoothing, to determine to what extent these alone can enhance model performance. This enhanced training scheme boosts ResNet-50 accuracy from 76.1 percent to 78.8 percent, indicating that training techniques are one factor contributing to the performance differences between traditional ConvNets and ViTs.
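The ingredients of such a modernized training recipe can be sketched in PyTorch. This is an illustrative sketch, not the paper's exact code: the model is a stand-in, the hyperparameter values are common modern defaults rather than the paper's reported settings, and `mixup` is a minimal hand-rolled version of the technique.

```python
import torch
import torch.nn as nn

# Stand-in model (a real setup would use a ResNet-50); illustrative only.
model = nn.Linear(8, 10)

# AdamW optimizer; lr and weight_decay are typical modern values, not
# necessarily the paper's exact hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)

# Label smoothing via the built-in CrossEntropyLoss argument.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def mixup(x, y_onehot, alpha=0.8):
    """Mixup: convex combination of pairs of samples and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[idx]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mixed, y_mixed
```

In a training loop, each batch would be passed through `mixup` before the forward pass, with the smoothed cross-entropy applied to the mixed soft labels.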
The team then analyzes Swin Transformers' macro network design with regard to the stage compute ratio and the "stem cell" structure. They observe that an appropriate stage compute ratio can improve model accuracy, and that the stem cell in a ResNet may be substituted with a simpler "patchify" layer, resulting in performance similar to that of ViTs.
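The "patchify" idea can be illustrated in a few lines of PyTorch: where a ResNet stem combines a strided 7×7 convolution with max pooling, a ViT-style patchify stem is a single non-overlapping convolution whose kernel size equals its stride. A minimal sketch (the channel count of 96 is illustrative):

```python
import torch
import torch.nn as nn

# ViT-style "patchify" stem: a 4x4 convolution with stride 4, so each output
# position sees exactly one non-overlapping 4x4 patch of the input image.
patchify = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)   # a standard ImageNet-sized input
out = patchify(x)
print(out.shape)                   # torch.Size([1, 96, 56, 56])
```

Both stems downsample by 4× overall, so the patchify layer is a drop-in replacement at the same resolution.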
The team also investigates several other architectural differences at the macro and micro scales, such as increasing the kernel size, using fewer activation functions and fewer normalization layers, and separating the downsampling layers.
The team's explorations led to a number of design decisions that contributed to the proposed ConvNeXt model, such as changing the stage compute ratio and using a "patchify stem" (4×4 non-overlapping convolution) in the network, employing the ResNeXt design to improve the FLOPs/accuracy trade-off, using inverted bottlenecks, a 7×7 depthwise conv and a single GELU activation in each block, using one LayerNorm for normalization in each residual block, and using separate downsampling layers.
To evaluate the proposed ConvNeXt models, the team conducted experiments on a number of computer vision benchmark tasks, including image classification on the ImageNet-1K dataset, object detection and segmentation on COCO, and semantic segmentation on ADE20K.
In the tests, the proposed ConvNeXts obtained 87.8 percent ImageNet top-1 accuracy, outperformed Swin Transformers on COCO detection and ADE20K segmentation, and achieved performance competitive with ViTs in terms of accuracy and scalability. The researchers believe their results will challenge several widely held views and may prompt researchers to rethink the importance of convolutions in computer vision.
Author: Hecate He | Editor: Michael Sarazen