In the realm of computer vision, Convolutional Neural Networks (ConvNets) have long been the standard for achieving top-notch performance in various benchmarks. However, recent years have witnessed the emergence of Vision Transformers (ViTs) as a formidable contender, gradually edging out ConvNets. Many experts argue that ConvNets excel with small to moderately sized datasets, but when confronted with web-scale datasets, Vision Transformers have the upper hand.
In a new paper ConvNets Match Vision Transformers at Scale, a Google DeepMind research team challenges the prevailing belief that Vision Transformers possess superior scaling capabilities compared to ConvNets. The team conducts a comprehensive evaluation of a pure ConvNet architecture, known as the NFNet model, pre-trained on a large-scale dataset. The results reveal that ConvNets can indeed hold their own against Vision Transformers at scale.
The team trained a range of NFNet models of varying depth and width on JFT-4B, a dataset of approximately 4 billion labeled images spanning 30,000 classes. After fine-tuning the pre-trained models for 50 epochs on ImageNet, they found that ImageNet Top-1 error consistently improved as the compute used during pre-training increased. The largest model, denoted F7+, achieved performance on par with that reported for pre-trained ViTs at a comparable compute budget, reaching an ImageNet Top-1 accuracy of 90.3%.
To clarify the relationship between validation loss and pre-training compute, the team plotted each model's validation loss at the end of training against the compute budget required to train it. The plot shows a clear linear trend on log-log axes, consistent with a log-log scaling law (that is, a power law) between validation loss and pre-training compute. As the compute budget increased, so did both the optimal model size and the optimal epoch budget. The results also suggest a reliable rule of thumb for scaling ConvNets: scale the model size and the number of training epochs at the same rate.
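A straight line on log-log axes corresponds to a power law of the form loss ≈ a · compute^b, since taking logarithms of both sides gives log(loss) = log(a) + b · log(compute). The sketch below shows one common way to estimate such a fit, using ordinary least squares in log space. It is only an illustration: the function name `fit_power_law` and the data points are hypothetical, and none of the numbers come from the paper.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**b by least squares in log-log space.

    Hypothetical helper for illustration; not taken from the paper.
    Returns the pair (a, b).
    """
    # A power law is a straight line after taking logs of both axes:
    # log(loss) = log(a) + b * log(compute)
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, made-up data that follows an exact power law.
# These values are NOT results from the paper.
compute = np.array([1e3, 1e4, 1e5, 1e6])
loss = 5.0 * compute ** -0.1

a, b = fit_power_law(compute, loss)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic data follow an exact power law, the fit recovers the coefficient and exponent used to generate them. Real validation losses would scatter around the fitted line.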
The researchers also investigated the optimal learning rate for three models from the NFNet family (F0, F3, F7+) across a range of epoch budgets. All three showed a similar optimal learning rate (𝛼 ≈ 1.6) at small epoch budgets. However, as the epoch budget grew, the optimal learning rate fell, and it fell more quickly for larger models.
In summation, this research reinforces a fundamental truth in the world of computer vision: the principal factors governing the performance of a sensibly designed model are the computational resources and the volume of data available for training. It is clear from this work that ConvNets, specifically the NFNet architecture, possess the capability to compete with Vision Transformers at a scale previously thought to be their domain. The results underscore the significance of scaling compute and data resources in tandem, shedding new light on the future of computer vision research.
The paper ConvNets Match Vision Transformers at Scale is on arXiv.
Author: Hecate He | Editor: Chain Zhang