New research from UK-based Google subsidiary DeepMind demonstrates that image recognition models can be trained without batch normalization layers while still reaching state-of-the-art accuracy. The study introduces a novel gradient clipping algorithm to produce models that match or exceed the classification accuracies of top batch-normalized models on large-scale datasets while also significantly reducing training time.
Batch normalization is a favourite technique for training deep residual networks (ResNets) due to its ability to accelerate training, enable higher learning rates, and improve generalization accuracy. Although batch normalization has proliferated through the deep learning research community, it suffers from three significant practical disadvantages: it is expensive in both memory and compute; it introduces discrepancies between model behaviours at training and inference time, necessitating additional fine-tuning; and it breaks the independence between training examples in the minibatch.
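The last two drawbacks can be seen directly in how batch normalization computes its statistics. The following minimal NumPy sketch (illustrative, not the paper's code) contrasts training mode, which normalizes with the current minibatch's statistics, against inference mode, which uses stored running estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm_train(x, eps=1e-5):
    # Training mode: normalize with statistics of the current minibatch,
    # so each example's output depends on the other examples in the batch.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def batchnorm_infer(x, running_mu, running_var, eps=1e-5):
    # Inference mode: normalize with running (population) statistics instead.
    return (x - running_mu) / np.sqrt(running_var + eps)

# A minibatch whose statistics differ from the running estimates.
batch = rng.normal(loc=2.0, scale=3.0, size=(8, 4))
train_out = batchnorm_train(batch)
infer_out = batchnorm_infer(batch, running_mu=np.zeros(4), running_var=np.ones(4))

# The same inputs produce different outputs in the two modes.
print(np.max(np.abs(train_out - infer_out)))
```

Because the training-mode output of one example changes whenever the rest of the minibatch changes, per-example independence is lost, which is exactly the property Normalizer-Free networks aim to restore.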
Although some recent studies have succeeded in training deep ResNets without batch normalization, the resulting models are often unstable for large learning rates or strong data augmentations, and their performance cannot compete with SOTA batch-normalized networks.
To address these weaknesses, the DeepMind team designed a family of Normalizer-Free ResNets (NFNets) that can be trained with larger batch sizes and stronger data augmentations, and that have set new SOTA validation accuracies on ImageNet.
The researchers’ strategy for training Normalizer-Free Networks with larger batch sizes and stronger data augmentations is based largely on their Adaptive Gradient Clipping (AGC), “a relaxation of normalized optimizers,” which clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
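In rough terms, AGC rescales a unit's gradient whenever the ratio of its gradient norm to its parameter norm exceeds a threshold λ. A minimal NumPy sketch of this unit-wise rule follows; the function names and the λ = 0.02 default are illustrative, and the paper's implementation differs in details:

```python
import numpy as np

def unitwise_norm(x):
    # Norm taken per "unit": for a weight matrix or conv kernel, each
    # output unit (first axis) gets its own norm; vectors are element-wise.
    if x.ndim <= 1:
        return np.abs(x)
    return np.sqrt(np.sum(x ** 2, axis=tuple(range(1, x.ndim)), keepdims=True))

def adaptive_grad_clip(grad, param, clip=0.02, eps=1e-3):
    # Clip each unit's gradient when ||g_i|| / max(||w_i||, eps) exceeds `clip`.
    w_norm = np.maximum(unitwise_norm(param), eps)
    g_norm = np.maximum(unitwise_norm(grad), 1e-6)
    max_norm = clip * w_norm
    # Rescale only the units whose gradient norm is too large.
    scale = np.where(g_norm > max_norm, max_norm / g_norm, 1.0)
    return grad * scale
```

Unlike clipping by a fixed global norm, the threshold here adapts to each unit's parameter scale, which is what lets training remain stable at large batch sizes without normalization layers.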
To assess AGC efficiency, the researchers used a range of ablations comparing batch-normalized ResNets to NF-ResNets with and without AGC. The results show that AGC efficiently scales NF-ResNets to larger batch sizes.
Building on AGC, the researchers trained a family of Normalizer-Free architectures (NFNets) built on an SE-ResNeXt-D model — a strong baseline for Normalizer-Free Networks — with modified width, modified depth patterns, and a second spatial convolution. They applied AGC to every parameter except the linear weight of the classifier layer.
In experiments, the researchers compared their NFNet models’ accuracy with a set of representative models — SENet (Hu et al., 2018), LambdaNet (Bello, 2021), BoTNet (Srinivas et al., 2021), and DeiT (Touvron et al., 2020) — on ImageNet. To validate the Normalizer-Free networks’ suitability for transfer learning after large-scale pretraining, they also tested NFNets in the transfer learning regime via pretraining on a dataset of 300 million labelled images.
The proposed NFNet-F5 model attained a top-1 validation accuracy of 86.0 percent, outperforming the previous SOTA model, EfficientNet-B8. The NFNet-F1 model matched EfficientNet-B7’s performance while being 8.7 times faster to train. In addition, Normalizer-Free models outperformed their batch-normalized counterparts when fine-tuned on ImageNet after large-scale pretraining, obtaining an accuracy of 89.2 percent.
The paper High-Performance Large-Scale Image Recognition Without Normalization is on arXiv.
Author: Hecate He | Editor: Michael Sarazen