As Facebook struggles with fallout from the Cambridge Analytica scandal, its research arm today delivered a welcome bit of good news in deep learning. Research Engineer Dr. Yuxin Wu and Research Scientist Dr. Kaiming He proposed a new Group Normalization (GN) technique they say can accelerate deep neural network training with small batch sizes.
Although deep learning thrives on complex neural networks and large datasets, training a model demands substantial time and computing power. This has prompted AI researchers to rethink the normalization techniques they use to reduce training costs.
Facebook AI Research had already taken a few steps forward. Last June, it proposed an accurate, large-minibatch SGD technique that can train ResNet-50 with a minibatch size of 8,192 on 256 GPUs in only one hour, while matching small-minibatch accuracy.
The mainstream normalization technique for almost all convolutional neural networks today is Batch Normalization (BN), which has been widely adopted across deep learning. Proposed by Google in 2015, BN not only accelerates a model's convergence, but also alleviates problems such as vanishing gradients in deep networks, making models easier to train.
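In rough terms, BN standardizes each channel using statistics computed across the whole minibatch, which is why its behavior depends on batch size. The following is a minimal NumPy sketch, not Google's implementation; the (N, C, H, W) tensor layout is an assumption for illustration, and the learnable scale/shift parameters of full BN are omitted.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Minimal Batch Normalization sketch for an (N, C, H, W) tensor.

    Per-channel statistics are computed across the batch and spatial
    axes (N, H, W), so small batches yield noisy estimates.
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# With only a few samples per batch, these statistics become unreliable:
x = np.random.randn(2, 64, 8, 8)
y = batch_norm(x)
```

Because the mean and variance are shared across the batch, every sample's normalization depends on which other samples happen to be in the minibatch.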
Dr. Wu and Dr. He, however, argue in their paper Group Normalization that normalizing along the batch dimension has limitations: BN's accuracy degrades when the batch size becomes small. As a result, researchers today normalize with large batches, which is memory intensive, and the memory cost leaves them little room to explore higher-capacity models.
Dr. Wu and Dr. He believe their new GN technique is a simple but effective alternative to BN. Specifically, GN divides the channels (the feature maps, which can be viewed as 3D chunks of data) into groups and normalizes the features within each group. GN computes its statistics only along the channel and spatial dimensions, so its computation is independent of batch size.
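The grouping can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions (an (N, C, H, W) layout and 32 groups, the paper's default), not the authors' implementation, and it omits the learnable per-channel scale and shift.

```python
import numpy as np

def group_norm(x, num_groups=32, eps=1e-5):
    """Minimal Group Normalization sketch for an (N, C, H, W) tensor."""
    n, c, h, w = x.shape
    # Split the C channels into groups; each group is normalized
    # using statistics from that group's channels and spatial positions
    # of a single sample -- the batch axis is never touched.
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(n, c, h, w)

# The result is identical per-sample whether the batch holds 2 or 256 samples:
x = np.random.randn(2, 64, 8, 8)
y = group_norm(x, num_groups=32)
```

Because each sample is normalized independently, shrinking the batch size does not change the statistics, which is exactly the property BN lacks.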
The idea of GN was inspired by many classical image features like SIFT and HOG, which involve group-wise normalization. The paper states, “For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram.”
The paper reports that GN achieved a 10.6% lower error rate than its BN counterpart for ResNet-50 on ImageNet with a batch size of 2 samples, and matched BN's performance while outperforming other normalization techniques at regular batch sizes. It is worth noting that Dr. He was the main contributor to the development of ResNet (Deep Residual Network).
GN also outperformed BN on other neural networks, such as Mask R-CNN for COCO object detection and segmentation, and 3D convolutional networks for Kinetics video classification.
GN is not the first attempt to replace BN. Layer Normalization (LN), proposed in 2016 by a University of Toronto team led by Dr. Geoffrey Hinton, and Instance Normalization (IN), proposed by Russian and UK researchers, are also alternatives that avoid normalizing along the batch dimension. While LN and IN are effective for training sequential models such as RNNs/LSTMs or generative models such as GANs, GN appears to present better results in visual recognition.
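The three batch-independent techniques differ only in which axes the statistics are computed over, a point the paper uses to position GN between LN and IN. A hedged NumPy sketch of that relationship, with tensor shapes assumed for illustration:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Normalize x over the given axes; the choice of axes is the only
    difference between LN, IN, and GN."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 16, 8, 8)  # (N, C, H, W)

ln = normalize(x, axes=(1, 2, 3))  # Layer Norm: all channels of one sample
inorm = normalize(x, axes=(2, 3))  # Instance Norm: one channel of one sample
# Group Norm sits in between: a group of channels of one sample.
# None of the three touches the batch axis (axis 0), unlike BN.
```

LN with one group per layer and IN with one group per channel are the two extremes; GN's group count interpolates between them.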
Journalist: Tony Peng | Editor: Michael Sarazen