Researchers from Japanese electronics giant Sony have trained the ResNet-50 neural network model on ImageNet in a record-breaking 224 seconds — 43.4 percent better than the previous fastest time for the benchmark task.
Large-scale deep learning faces two main challenges: training with large mini-batches can become unstable, and gradient synchronization grows burdensome as communication among GPUs demands more bandwidth. ResNet-50 is a deep residual learning architecture for image recognition that is trained on ImageNet and widely used to benchmark large-scale cluster computing capability. ImageNet is an open-source database for object recognition research. The ImageNet Large Scale Visual Recognition Challenge provides 1,281,167 images for training, 50,000 for validation, and 100,000 for testing.
Sony researchers applied batch size control and 2D-Torus all-reduce to overcome these problems, gradually increasing the total mini-batch size to flatten the loss landscape and avoid sharp local minima. The researchers also proposed the 2D-Torus all-reduce communication topology, which arranges GPUs in a two-dimensional grid and performs collective operations along its two orientations: first a reduce-scatter horizontally, then an all-reduce vertically, and finally an all-gather horizontally. With 2D-Torus all-reduce, communication overhead is lower than with Ring all-reduce.
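The three collective steps above can be illustrated with a minimal single-process sketch. This is an assumption-laden toy: a 2×2 grid of "workers" each holding a small gradient vector stands in for the real multi-GPU cluster, and the grid shape and data are made up for illustration.

```python
import numpy as np

# Hypothetical 2x2 worker grid; each worker (x, y) holds a gradient vector.
X, Y = 2, 2
rng = np.random.default_rng(0)
grads = {(x, y): rng.standard_normal(4) for x in range(X) for y in range(Y)}
expected = sum(grads.values())  # what a correct all-reduce must produce

# Step 1: reduce-scatter horizontally -- each worker in a row ends up
# holding the row-sum of one chunk (chunk x of X chunks) of the gradient.
chunks = {}
for y in range(Y):
    row_sum = sum(grads[(x, y)] for x in range(X))
    for x in range(X):
        chunks[(x, y)] = np.array_split(row_sum, X)[x]

# Step 2: all-reduce vertically -- workers in the same column sum their
# chunks, so every worker now holds the global sum of its chunk.
for x in range(X):
    col_sum = sum(chunks[(x, y)] for y in range(Y))
    for y in range(Y):
        chunks[(x, y)] = col_sum

# Step 3: all-gather horizontally -- each row reassembles the complete
# globally summed gradient from the chunks spread across its workers.
result = {}
for y in range(Y):
    full = np.concatenate([chunks[(x, y)] for x in range(X)])
    for x in range(X):
        result[(x, y)] = full

# Every worker now holds the same fully reduced gradient.
assert all(np.allclose(result[k], expected) for k in result)
```

The efficiency gain comes from the horizontal reduce-scatter shrinking the payload each worker must move in the vertical phase, so less data crosses the slower links than in a single flat ring.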
Researchers used 2,176 Tesla V100 GPUs for ResNet-50 training and achieved 75.03 percent validation accuracy. They also sought to improve GPU scaling efficiency without significantly reducing accuracy, achieving 91.62 percent GPU scaling efficiency with 918 Tesla V100 GPUs.
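GPU scaling efficiency is conventionally the measured throughput on N GPUs divided by N times the single-GPU throughput. A short sketch of that arithmetic, where the throughput figures are hypothetical (only the GPU count comes from the article):

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Fraction of ideal linear speedup achieved on n GPUs."""
    return throughput_n / (n * throughput_1)

# Made-up throughputs in images/sec, chosen only to illustrate the formula.
eff = scaling_efficiency(throughput_n=1_680_000, throughput_1=2_000, n=918)
print(f"{eff:.2%}")  # -> 91.50%
```

Efficiency below 100 percent reflects communication and synchronization overhead, which is exactly what the 2D-Torus topology is designed to reduce.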
Sony’s cluster topology innovations have dramatically reduced training time, and the potential for further high-performance computing improvement is high. A combination of strong growth in GPU performance, reduction in GPU communication cost, and future cluster topology solutions will likely continue to reduce ResNet-50 training time on ImageNet.
The paper ImageNet/ResNet-50 Training in 224 Seconds is on arXiv.
Author: Alex Chen | Editor: Michael Sarazen