The performance of contemporary AI systems on natural language processing (NLP) tasks would have been difficult to imagine just a few years ago. The 2018 debut of the massive language model BERT was a game-changer, but even BERT-large’s 340 million parameters have since been eclipsed by OpenAI’s GPT-3 with its 175 billion parameters.
Training these massive deep learning models is, however, very computationally expensive, requiring hundreds or even thousands of GPUs wired with specialized high-bandwidth interconnects. This dependence on specialized hyperclusters has become a bottleneck for training such models.
In the new paper Varuna: Scalable, Low-cost Training of Massive Deep Learning Models, a Microsoft Research India team introduces an approach for training massive deep learning models over commodity networking. Dubbed Varuna, the system eliminates the need for specialized hyperclusters and alleviates the cost, scale, and resource utilization challenges of deep learning model training.
The team summarizes their contributions as:
- We challenge the pervasive belief (and practice) that massive models can be trained only on specialized hyperclusters, by presenting the first system that is capable of training massive deep learning models on spot VMs with commodity networking, achieving 4-5x lower cost of training these models.
- We argue and demonstrate that intra-layer partitioning is not only ill-suited for commodity networking but is also not the best-performing option even on hyperclusters.
- We introduce a novel concept of correctness-preserving job morphing to automatically reconfigure a running DLT job, to adapt to changing number of GPUs, using a combination of data and model parallelism.
- We demonstrate the efficacy of this approach by efficiently scaling to a 200 billion parameter model, and showing significant speedups on other large models such as BERT-large and Megatron-8.3B. We also demonstrate that despite using a large batch size, Varuna achieves state-of-the-art accuracy on a 2.5 billion parameter GPT-2 model.
Varuna’s high-level architecture combines data parallelism with pipeline model parallelism. For data parallelism, the model is split into several partitions, each with multiple replicas that run like a data-parallel job. For pipeline model parallelism, each stage receives input activations from the previous stage, performs a forward pass computation, then sends the output activations to the next stage. Each stage also receives gradients from the next stage, performs a backward pass computation, then sends input gradients back to the previous stage.
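The stage-to-stage flow of activations and gradients described above can be sketched in a few lines. This is a toy illustration with scalar "layers", not Varuna's actual API; the `Stage` class and its methods are invented for clarity.

```python
# Toy sketch of pipeline model parallelism: activations flow forward
# through stages, gradients flow backward. In a real system each stage
# would run on a different GPU; here everything is in one process.

class Stage:
    """One pipeline stage holding a slice of the model (a scalar weight)."""
    def __init__(self, weight):
        self.weight = weight      # toy stand-in for this stage's layers
        self.last_input = None    # activation cached for the backward pass

    def forward(self, activation):
        # Receive activations from the previous stage, compute, pass on.
        self.last_input = activation
        return activation * self.weight

    def backward(self, grad_output):
        # Receive gradients from the next stage, compute the gradient for
        # this stage's weight and the gradient sent to the previous stage.
        grad_weight = grad_output * self.last_input
        grad_input = grad_output * self.weight
        return grad_input, grad_weight

# A three-stage pipeline.
stages = [Stage(2.0), Stage(3.0), Stage(0.5)]

# Forward pass: activations flow stage to stage.
act = 4.0
for s in stages:
    act = s.forward(act)

# Backward pass: gradients flow in reverse order.
grad = 1.0
weight_grads = []
for s in reversed(stages):
    grad, gw = s.backward(grad)
    weight_grads.append(gw)

print(act)           # final activation of the forward pass
print(weight_grads)  # per-stage weight gradients, last stage first
```

In Varuna, many microbatches would be in flight at once so that stages stay busy, but the per-stage forward/backward contract is the same.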
The researchers reduce cost by harnessing low-priority virtual machines (VMs) that are 4-5x cheaper than dedicated GPU VMs. Varuna uses a novel job-morphing technique to dynamically configure a job to run at best performance with available resources and employs scale-invariant calibration and parametrized simulation to identify the best configuration.
Varuna also applies auto-partitioning and provides a tracer that detects and tracks cross-partition dependencies to improve its ease-of-use for programmers and developers.
The team compared Varuna against prior systems on BERT-large and on GPT-2 models with 2.5 billion, 8.3 billion, 20 billion, and 200 billion parameters, showing how Varuna navigates the dynamism of spot VM availability while maintaining high training performance, and how its performance gains translate directly into faster time-to-convergence.
In the evaluations, Varuna improved performance by up to 18x compared to state-of-the-art approaches on commodity networking. Moreover, despite the commodity networking across these “low priority” VMs, Varuna also outperformed state-of-the-art approaches running on specialized hyperclusters by 20 to 78 percent.
Overall, the study shows that Varuna is able to train large-scale deep learning models with low cost, high performance, and at higher scale, while eliminating the dependency on specialized hyperclusters. The researchers believe Varuna could significantly accelerate the pace of innovation in large-scale deep learning models.
Author: Hecate He | Editor: Michael Sarazen