Deep learning (DL) model size is growing exponentially, from fewer than 100 million parameters in 2017’s largest language model to 175 billion parameters in 2020’s GPT-3. Training these large models, however, has become extremely expensive and inaccessible to all but a few AI researchers and institutions.
In an effort to democratize the process, researchers from the University of California, Merced and Microsoft have introduced ZeRO-Offload, a novel heterogeneous DL training technology that enables the training of multi-billion-parameter models on a single GPU without any model refactoring.
Many studies have used heterogeneous DL training to reduce GPU memory requirements by exploiting CPU memory, but these target activation memory on smaller CNN-based models. In attention-based large model training, the memory challenge lies instead in the model states (parameters, gradients and optimizer states), and there has been little research on exploiting CPU compute as well as CPU memory. The researchers explain that ZeRO-Offload exploits both CPU memory and CPU compute for offloading, and that it offers a clear path toward efficient scaling on multiple GPUs by working with ZeRO-powered data parallelism.
“Efficiency, scalability and usability” inform the ZeRO-Offload design. The researchers identify a unique optimal strategy for partitioning computation and data between CPU and GPU: gradients, optimizer states and optimizer computation are offloaded to the CPU, while parameters and the forward and backward computation stay on the GPU. The approach achieves a 10x increase in model size with minimal communication and limited CPU computation. It enables training a 13B-parameter model on a single NVIDIA V100 GPU at 40 TFLOPS, compared with 30 TFLOPS on the same GPU for a 1.2B-parameter model, the largest trainable without CPU offloading.
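To make the partitioning concrete, here is a rough back-of-the-envelope sketch of where model-state memory ends up with and without offloading. The per-parameter byte counts are assumptions based on typical mixed-precision Adam training (2 bytes for fp16 parameters, 2 for fp16 gradients, 12 for fp32 optimizer states); they are not figures from the article. The function name is hypothetical, and activation memory is ignored.

```python
# Assumed byte costs per parameter for mixed-precision Adam training.
# These are illustrative assumptions, not numbers reported in the article.
FP16_PARAM_BYTES = 2
FP16_GRAD_BYTES = 2
FP32_OPTIMIZER_BYTES = 12  # fp32 master weights + momentum + variance (4 bytes each)


def model_state_gb(num_params, offload):
    """Return (gpu_gb, cpu_gb) for model states only; activations are ignored."""
    if offload:
        # Parameters stay on the GPU; gradients and optimizer states go to the CPU.
        gpu_bytes = num_params * FP16_PARAM_BYTES
        cpu_bytes = num_params * (FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES)
    else:
        # Everything lives on the GPU.
        gpu_bytes = num_params * (
            FP16_PARAM_BYTES + FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
        )
        cpu_bytes = 0
    return gpu_bytes / 1e9, cpu_bytes / 1e9


for label, n in [("1.2B", 1_200_000_000), ("13B", 13_000_000_000)]:
    for offload in (False, True):
        gpu_gb, cpu_gb = model_state_gb(n, offload)
        print(f"{label} offload={offload}: GPU {gpu_gb:.1f} GB, CPU {cpu_gb:.1f} GB")
```

Under these assumptions, a 13B-parameter model needs about 208 GB of model states on the GPU without offloading, but only about 26 GB of GPU memory once gradients and optimizer states move to the CPU. That is the kind of gap that lets a model of this size fit on a single card.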
Traditional data parallelism is the community standard for scaling DL training to multiple GPUs, but it requires the replication of data and computation, making it unsuitable for heterogeneous training. ZeRO-Offload addresses this by using ZeRO-powered data parallelism and maintaining a single copy of the optimizer states in CPU memory regardless of the data-parallel degree, resulting in promising scalability of up to 128 GPUs.
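The effect of keeping a single copy of the optimizer states can be illustrated with simple arithmetic. The sketch below compares aggregate CPU memory when every data-parallel rank holds a full replica against the case where one copy is split evenly across ranks. The 12 bytes per parameter figure is an assumption (fp32 Adam states), the function name is hypothetical, and the even split is a simplification.

```python
OPTIMIZER_BYTES_PER_PARAM = 12  # assumed: fp32 master weights + momentum + variance


def cpu_optimizer_bytes(num_params, dp_degree, partitioned):
    """Return (bytes_per_rank, aggregate_bytes) of optimizer states held in CPU memory."""
    total = num_params * OPTIMIZER_BYTES_PER_PARAM
    if partitioned:
        # One copy overall, split evenly; assumes total is divisible by dp_degree.
        per_rank = total // dp_degree
    else:
        # Every rank keeps its own full replica.
        per_rank = total
    return per_rank, per_rank * dp_degree


num_params = 1_000_000_000
for dp_degree in (1, 8, 128):
    _, replicated = cpu_optimizer_bytes(num_params, dp_degree, partitioned=False)
    per_rank, single_copy = cpu_optimizer_bytes(num_params, dp_degree, partitioned=True)
    print(
        f"dp={dp_degree}: replicated {replicated / 1e9:.0f} GB, "
        f"single copy {single_copy / 1e9:.0f} GB ({per_rank / 1e9:.3f} GB per rank)"
    )
```

With replication, aggregate memory grows linearly with the number of GPUs. With a single partitioned copy it stays constant, and each rank's share shrinks as more GPUs are added.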
Unlike strategies that require model refactoring, ZeRO-Offload is available as part of DeepSpeed, an open-source library for PyTorch, and can be added to existing training pipelines by changing just a few lines of code.
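As an illustration of what "a few lines" might look like, here is a minimal sketch of a DeepSpeed configuration that turns on ZeRO stage 2 with optimizer offloading to the CPU. The key names follow the DeepSpeed configuration format as commonly documented, but they have changed between releases (early versions used a `cpu_offload` flag), so check them against your installed version. The batch size and learning rate are placeholder values, and `model` in the commented usage is a hypothetical existing `torch.nn.Module`.

```python
import json

# Sketch of a DeepSpeed config enabling ZeRO-Offload; all values are placeholders.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer states and the optimizer step on the CPU.
        "offload_optimizer": {"device": "cpu"},
    },
}

# The same settings can be written to a JSON file and passed on the command line.
print(json.dumps(ds_config, indent=2))

# Hypothetical usage, assuming DeepSpeed is installed and `model` already exists:
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model,
#     model_parameters=model.parameters(),
#     config=ds_config,
# )
```

The rest of the training loop would then call the returned engine for the forward pass, backward pass and optimizer step, with the model definition itself left unchanged.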
ZeRO-Offload’s compute and memory efficiency and ease of use make large-scale model training accessible even to researchers working with a single GPU. The paper “ZeRO-Offload: Democratizing Billion-Scale Model Training” is on arXiv.
Analyst: Reina Qi Wan | Editor: Michael Sarazen