Large-scale transformer-based language models have produced substantial gains in the field of natural language processing (NLP). Training such models, however, is challenging for two reasons: no single GPU has enough memory to hold parameter counts that have grown exponentially in recent years, and even if the parameters could be trained on a single GPU, its limited computing power would result in unrealistically long training times without model parallelism.
In the paper Efficient Large-Scale Language Model Training on GPU Clusters, a research team from NVIDIA, Stanford University and Microsoft Research proposes a novel parallelization schedule that improves throughput by more than 10 percent with a memory footprint comparable to existing approaches. The team also shows that different parallelization strategies can be composed to achieve high aggregate throughput (502 petaFLOP/s) while training large models with up to a trillion parameters.
The researchers first introduce techniques that combine data parallelism with tensor model parallelism and pipeline model parallelism to facilitate the efficient training of large models.
With data parallelism, each “worker” has a copy of the full model, and the input dataset is sharded. The workers periodically aggregate their gradients so they all maintain a consistent version of the weights. In pipeline parallelism, the layers of a model are sharded across multiple devices. Because pipelining schemes must ensure that inputs see consistent weight versions across forward and backward passes, the researchers look at two scheduling approaches: default scheduling and scheduling with interleaved stages.
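The data-parallel step described above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the one-parameter model, the function names and the `all_reduce_mean` stand-in for a real collective operation are all assumptions made for illustration.

```python
# Toy data parallelism: every worker holds the same weight and its own data shard.
# All names here are hypothetical; a real system would use a collective all-reduce.

def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(weights, shards, lr):
    """One synchronous step: local gradients, aggregation, identical update."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]

dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
num_workers = 2
shards = [dataset[i::num_workers] for i in range(num_workers)]  # shard the input
weights = [0.0] * num_workers  # each worker starts from the same full "model"

for _ in range(50):
    weights = data_parallel_step(weights, shards, lr=0.05)

print(weights)
```

Because every worker applies the same averaged gradient, the copies of the weight never diverge, which is the consistency property the paragraph describes.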
The team observes that the default schedule has a high memory footprint, as it requires stashed intermediate activations to be kept in memory. They thus opt for a modified PipeDream-Flush schedule, which is much more memory-efficient. The schedule with interleaved stages reduces the size of the pipeline bubble (the time devices spend idle), but it has a drawback: it requires extra communication.
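To make the pipeline bubble concrete, the sketch below evaluates the bubble fraction given in the paper (not in this article): idle time relative to ideal compute time is (p - 1) / m for p pipeline stages and m microbatches, and shrinks by a factor of v when each device holds v interleaved model chunks. The function name and the example values of p, m and v are assumptions chosen for illustration.

```python
def bubble_fraction(num_stages, num_microbatches, chunks_per_device=1):
    """Pipeline idle time relative to ideal compute time: (p - 1) / (v * m)."""
    if num_stages < 1 or num_microbatches < 1 or chunks_per_device < 1:
        raise ValueError("all arguments must be positive integers")
    return (num_stages - 1) / (chunks_per_device * num_microbatches)

# Illustrative configuration (hypothetical values, not taken from the paper).
default = bubble_fraction(num_stages=8, num_microbatches=32)
interleaved = bubble_fraction(num_stages=8, num_microbatches=32, chunks_per_device=2)

print(f"default schedule bubble:     {default:.3f}")
print(f"interleaved schedule bubble: {interleaved:.3f}")
```

With two chunks per device the bubble halves, but each microbatch now crosses device boundaries more often, which is where the extra communication comes from.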
In tensor model parallelism, individual model layers are partitioned over multiple devices. The proposed method partitions transformer layers, the bedrock of language models, using a strategy inspired by the Megatron project.
The researchers tested their combined pipeline, tensor model and data parallelism approach to determine whether it could improve communication and computation performance when training GPT models ranging in size from a billion to a trillion parameters.
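The idea of partitioning a single layer can be sketched on a two-matrix feed-forward block. In the sketch below, the first weight matrix is split by columns and the second by rows, so each device computes a partial result and one summation recovers the full output. This is a simplified illustration under stated assumptions: the matrices are made-up integers, the loop simulates two devices on one machine, and ReLU is swapped in for the GeLU used in Megatron so the arithmetic stays exact.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU (stand-in for GeLU, chosen to keep integers exact)."""
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1, -2], [3, 4]]                     # input activations (2 x 2)
A = [[1, 0, -1, 2], [2, 1, 0, -1]]        # first weight matrix (2 x 4)
B = [[1, 2], [0, 1], [-1, 0], [2, -2]]    # second weight matrix (4 x 2)

reference = matmul(relu(matmul(X, A)), B)  # unpartitioned result

devices = 2
partials = []
for A_i, B_i in zip(split_columns(A, devices), split_rows(B, devices)):
    Y_i = relu(matmul(X, A_i))             # local compute, no communication
    partials.append(matmul(Y_i, B_i))      # partial output on this device

Z = partials[0]
for p in partials[1:]:
    Z = add(Z, p)                          # stands in for an all-reduce sum

print(Z == reference)
```

Splitting the first matrix by columns lets the element-wise nonlinearity run independently on each device, so the devices only need to communicate once, at the final sum.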
The results show the proposed composition of tensor, pipeline, and data parallelism enables training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, achieving per-GPU throughput of 52 percent of peak, bettering the 36 percent obtained by previous approaches on similar-sized models. The method scales to thousands of GPUs and increases the model size that can be trained efficiently by two orders of magnitude over existing systems.
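The headline figures can be cross-checked with a short calculation. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the implied peak is approximate because the 52 percent figure is rounded.

```python
aggregate_pflops = 502      # aggregate throughput in petaFLOP/s
num_gpus = 3072             # GPUs used for the trillion-parameter run
fraction_of_peak = 0.52     # reported per-GPU utilization

# 1 petaFLOP/s = 1000 teraFLOP/s
per_gpu_tflops = aggregate_pflops * 1000 / num_gpus
implied_peak_tflops = per_gpu_tflops / fraction_of_peak

print(f"per-GPU throughput:   {per_gpu_tflops:.1f} teraFLOP/s")
print(f"implied per-GPU peak: {implied_peak_tflops:.1f} teraFLOP/s")
```

The aggregate figure works out to roughly 163 teraFLOP/s per GPU, which at 52 percent utilization implies a per-GPU peak of a little over 310 teraFLOP/s.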
The code is available on the project GitHub. The paper Efficient Large-Scale Language Model Training on GPU Clusters is on arXiv.
Author: Hecate He | Editor: Michael Sarazen