Pretrained general-purpose language models have achieved astonishing performance on a wide variety of downstream natural language processing (NLP) tasks via zero-shot, few-shot and finetuning techniques. This has enabled real-world applications such as automated question answering, text prediction, classification for spam filtering and for news and information feeds, text inference and generation, and much more.
These large language models, however, have also grown dramatically in size, which makes training them computationally expensive and demands costly high-performance hardware, software, and complex algorithmic techniques. The machine learning research community has responded by exploring efficient parallelism techniques that scale in both memory and compute for large-scale model training.
In the new paper Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, a team from Microsoft and NVIDIA leverages the NVIDIA Megatron-LM large transformer model and Microsoft’s DeepSpeed deep learning optimization library to create an efficient and scalable 3D parallel system that combines data, pipeline, and tensor-slicing based parallelism. The team builds Megatron-Turing NLG 530B (MT-NLG), the world’s largest transformer-based language model with 530 billion parameters (3x more than GPT-3), which achieves superior zero-, one-, and few-shot learning accuracies and new state-of-the-art results on NLP benchmarks.
The proposed 3D parallel system software stack combines pipeline parallelism and data parallelism from DeepSpeed with tensor slicing from Megatron. Data, tensor, and pipeline parallelism are all crucial for improving memory and compute efficiency.
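To see why memory efficiency matters at this scale, the short Python sketch below estimates the memory needed just for model states. The 530 billion parameter count comes from the paper title. The 16 bytes per parameter (half-precision weights and gradients plus full-precision master weights and two Adam moments) is a commonly used accounting that is assumed here, not a figure from this article, and implementations that keep extra full-precision buffers need more. The 80 GB device size is likewise an assumption, and activations are ignored.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# All per-parameter byte counts below are assumptions for illustration.

BYTES_PER_PARAM = {
    "half_precision_weights": 2,
    "half_precision_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def model_state_bytes(num_params: int) -> int:
    """Total bytes for weights, gradients and optimizer states."""
    return num_params * sum(BYTES_PER_PARAM.values())


def min_devices(num_params: int, device_bytes: int) -> int:
    """Smallest device count whose combined memory holds the model states."""
    total = model_state_bytes(num_params)
    return -(-total // device_bytes)  # integer ceiling division


MT_NLG_PARAMS = 530_000_000_000   # from the paper title
ASSUMED_DEVICE_BYTES = 80 * 10**9  # assumption: 80 GB per accelerator

total_tb = model_state_bytes(MT_NLG_PARAMS) / 10**12
print(f"model states: {total_tb:.2f} TB")  # model states: 8.48 TB
print(min_devices(MT_NLG_PARAMS, ASSUMED_DEVICE_BYTES))  # 106
```

Under these assumptions, the model states alone exceed the memory of a hundred accelerators before any activations are stored, which is why the model must be partitioned rather than simply replicated.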
The researchers divide the transformer blocks into pipeline stages and further split the blocks of each stage via tensor parallelism, which simultaneously reduces the memory consumed by the weights, gradients, optimizer states and activations. Data parallelism is then used to replicate this model-parallel unit and scale training out to thousands of GPUs. The 3D parallelism implementation is optimized using topology-aware mapping, which minimizes communication overhead to achieve excellent compute efficiency at scale.
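As a rough illustration of how such a 3D layout can be indexed, the Python sketch below maps a flat GPU rank to (data, pipeline, tensor) coordinates. The function names, the axis ordering and the example degrees are assumptions made for illustration; they are not taken from the paper or from the DeepSpeed or Megatron-LM code. The tensor index varies fastest, so each tensor-parallel group occupies consecutive ranks. When the tensor-parallel degree equals the number of GPUs per node, each group stays inside one node, which is one simple form of topology-aware placement.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    data: int      # which model replica this GPU belongs to
    pipeline: int  # which pipeline stage within the replica
    tensor: int    # which tensor-parallel shard within the stage


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a flat rank to 3D coordinates (hypothetical ordering).

    Assumed ordering: tensor index fastest, then pipeline, then data.
    """
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data=data, pipeline=pipeline, tensor=tensor)


def tensor_group(rank: int, tp: int) -> List[int]:
    """Ranks that share this rank's tensor-parallel group."""
    start = rank - rank % tp
    return list(range(start, start + tp))


# Toy example: 2-way tensor, 2-way pipeline, 2-way data parallelism (8 GPUs).
print(rank_to_coord(5, tp=2, pp=2, dp=2))
# Coord(data=1, pipeline=0, tensor=1)

# With 8-way tensor parallelism, rank 13's group is ranks 8 through 15.
print(tensor_group(13, tp=8))
# [8, 9, 10, 11, 12, 13, 14, 15]
```

Keeping the most communication-intensive axis on consecutive ranks is a design choice in this sketch; real systems may order the axes differently depending on the interconnect.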
Model training is performed with mixed precision using the 16-bit bfloat16 format on NVIDIA’s Selene supercomputer with 560 DGX A100 nodes, resulting in an aggregate of 1.4 exaFLOP/s of peak 16-bit precision performance.
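The aggregate figure can be sanity-checked with a few lines of Python. The node count is from the paper. The 8 GPUs per DGX A100 node and the 312 TFLOP/s peak dense 16-bit throughput per A100 are hardware figures assumed here, since this article does not state them.

```python
# Back-of-the-envelope check of the aggregate peak throughput.
NODES = 560                # from the paper
GPUS_PER_NODE = 8          # assumption: GPUs in one DGX A100 node
PEAK_TFLOPS_PER_GPU = 312  # assumption: A100 peak dense 16-bit TFLOP/s


def aggregate_peak_exaflops(nodes: int, gpus_per_node: int,
                            tflops_per_gpu: float) -> float:
    """Peak throughput in exaFLOP/s (1 exaFLOP/s = 1e6 TFLOP/s)."""
    return nodes * gpus_per_node * tflops_per_gpu / 1e6


total_gpus = NODES * GPUS_PER_NODE
peak = aggregate_peak_exaflops(NODES, GPUS_PER_NODE, PEAK_TFLOPS_PER_GPU)
print(total_gpus)       # 4480
print(round(peak, 1))   # 1.4
```

This is a theoretical peak only; sustained training throughput is lower and depends on how well communication and pipeline bubbles are hidden.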
The paper provides additional details on the 3D parallel system’s training process, training corpus design and data curation techniques, and includes evaluations on NLP tasks such as reading comprehension, commonsense reasoning and word sense disambiguation.
In experiments, MT-NLG achieved significant improvements in zero-, one-, and few-shot learning, indicating the proposed 3D parallel system’s effectiveness as a large-scale language model training strategy. The team believes their results and findings can help shape and facilitate future research in foundational, large-scale pretraining.
The paper Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model is on arXiv.
Author: Hecate He | Editor: Michael Sarazen