Training extremely large deep learning (DL) models on clusters of high-performance accelerators takes significant engineering effort, both to define the model and to specify the training cluster environment. It also typically requires tuning a complex combination of data, operator and pipeline parallelization approaches for the individual operators in the network.
Automating the parallelization of large-scale models could accelerate DL development and application, but remains a challenging task due to the complex structures involved. To address this issue, a research team from UC Berkeley, Amazon Web Services, Google, Shanghai Jiao Tong University and Duke University has proposed Alpa, a compiler system for distributed DL on GPU clusters that automatically generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on the models they were designed for.
The team summarizes their main contributions as:
- We construct a two-level parallel execution plan space where plans are specified hierarchically using inter- and intra-operator parallelisms.
- We design tractable optimization algorithms to derive optimal execution plans at each level.
- We implement Alpa, a compiler system for distributed DL on GPU clusters. Alpa features: (1) a set of compilation passes that generate execution plans using the hierarchical optimization algorithms, (2) a new runtime architecture that orchestrates the inter-op parallelism between stages and device meshes, and (3) a number of system optimizations that improve compilation and address cross-mesh communication.
- We evaluate Alpa on training large models with billions of parameters.
Existing DL parallelization strategies are usually categorized as data, operator, or pipeline parallelism. In data parallelism, the training data is partitioned across distributed workers, with each worker computing the parameter updates on its own data split. Operator parallelism, meanwhile, partitions the computation of a specific operator and computes each part in parallel across multiple devices. Pipeline parallelism places different groups of operators on different workers and splits the training batch into a number of micro-batches, then pipelines the forward and backward passes across micro-batches on the distributed workers.
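To make two of these strategies concrete, here is a minimal Python sketch. It is not taken from the paper: the one-parameter model, the function names and the forward-only schedule are all assumptions chosen for illustration. The first part shows that averaging per-worker gradients over equal-sized data splits reproduces the full-batch gradient, and the second lists which micro-batch each pipeline stage would process at each time step.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch evenly across workers, then average their gradients."""
    assert len(xs) % num_workers == 0, "sketch assumes equal-sized splits"
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # The plain average stands in for an all-reduce across real devices.
    return sum(local_grads) / num_workers


def pipeline_forward_schedule(num_stages, num_microbatches):
    """For each time step, list the (stage, microbatch) pairs that run together.

    Only forward passes are modelled; backward passes are omitted for brevity.
    """
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        steps.append(active)
    return steps
```

With two stages and three micro-batches, the schedule takes four time steps, and both stages are busy in the middle two steps. That overlap is what pipelining buys over running the stages strictly one after another.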
In a departure from this conventional schema, the team recategorizes existing parallelization approaches into two orthogonal categories: intra-operator and inter-operator parallelisms. Intra-operator parallelism partitions ML operators along any tensor axes (batch or non-batch) and dispatches the partitions to distributed devices, while inter-operator parallelism slices the model into disjoint stages and pipelines the execution of stages on different sets of devices.
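The "batch or non-batch" distinction can be illustrated with a single matrix multiplication. The sketch below is an assumption-laden toy in plain Python, not Alpa code: the helper names are hypothetical, lists stand in for tensors, and a loop stands in for separate devices. It partitions the same operator once along the batch axis of the input and once along the output axis of the weights, and both versions reproduce the unpartitioned result.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_rows(m, parts):
    """Cut a matrix into `parts` equal groups of rows."""
    size = len(m) // parts
    return [m[i * size:(i + 1) * size] for i in range(parts)]


def split_cols(m, parts):
    """Cut a matrix into `parts` equal groups of columns."""
    size = len(m[0]) // parts
    return [[row[i * size:(i + 1) * size] for row in m] for i in range(parts)]


def matmul_batch_sharded(x, w, devices):
    """Shard the batch axis of x; every device holds a full copy of w."""
    out = []
    for x_shard in split_rows(x, devices):
        out.extend(matmul(x_shard, w))  # results are concatenated by rows
    return out


def matmul_column_sharded(x, w, devices):
    """Shard the output (non-batch) axis of w; every device sees all of x."""
    partials = [matmul(x, w_shard) for w_shard in split_cols(w, devices)]
    # Stitch the per-device column blocks back together, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Sharding the batch axis corresponds to what is conventionally called data parallelism, while sharding a weight axis corresponds to operator parallelism. Under the team's recategorization, both are instances of intra-operator parallelism.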
A parallel execution plan can thus be expressed hierarchically by specifying the plan in each parallelism category, which brings two key advantages. First, it becomes possible to exploit the asymmetric communication bandwidth of a compute cluster by mapping intra-operator parallelism to devices connected with high bandwidth, while orchestrating inter-operator parallelism between distant devices with relatively lower bandwidth. Second, each level can be solved optimally as an individual tractable sub-problem, leading to strong performance improvements when training large models.
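The two-level idea can be sketched with a deliberately simplified toy. This is not Alpa's actual optimization: the cost model (work divides perfectly over the GPUs inside one node), the brute-force search, and all names are assumptions made for illustration. The outer level decides how to cut a chain of layers into one stage per node, and the inner level is reduced to a one-line cost estimate for running a stage across the GPUs within a node.

```python
from itertools import combinations


def stage_cost(layer_costs, gpus_per_node):
    """Toy intra-op model: a stage's work divides evenly over one node's GPUs."""
    return sum(layer_costs) / gpus_per_node


def best_stage_split(layer_costs, num_nodes, gpus_per_node):
    """Toy inter-op search: try every contiguous split into num_nodes stages.

    Returns (bottleneck_cost, stages) for the split whose slowest stage is
    fastest, since the slowest stage limits pipeline throughput.
    """
    n = len(layer_costs)
    best = None
    for cuts in combinations(range(1, n), num_nodes - 1):
        bounds = (0, *cuts, n)
        stages = [layer_costs[a:b] for a, b in zip(bounds, bounds[1:])]
        bottleneck = max(stage_cost(s, gpus_per_node) for s in stages)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, stages)
    return best
```

For example, with hypothetical per-layer costs `[4, 2, 2, 4, 4]`, two nodes and four GPUs per node, the search picks the balanced split `[4, 2, 2]` and `[4, 4]`. Exhaustive enumeration only works for toy sizes; the point is that each level can be treated as its own sub-problem.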
Overall, Alpa’s novel contribution as a compiler is that it generates model-parallel execution plans by hierarchically optimizing the plan at two different granularities: intra-op and inter-op parallelism.
The team evaluated Alpa on the training of large-scale models with billions of parameters, including GPT-3, GShard Mixture-of-Experts (MoE), and Wide-ResNet, and compared its performance against two state-of-the-art distributed systems: Nvidia’s Megatron-LM v2 and Microsoft’s DeepSpeed.
In the evaluations, Alpa achieved performance comparable with the specialized Megatron-LM system on GPT models. Compared to the hand-tuned DeepSpeed on GShard MoE models, Alpa achieved a 3.5x speedup on two nodes and a 9.7x speedup on four nodes.
The team believes Alpa can democratize distributed model-parallel learning and accelerate the adoption of emerging large deep learning models, and they plan to make Alpa’s source code publicly available.
The paper Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen