The structure of modern machine learning (ML) models can be described as a dataflow graph of connected layers. Neural network training first computes the model's output (the forward pass), then computes the gradient of the loss with respect to each layer's weights (the backward pass). Recent studies have shown that parallelizing computation across both training data and model parameters can produce dramatic benefits for ML workloads.
The general idea behind parallelization is to break a large problem into smaller parts that can be executed simultaneously by multiple processors or accelerator devices that coordinate by exchanging data. To enable general and scalable parallelization for ML computation graphs, a research team from Google recently proposed GSPMD, an automatic parallelism system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way, including data parallelism, within-layer model parallelism, spatial partitioning, weight-update (optimizer-state) sharding, and pipeline parallelism.
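As a concrete illustration of the simplest of these paradigms, data parallelism, consider the following minimal NumPy sketch (our own illustration, not GSPMD's actual API): the batch is split across simulated devices, each device computes a gradient on its local shard, and the per-device gradients are averaged the way an all-reduce would combine them.

```python
import numpy as np

def loss_grad(w, x, y):
    # Gradient of mean squared error 0.5 * ||x @ w - y||^2 w.r.t. w
    return x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))    # global batch of 8 examples
y = rng.normal(size=(8, 1))
w = np.zeros((4, 1))

# Data parallelism: shard the batch across 4 simulated devices
shards = zip(np.split(x, 4), np.split(y, 4))
local_grads = [loss_grad(w, xs, ys) for xs, ys in shards]

# "All-reduce": average the per-device gradients
g = np.mean(local_grads, axis=0)
# g matches the gradient computed on the full, unsharded batch
```

Because the loss is a mean over examples, averaging per-shard gradients reproduces the full-batch gradient exactly, which is why data parallelism is mathematically transparent to the model.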
GSPMD builds on Google's previous back-end system GShard and the Single Program Multiple Data (SPMD) model, and is designed as a general solution for common parallelism patterns in ML workloads. The team summarizes the benefits of GSPMD as:
- GSPMD produces a single program for all partitions instead of generating one program for each partition, which would increase compilation time significantly when there are many partitions.
- GSPMD supports unevenly partitioned dimensions, which allows any tensor to be partitioned on an arbitrary mesh of devices.
- GSPMD is implemented as an extension to a production ML compiler, XLA (Accelerated Linear Algebra, the optimizing compiler behind TensorFlow). The implementation covers the full set of operators in XLA, including those with complicated semantics.
- GSPMD does not require an accelerator platform to support dynamic tensor shapes or operator configurations.
- GSPMD supports nested patterns of parallelism: at the per-operator level, different types of dimensions can be partitioned across orthogonal subgroups of devices.
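The nested-parallelism point can be made concrete with a small NumPy sketch (an illustrative layout of our own, not GSPMD's notation): on a hypothetical 2x2 logical device mesh, the batch dimension is partitioned along one mesh axis while the feature dimension is partitioned along the orthogonal axis.

```python
import numpy as np

# A hypothetical 2x2 logical device mesh: axis 0 for data parallelism,
# axis 1 for model parallelism (names are illustrative only)
mesh = np.arange(4).reshape(2, 2)

x = np.arange(32).reshape(4, 8)  # (batch=4, features=8)

# Nested parallelism: shard batch over mesh axis 0, features over axis 1
blocks = {}
for i in range(2):
    for j in range(2):
        blocks[mesh[i, j]] = x[i * 2:(i + 1) * 2, j * 4:(j + 1) * 4]

# Each device holds a (2, 4) block; together the blocks tile the tensor
reassembled = np.block([[blocks[mesh[0, 0]], blocks[mesh[0, 1]]],
                        [blocks[mesh[1, 0]], blocks[mesh[1, 1]]]])
```

Each device ends up with one block, and the two partitioned dimensions are independent: changing the data-parallel degree does not affect how the model dimension is split.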
GSPMD defines an intuitive and general representation of tensor sharding that has two independent compiler transformations: sharding completion and per-operator partitioning.
The sharding property specifies how the data is distributed across devices. GSPMD defines three types of sharding: replicated, where every partition holds the full data; tiled, where the data is assigned via a multi-dimensional tensor of device IDs with the same rank as the data tensor; and partially tiled, where devices are divided into equally sized subgroups and the data tensor is replicated across devices within each subgroup but tiled across subgroups. GSPMD provides a convenient abstraction over these sharding representations and lets users employ a different device mesh for each tensor, so devices can be organized as a logical multi-dimensional tensor.
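The difference between these sharding types can be sketched in a few lines of NumPy (again an illustration of the concepts, not GSPMD's internal representation), with the partially tiled case replicating within subgroups while tiling across them:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
devices = np.arange(4)

# Replicated: every device holds the full tensor
replicated = {d: x for d in devices}

# Partially tiled: two subgroups of two devices each; the tensor is
# tiled across subgroups (rows 0-1 vs. rows 2-3) and replicated
# within each subgroup
subgroups = [[0, 1], [2, 3]]
partial = {}
for g, group in enumerate(subgroups):
    tile = x[g * 2:(g + 1) * 2]    # this subgroup's tile
    for d in group:
        partial[d] = tile          # replicated inside the subgroup
```

Partially tiled sharding is what makes hybrid schemes possible, e.g. replicating weights within a data-parallel subgroup while sharding them across model-parallel subgroups.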
The team explains that GSPMD auto-completes the sharding on every tensor from limited user annotations: it preserves dimensions through operators, merges compatible shardings, propagates shardings iteratively with priorities, and provides guidance for users. This makes GSPMD's sharding decisions intuitive to users even when they annotate only a small subset of tensors.
Besides assigning a sharding property to every tensor, GSPMD also rewrites each operator into an equivalent partitioned computation. There are two options for implementing the partitioner: creating a customized program for each partition (Multiple Programs Multiple Data, or MPMD), or creating a single program that works for all partitions (Single Program Multiple Data, or SPMD); GSPMD takes the latter approach, as its name suggests.
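The SPMD idea is that every device runs the same program, specialized only by its device ID. A minimal sketch (our own, using a sharded matrix multiplication as the example operator):

```python
import numpy as np

def spmd_matmul_shard(device_id, num_devices, x, w):
    # Single program, multiple data: every device runs this same
    # function, differing only in its device_id
    rows = len(x) // num_devices
    x_local = x[device_id * rows:(device_id + 1) * rows]  # local shard
    return x_local @ w                                    # local compute

x = np.arange(12.0).reshape(6, 2)
w = np.ones((2, 3))
# Simulate 3 devices all running the one program
outs = [spmd_matmul_shard(d, 3, x, w) for d in range(3)]
# Concatenating the local results reproduces the full x @ w
```

Because there is only one program regardless of the device count, compilation cost does not grow with the number of partitions, which is the scalability advantage the authors highlight.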
To evaluate the approach, the researchers applied GSPMD to widely used language and image models and measured performance on the Cloud TPUv3 platform. For language models, they applied 2D sharding to a dense transformer language model to test whether 2D sharding could fit the model weights into accelerator device memory. For image models, they spatially partitioned activation tensors of the 3D U-Net model to evaluate GSPMD's convolution partitioning performance.
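A back-of-the-envelope calculation (our own arithmetic, not from the paper, assuming bfloat16 weights and 16 GB of HBM per TPUv3 core) shows why sharding is necessary at this scale: a trillion-parameter model's weights only fit once they are split across thousands of cores.

```python
# Rough memory estimate for sharded weights (illustrative assumptions)
params = 1_000_000_000_000     # one trillion parameters
bytes_per_param = 2            # bfloat16 (assumed precision)
cores = 2048                   # largest configuration in the experiments

total_gb = params * bytes_per_param / 2**30       # ~1863 GiB unsharded
per_core_gb = total_gb / cores                    # under 1 GiB per core
# Unsharded weights vastly exceed a single core's memory; evenly
# sharded across 2048 cores they fit comfortably
```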
In the experiments, GSPMD achieved 50 to 62 percent compute utilization on 128 to 2048 Cloud TPUv3 cores for models with up to one trillion parameters. The results validate GSPMD as an effective single program for all devices that is also highly scalable: compilation time stays constant even as the number of devices increases.
The paper GSPMD: General and Scalable Parallelization for ML Computation Graphs is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang