Mixture of experts (MoE) is a promising deep learning architecture that can make training cost grow sublinearly with the number of parameters, easing model scaling. This paves the way for models capable of learning far more information and powering a wide range of tasks in fields such as computer vision, speech recognition and natural language processing.
MoE architectures employ an ensemble-style learning technique that decomposes a modelling task into sub-tasks and trains an expert model on each. A gating model then learns which experts to trust for a given input and combines their predictions. However, despite the non-trivial reduction in training cost that MoE models deliver, several bottlenecks still restrict their practical applicability.
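The gate-and-combine mechanism described above can be sketched in a few lines. This is a minimal, illustrative NumPy example with top-1 routing; the dimensions, random weights, and single-matrix "experts" are assumptions for clarity, not the paper's actual transformer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

# Each "expert" is a single weight matrix here (a real MoE uses FFN sub-networks).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
# The gating model is one linear layer producing a score per expert.
gate_w = rng.standard_normal((d_model, n_experts))

def moe_forward(x):
    """Route each token to its top-1 expert, scaled by the gate probability."""
    scores = x @ gate_w                                        # (tokens, experts)
    probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    out = np.empty_like(x)
    for i, e in enumerate(probs.argmax(-1)):
        out[i] = probs[i, e] * (x[i] @ experts[e])             # one expert per token
    return out

x = rng.standard_normal((n_tokens, d_model))
y = moe_forward(x)
```

Because only one expert runs per token, compute per token stays roughly constant no matter how many experts (and thus parameters) the model has, which is the source of the sublinear training cost.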
To address these limitations, a Microsoft research team has proposed DeepSpeed-MoE, an end-to-end MoE training and inference solution comprising a novel MoE architecture design and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions.
The paper identifies three shortcomings that have limited real-world MoE deployment — limited scope, massive memory requirements and limited inference performance — and details how they are addressed by the proposed DeepSpeed-MoE:
- Limited scope: We expand the scope of MoE-based models to auto-regressive NLG tasks, demonstrating a 5x training cost reduction at the same model quality for models like GPT-3 and MT-NLG. These results not only demonstrate clear opportunities to reduce the cost of training massive NLG models, but also open up the possibility of reaching much higher next-generation model quality within the limits of current-generation hardware resources.
- Massive memory requirements: We improve parameter efficiency of MoE based models by developing a novel MoE architecture that we call Pyramid-Residual MoE (PR-MoE). PR-MoE is a hybrid dense and MoE model created using residual connections, while applying experts only where they are most effective. PR-MoE can reduce MoE model parameter size by up to 3x with no change to model quality and minimal change to the compute requirements. In addition, we create a distilled version of PR-MoE, which we call Mixture-of-Students (MoS), via staged knowledge distillation. MoS reduces the MoE model size by up to 3.7x while retaining comparable model quality.
- Limited inference performance: We develop the DeepSpeed-MoE inference system, a highly optimized system that enables efficient scaling of inference workloads on hundreds of GPUs, providing up to 7.3x reduction in inference latency and cost compared with existing MoE inference solutions. It offers ultra-fast inference latencies (under 25 ms) for trillion-parameter MoE models. By combining system and model optimizations, DeepSpeed-MoE also offers up to 4.5x faster and 9x cheaper inference for MoE models compared to quality-equivalent dense models.
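The residual half of the PR-MoE design described above (a shared dense branch that every token passes through, with a gated expert added on top as a correction term) can be sketched as follows. This is an illustrative NumPy sketch under assumed dimensions and single-matrix experts, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts = 8, 4

# Shared dense MLP branch: always executed for every token.
dense_w = rng.standard_normal((d_model, d_model))
# Expert weights and gating layer, as in a standard MoE block.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts))

def residual_moe(x):
    """Dense output plus a gated top-1 expert acting as a residual correction."""
    dense_out = x @ dense_w
    scores = x @ gate_w
    probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    out = dense_out.copy()
    for i, e in enumerate(probs.argmax(-1)):
        out[i] += probs[i, e] * (x[i] @ experts[e])
    return out

x = rng.standard_normal((3, d_model))
y = residual_moe(x)
```

The design intuition is that routing a token to one expert plus a fixed dense path gives a benefit similar to top-2 routing, at roughly top-1 compute cost.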
The team first expands the scope of MoE-based models to auto-regressive natural language generation (NLG) tasks, demonstrating that MoE can achieve a 5x training cost saving on NLG models compared to dense counterparts such as GPT-3, while maintaining model quality.
The researchers then introduce the Pyramid-Residual MoE (PR-MoE) to reduce the standard MoE model parameter size by up to 3x with no change in model quality and negligible impact on compute requirements. At the 350M parameter scale, PR-MoE uses less than one-third of the parameters of the Standard-MoE; at the 1.3B scale, it uses only about 60 percent of the Standard-MoE's parameters while achieving similar accuracy.
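The pyramid half of the design places fewer experts in early layers and more in later layers, where they matter most. A back-of-the-envelope parameter count shows how this shrinks the model; the layer split, expert counts (32 early, 128 late), and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
# Assumed sizes for illustration only.
d_model = 1024                            # hidden size
d_ff = 4096                               # expert FFN width
per_expert = 2 * d_model * d_ff           # two weight matrices per expert FFN
n_layers = 12                             # number of MoE layers

# Standard-MoE: the same expert count at every MoE layer.
standard = n_layers * 128 * per_expert
# Pyramid-MoE (assumed split): fewer experts early, more experts late.
pyramid = 6 * 32 * per_expert + 6 * 128 * per_expert

ratio = pyramid / standard
print(f"pyramid/standard expert-parameter ratio: {ratio:.2f}")
```

Under these assumed counts the pyramid variant keeps about 62 percent of the expert parameters, in the same ballpark as the roughly 60 percent figure reported for the 1.3B model.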
The team then combines these innovations to create DeepSpeed-MoE, a highly optimized MoE inference system that enables efficient scaling and ultra-fast inference latencies.
In empirical evaluations on MoE models ranging from 107 billion to 2 trillion parameters using PyTorch and DeepSpeed, the proposed DeepSpeed-MoE achieved up to 7.3x reductions in latency with up to 7.3x higher throughput compared to the baseline. By effectively exploiting hundreds of GPUs in parallel, DeepSpeed-MoE demonstrated an unprecedented scale for low-latency inference: a trillion-parameter MoE model can be served in under 25 ms.
The innovations and infrastructure introduced in this work open a promising direction for addressing the training cost problems of current large-scale deep learning models, and take a step towards training and inference for the next generation of AI-scale models.
The researchers plan to open-source DeepSpeed-MoE as part of the DeepSpeed library. The code, tutorials, and documents will be available on the project’s GitHub. The paper DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale is on arXiv.
Author: Hecate He | Editor: Michael Sarazen