In the ongoing effort to scale AI systems without incurring prohibitively high training and compute costs, sparse mixture-of-experts (MoE) models have shown their potential for achieving impressive neural network pretraining speedups by dynamically selecting only the relevant parameters for each input. This enables such networks to vastly expand their parameter counts while keeping their FLOPs per token (compute) roughly constant. Advancing MoE models to state-of-the-art performance has, however, been hindered by training instabilities and uncertain quality during fine-tuning.
To address these issues, a research team from Google AI and Google Brain has published a set of guidelines for designing more practical and reliable sparse expert models. The team tested its recommendations by pretraining a 269B-parameter sparse model, which it says is the first sparse model to achieve state-of-the-art results across a diverse set of natural language processing (NLP) benchmarks.
The team summarizes their main contributions as:
- A large-scale study of the quality-stability trade-offs of stability techniques.
- An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.
- A fine-tuning analysis of sparse and dense models highlighting different hyperparameter sensitivity to the batch size and learning rate. We show bad hyperparameters result in virtually no fine-tuning gain over dense models, despite large pretraining speedups.
- Architectural, routing and model design principles for designing Pareto efficient sparse models in a distributed setting.
- A qualitative analysis tracing token routing decisions across expert layers.
- A 269B sparse model (the Stable Transferable Mixture-of-Experts or ST-MoE-32B) which achieves state-of-the-art performance across a diverse set of natural language benchmarks.
Sparse models often suffer from training instability issues, and existing methods for improving stability can result in diminished model quality. The team explored this and other issues in a large-scale study that yielded various insights with regard to training stability, fine-tuning and model design.
The researchers first examined approaches for improving stability, including removing multiplicative interactions, injecting model noise, and constraining activations and gradients. They concluded that a new auxiliary loss, the router z-loss (an adaptation of the z-loss applied to the final softmax logits in the Mesh TensorFlow codebase of Shazeer et al., 2018), can significantly improve training stability with no quality degradation.
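To make the idea concrete, here is a minimal Python sketch of what such a loss computes. It assumes the definition given in the paper: the squared log-sum-exp of each token's router logits, averaged over all tokens. The function names and toy logits are illustrative assumptions, not the team's code, and in training the term would be scaled by a small coefficient before being added to the main loss.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def router_z_loss(router_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits.

    router_logits: list of per-token lists, one logit per expert.
    """
    per_token = [logsumexp(row) ** 2 for row in router_logits]
    return sum(per_token) / len(per_token)


# Toy logits (hypothetical): two tokens routed over four experts.
small = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.0, 0.0, 0.0]]
# The same logits scaled up tenfold, mimicking logits that have drifted large.
large = [[10.0 * v for v in row] for row in small]

if __name__ == "__main__":
    print("z-loss, small logits:", router_z_loss(small))
    print("z-loss, large logits:", router_z_loss(large))
```

Because the penalty grows with the magnitude of the logits entering the router's softmax, minimizing it discourages the very large values that tend to cause numerical trouble there.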
The team then conducted a sensitivity analysis to explore how sparse and dense models respond to the fine-tuning protocol, focusing on two hyperparameters: batch size and learning rate. They discovered that sparse models benefit from smaller batch sizes and a higher learning rate, and suggest that these changes may improve generalization by introducing more noise into the fine-tuning process. They also stress the need to tune both hyperparameters correctly during fine-tuning, as simply reusing the values that worked well for a dense model can mask any pretraining improvements obtained by the sparse model.
The researchers also examined several questions involved in sparse model design, such as how many experts to use, which routing algorithm to choose, what value to set for the capacity factor, and how hardware changes these decisions. They summarize their resulting design recommendations as:
- In our setup, we recommend top-2 routing with a 1.25 capacity factor and at most one expert per core.
- The capacity factor can be changed during evaluation to adjust to new memory/compute requirements.
- Dense layer stacking and a multiplicative bias can boost quality.
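The interaction between top-2 routing and the capacity factor can be illustrated with a short Python sketch. It rests on several assumptions that do not come from the article: expert capacity is taken as the capacity factor times tokens divided by experts, rounded up; first choices are placed before second choices; and tokens that arrive at a full expert are simply dropped. Real implementations differ in these details, and all names here are hypothetical.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum tokens one expert accepts (assumed definition, rounded up)."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits, capacity_factor=1.25):
    """Assign each token to up to two experts, respecting expert capacity."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    capacity = expert_capacity(num_tokens, num_experts, capacity_factor)
    probs = [softmax(row) for row in router_logits]
    # Each token's two highest-probability experts, best first.
    choices = [
        sorted(range(num_experts), key=lambda e: p[e], reverse=True)[:2]
        for p in probs
    ]
    load = [0] * num_experts
    assignments = [[] for _ in range(num_tokens)]
    # Place every first choice before any second choice (an assumption).
    for rank in range(2):
        for token, top2 in enumerate(choices):
            expert = top2[rank]
            if load[expert] < capacity:
                load[expert] += 1
                assignments[token].append((expert, probs[token][expert]))
            # Otherwise the expert is full and this assignment is dropped.
    return assignments, load, capacity


if __name__ == "__main__":
    # Eight tokens that all prefer expert 0, then expert 1 (toy data).
    logits = [[2.0, 1.0, 0.0, 0.0] for _ in range(8)]
    for cf in (1.25, 2.0):
        _, load, capacity = top2_route(logits, capacity_factor=cf)
        print(f"capacity factor {cf}: capacity={capacity}, load={load}")
```

Since the capacity factor is just an argument here, the same router logits can be re-routed with a different value at evaluation time, which mirrors the second recommendation above.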
Finally, the team applied these principles to design and train a 269B-parameter sparse model that achieved state-of-the-art performance across a set of NLP tasks that included reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed-book question answering (WebQA, Natural Questions), and adversarially constructed tasks (WinoGrande, ANLI R3).
The study demonstrates how a model one-fifth the size, but with a better balance of computation to parameters, can be a more effective sparse learner. The team believes their work validates the power of model sparsity and hopes the proposed guidelines can accelerate the future adoption of such models.
Author: Hecate He | Editor: Michael Sarazen