Hyperparameter (HP) tuning is a laborious, time-consuming and expensive process for today’s deep neural networks (DNNs), which often scale to billions of parameters. The recently proposed Maximal Update Parametrization (µP) addresses this issue by enabling “maximal” feature learning in the infinite-width limit, with the result that many optimal HPs remain stable as model size changes.
A team from Microsoft and OpenAI builds on this research in the new paper Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. Their proposed µTransfer method leverages µP to zero-shot transfer HPs from small models and obtain near-optimal HPs on large models without directly tuning them.
Paper co-author Greg Yang tweeted: “You can’t train GPT-3 on a single GPU, much less tune its hyperparameters (HPs). But what if I tell you you *can* tune its HPs on a single GPU thanks to new theoretical advances?” In experiments, when transferring from a 40M-parameter model, the proposed approach outperformed published numbers for the 6.7B-parameter GPT-3 model at a tuning cost of only 7 percent of the total pretraining cost.
The team summarizes their main contributions as:
- We demonstrate it is possible to zero-shot transfer near-optimal HPs to a large model from a small version via the Maximal Update Parametrization (µP).
- While the original µP work only covered stochastic gradient descent (SGD), here we derive µP for Adam as well.
- We thoroughly verify our method on machine translation and large language model pretraining as well as image classification.
- We release a PyTorch package for implementing µTransfer painlessly.
The team starts with the premise that HPs do not transfer conventionally, noting that there are conflicting assumptions about HP stability in the deep learning research community. While many HP-tuning approaches are informed by the assumption that models of different sizes are not expected to share optimal HPs, some works fix all HPs when comparing against baselines, suggesting that optimal HPs should be stable both across different sizes of a given model and across models of different designs.
The researchers examine HP instability issues across width in multilayer perceptrons (MLPs) and transformers in the standard parametrization, then show how µP solves these issues via changes to MLP layer initializations, learning rates and biases, and the attention logit in transformers. They unlock zero-shot transfer capability with µP to produce the proposed µTransfer HP tuning technique.
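One of the concrete changes µP makes to transformers is scaling attention logits by 1/d rather than the conventional 1/√d, where d is the head dimension, so that logits remain well-behaved as width grows. A minimal sketch in plain Python (the function name is ours, for illustration only):

```python
import math

def attention_logit(q, k, d_head, use_mup=True):
    """Dot-product attention logit for one query/key pair.

    Standard parametrization divides q.k by sqrt(d_head); muP divides
    by d_head instead, keeping logits stable as width increases.
    """
    dot = sum(qi * ki for qi, ki in zip(q, k))
    scale = d_head if use_mup else math.sqrt(d_head)
    return dot / scale

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, 0.5, 0.5, 0.5]
print(attention_logit(q, k, d_head=4))             # muP: q.k / 4 = 1.25
print(attention_logit(q, k, d_head=4, use_mup=False))  # SP: q.k / 2 = 2.5
```

The same dot product yields a smaller logit under µP whenever d > 1, which is precisely the rescaling that keeps attention behavior comparable across widths.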
Tuning large DNNs via µTransfer is done in three steps: 1) Parametrize the target model in Maximal Update Parametrization (µP); 2) Tune a smaller version (in width and/or depth) of the target model; 3) Copy the tuned hyperparameters to the target model.
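The key point of step 3 is that the tuned HPs are copied verbatim; the width-dependent scaling lives inside the parametrization itself. As a simplified illustration (not the released package's API; the helper name and structure are ours), µP with Adam scales a hidden weight's effective learning rate like 1/fan_in and its init standard deviation like 1/√fan_in:

```python
def mup_hidden_hps(lr_base, std_base, fan_in_base, fan_in_target):
    """Illustrative muP width-scaling for a hidden weight under Adam.

    HPs tuned at the base width stay near-optimal at the target width
    because muP rescales them: LR ~ 1/fan_in, init std ~ 1/sqrt(fan_in).
    """
    ratio = fan_in_base / fan_in_target
    return lr_base * ratio, std_base * ratio ** 0.5

# Tune at width 256, then transfer to a 32x wider model:
lr, std = mup_hidden_hps(lr_base=1e-3, std_base=0.02,
                         fan_in_base=256, fan_in_target=8192)
print(lr, std)
```

In practice the team's released PyTorch package automates this bookkeeping, so users specify base shapes once and reuse the small model's HPs unchanged.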
In their empirical study, the team applied µTransfer with transformers on the IWSLT14 De-En and WMT14 En-De datasets, BERT, and GPT-3.
The proposed µTransfer achieved impressive results in all scenarios, outperforming published numbers on BERT-large (350M parameters) by transferring pretraining HPs from a 13M-parameter model, and outperforming published numbers for the 6.7B-parameter GPT-3 model by transferring from a 40M-parameter model at a tuning cost of only 7 percent of the total pretraining cost.
The team summarizes the benefits of their approach as:
- Better Performance: µTransfer does more than predict how the optimal learning rate scales in standard parametrization (SP); models parametrized in µP and tuned via transfer can outperform their SP counterparts.
- Speedup: It provides massive speedups in the tuning of large models.
- Tune Once for Whole Family: For any fixed family of models with varying width and depth (such as the BERT or GPT-3 family), we only need to tune a single small model and can reuse its HPs for all models in the family.
- Better Compute Utilization: While large model training needs to be distributed across many GPUs, small model tuning can be done on individual GPUs, greatly increasing the level of parallelism for tuning (and in the context of organizational compute clusters, better scheduling and utilization ratio).
- Painless Transition from Exploration to Scaling Up: Often, researchers explore new ideas on small models but, when scaling up, find their HPs optimized during exploration work poorly on large models. µTransfer would solve this problem.
Overall, this work shows it is possible to transfer HPs across depth, batch size, sequence length and training time (with a few caveats). This will enable researchers to avoid expensive HP tuning procedures by indirectly tuning very large networks via HP transfer from their smaller counterparts.
The code and PyTorch package are available on the project’s GitHub. The paper Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer is on arXiv.
Author: Hecate He | Editor: Michael Sarazen