Today’s extreme-scale language models have demonstrated astounding performance on natural language processing tasks, attributed mainly to their ever-expanding size, which can surpass 500 billion parameters. But while these models have scaled dramatically in recent years, the amount of data used to train them has not kept pace.
In the new paper Training Compute-Optimal Large Language Models, a DeepMind research team posits that current large language models are significantly undertrained and, based on empirical outcomes of over 400 training runs, proposes three predictive approaches for optimally setting both model size and training duration.
The researchers start with the question: Given a fixed FLOPs budget, how should one trade off model size against the number of training tokens? The paper proposes three approaches for estimating the optimal allocation between parameters and training tokens:
- Fix model sizes and vary the number of training tokens
- IsoFLOP profiles
- Fitting a parametric loss function
The team models the final pretraining loss as a function of the number of model parameters and the number of training tokens. As the computational budget is a deterministic function of the number of seen training tokens and model parameters, they minimize the loss function under the constraint of the FLOPs function, which is equal to the computational budget.
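This constrained setup can be made concrete with the widely used approximation that training a model with N parameters on D tokens costs about 6·N·D FLOPs (the paper relies on a similar accounting). A minimal sketch, where the 5.76e23 budget is simply an illustrative number in the rough range of the paper's large-scale experiments:

```python
# Training FLOPs are commonly approximated as 6 * N * D for a model
# with N parameters trained on D tokens. Fixing a budget C therefore
# ties the two variables together: choosing N determines D.
def flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def tokens_for_budget(budget, n_params):
    """Tokens affordable for a given model size under budget C."""
    return budget / (6 * n_params)

# With the budget fixed, minimizing L(N, D) reduces to a one-variable
# search over N, since D = C / (6 * N). For example, an illustrative
# 5.76e23 FLOPs budget affords a 70B-parameter model ~1.37e12 tokens:
d = tokens_for_budget(5.76e23, 70e9)
```

The key consequence is that the two-variable optimization over (N, D) collapses to a one-dimensional search once the compute budget is fixed.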
In the first approach, Fix model sizes and vary the number of training tokens, the researchers vary the number of training steps for a fixed family of models, training each model for four different numbers of training sequences. From these runs, they can directly extract an estimate of the minimum loss achieved for a given number of training FLOPs.
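The extraction step amounts to taking, for each FLOP count, the minimum loss seen across all model sizes. A sketch with hypothetical (FLOPs, loss) points standing in for real training logs:

```python
# Hypothetical (flops, loss) points from training several fixed-size
# models for varying numbers of tokens; real values would come from
# the training runs' logs.
runs = {
    "small":  [(1e18, 3.10), (2e18, 2.95), (4e18, 2.88)],
    "medium": [(2e18, 3.00), (4e18, 2.80), (8e18, 2.70)],
    "large":  [(4e18, 2.95), (8e18, 2.68), (1.6e19, 2.55)],
}

def loss_envelope(runs):
    """For each FLOP count seen in any run, keep the minimum loss
    achieved across all model sizes -- the compute-efficient frontier."""
    frontier = {}
    for points in runs.values():
        for flops, loss in points:
            if flops not in frontier or loss < frontier[flops]:
                frontier[flops] = loss
    return dict(sorted(frontier.items()))

frontier = loss_envelope(runs)
```

At 4e18 FLOPs, for instance, the medium model's 2.80 beats both neighbors, so it defines the frontier point there.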
The IsoFLOP profiles approach meanwhile varies the model size for a fixed set of nine different training FLOP counts and considers the final training loss for each point, thus answering the question: For a given FLOP budget, what is the optimal parameter count?
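A common way to read off the optimum from such a profile is to fit a parabola to final loss as a function of log model size and take its vertex. A sketch with hypothetical losses for one fixed FLOP budget (the fitting-a-parabola step is an assumption of this illustration, consistent with how IsoFLOP minima are typically located):

```python
import numpy as np

# Hypothetical final losses for one fixed FLOP budget, sweeping model
# size -- an "IsoFLOP profile". Loss is high for models that are too
# small or too large for the budget, with a minimum in between.
model_sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
final_losses = np.array([2.90, 2.72, 2.65, 2.70, 2.88])

# Fit a parabola to loss vs. log10(model size) and read off its
# vertex, which estimates the loss-minimizing parameter count.
log_n = np.log10(model_sizes)
a, b, c = np.polyfit(log_n, final_losses, 2)
optimal_log_n = -b / (2 * a)      # vertex of the fitted parabola
optimal_params = 10 ** optimal_log_n
```

Repeating this for each of the nine FLOP budgets yields one optimal model size per budget, from which a scaling trend can be fit.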
In the third approach, Fitting a parametric loss function, the researchers model all final losses from the experiments in Approaches 1 and 2 as a parametric function of model parameter count and the number of seen tokens. The proposed functional form comprises three terms, capturing: the loss of an ideal generative process on the data distribution; the degree to which a perfectly trained transformer of a given size underperforms that ideal process; and the fact that the transformer is not trained to convergence, as only a finite number of optimization steps are taken on a sample of the dataset distribution.
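The three-term decomposition can be sketched as a function L(N, D) = E + A/N^α + B/D^β; the constants below are illustrative placeholders, not the paper's fitted values:

```python
# Parametric form L(N, D) = E + A / N**alpha + B / D**beta, with each
# term matching one component of the decomposition. All constants here
# are illustrative placeholders.
E = 1.7                  # loss of an ideal generative process
A, ALPHA = 400.0, 0.34   # penalty for finite model size N
B, BETA = 410.0, 0.28    # penalty for finite data / optimization D

def parametric_loss(n_params, n_tokens):
    ideal = E                              # irreducible term
    finite_model = A / n_params**ALPHA     # shrinks as the model grows
    finite_data = B / n_tokens**BETA       # shrinks with more tokens
    return ideal + finite_model + finite_data

# Both finite-size penalties vanish in the limit, leaving only E:
big = parametric_loss(1e18, 1e18)
```

The structure makes the trade-off explicit: for a fixed compute budget, growing N shrinks the first penalty but, via the FLOPs constraint, shrinks D and inflates the second.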
The researchers empirically estimate these functions based on the losses of over 400 trained models, ranging from 70 million to over 16 billion parameters and trained on 5 billion to over 400 billion tokens, across all three approaches. Guided by the resulting estimates, they train a compute-optimal 70B parameter model they dub “Chinchilla” on the same compute budget as Gopher.
The results show that the proposed 70B Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). The team also discovers that despite using different fitting methodologies and different trained models, these three approaches yield comparable predictions for the optimal scaling in parameters and tokens with FLOPs.
Overall, this work takes a step toward establishing an optimal training paradigm for large auto-regressive language models on a given compute budget. Although it is common practice to increase model size without correspondingly increasing the number of training tokens, the team suggests that for every doubling of model size the number of training tokens should also be doubled; and that employing larger, high-quality training datasets can lead to improved performance on downstream tasks.
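Under the 6·N·D FLOPs approximation, the equal-scaling prescription means both model size and token count grow with the square root of the compute budget, so quadrupling compute doubles each. A small arithmetic sketch (the 70B / 1.4T starting point reflects Chinchilla's widely reported configuration):

```python
def scale_allocation(n_params, n_tokens, budget_multiplier):
    """Scale an allocation compute-optimally under the equal-scaling
    rule: since FLOPs ~ 6 * N * D, both N and D grow with the square
    root of the compute multiplier."""
    k = budget_multiplier ** 0.5
    return n_params * k, n_tokens * k

# Quadrupling the budget doubles both model size and token count:
n, d = scale_allocation(70e9, 1.4e12, 4.0)
```

This is the concrete content of the “double the model, double the data” guidance: scaling parameters without scaling tokens leaves a model undertrained for its compute.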
The paper Training Compute-Optimal Large Language Models is on arXiv.
Author: Hecate He | Editor: Michael Sarazen