Studies have shown that scaling up powerful pretrained models and their training data sizes significantly improves performance, and that these performance improvements can transfer to downstream tasks, even in few-shot settings. But is there a limit to the performance improvements attainable via such model size and training data scale-ups?
To answer this question, a Google Research team conducted a systematic exploration comprising more than 4,800 experiments on Vision Transformer, MLP-Mixer and ResNet architectures with parameters ranging from 10 million to 10 billion, evaluated on more than 20 downstream image recognition tasks. The study, Exploring the Limits of Large Scale Pre-training, aims to capture the nonlinear relationships between performance on upstream and downstream tasks.
The team first examined how performance improvements on upstream (US) tasks affect performance on different downstream (DS) tasks, across a wide range of experiments that varied in model size and shape, optimization method, compute and hyperparameters. They plotted DS-vs-US performance for more than 3,000 experiments with architectures pretrained on the huge JFT image dataset and evaluated on DS tasks in the few-shot setting (25 shots), and presented a similar plot of all 4,800 experiments with both 1-shot and 25-shot training.
The results show that when upstream accuracy is increased, the performance on downstream tasks saturates, and that different DS tasks have different saturation values. The team then investigated the reasons behind this saturation behaviour in the DS-vs-US accuracy plots, and why saturation occurs much earlier for some DS tasks compared to others.
The researchers discovered that performance saturation on DS tends to happen when a pretrained network lacks the fine-grained features required to perform well on a DS. They note that keeping US accuracy fixed and increasing DS accuracy does not necessarily lead to a vertical improvement, and that in some cases improving DS accuracy comes at the price of reduced US accuracy. They also investigated the effect of head weight decay and head learning rate, and examined generalization with regard to the observed phenomena.
The team summarizes their main findings and contributions as:
- We establish through extensive study that as we improve the performance of the upstream (US) task either by scaling up or hyper-parameter and architectural choices, the performance of downstream (DS) tasks shows a saturating behaviour. In our experiments, several DS tasks reach full saturation within the studied range.
- We demonstrate that given a set of models with similar US accuracy, the best model for a DS task TDS1 might have much worse performance on another DS task TDS2 compared to the best model for TDS2.
- Given the scale of experiments, it is crucial for the proposed model to not be impacted by the density of the points in the DS-vs-US plot. We argue and demonstrate that fitting the power law to the convex hull of experiments would circumvent the effect of sampling biases on the prediction of downstream accuracy and show the robustness of our model to sample size variations.
- Having observed the nonlinear relationship between upstream and downstream accuracy, to predict downstream performance for a given upstream accuracy, we model their relationship with a power law curve and establish that it captures the behaviour well even with a small number of samples.
- We study how scaling up the model size, data size, and compute affects DS performance and show that these parameters impact DS performance mainly through the US performance.
- We further explore the discrepancy between upstream and downstream performances and show that for some choices of hyperparameters, they might be at odds with each other. In particular, we showcase how the optimal hyper-parameters for the head used in pre-training (upstream task) are different for US and DS.
- Finally, we show how our observations are robust to several choices such as the size of upstream data, choice of common scaling of accuracy, number of shots, transfer vs few-shot setting and architecture.
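The convex-hull and power-law ideas in the list above can be illustrated with a minimal Python sketch. Several things here are assumptions made for illustration and are not taken from the article: the functional form e_DS = k * e_US^alpha + e_IR (errors are 1 minus accuracy, with e_IR an irreducible downstream error), the use of only the upper part of the hull, the grid-search fitting procedure, and the function names `upper_hull` and `fit_power_law`. It is not the authors' implementation.

```python
import math


def _cross(o, a, b):
    # z-component of (a - o) x (b - o); positive means a counter-clockwise turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _linfit(xs, ys):
    # Ordinary least-squares line y = slope * x + intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def upper_hull(points):
    """Upper convex hull of (us_acc, ds_acc) pairs, ordered by US accuracy."""
    best = {}
    for x, y in points:
        # Keep only the best DS accuracy seen at each US accuracy.
        best[x] = max(best.get(x, y), y)
    hull = []
    for p in sorted(best.items()):
        # Drop the last hull point while it lies on or below the new edge.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def fit_power_law(hull, grid=200):
    """Fit e_ds = k * e_us**alpha + e_ir (assumed form) to hull points.

    Grid-searches e_ir below the smallest observed DS error and, for each
    candidate, fits k and alpha by linear regression in log space.
    Requires all accuracies to be strictly below 1.
    """
    e_us = [1.0 - x for x, _ in hull]
    e_ds = [1.0 - y for _, y in hull]
    log_us = [math.log(u) for u in e_us]
    e_min = min(e_ds)
    best = None
    for i in range(grid):
        e_ir = e_min * i / grid
        alpha, log_k = _linfit(log_us, [math.log(d - e_ir) for d in e_ds])
        k = math.exp(log_k)
        sse = sum((k * u ** alpha + e_ir - d) ** 2 for u, d in zip(e_us, e_ds))
        if best is None or sse < best[0]:
            best = (sse, k, alpha, e_ir)
    return best[1:]


# Synthetic (US accuracy, DS accuracy) pairs, made up purely for illustration.
points = [(0.40, 0.61), (0.50, 0.70), (0.50, 0.55), (0.60, 0.77),
          (0.65, 0.60), (0.70, 0.83), (0.80, 0.87)]
hull = upper_hull(points)
k, alpha, e_ir = fit_power_law(hull)
print("hull points:", hull)
print("k=%.3f alpha=%.3f e_ir=%.3f" % (k, alpha, e_ir))
```

Fitting only the hull points means that adding many more mediocre runs at a given upstream accuracy does not move the curve, which is the sampling-bias argument made in the list. In this assumed form, a fitted e_ir above zero corresponds to a downstream accuracy that saturates below 100% no matter how far upstream accuracy improves.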
In a challenge to the common narrative, the researchers conclude that scaling does not lead to a one-model-fits-all solution. They propose that there is no single pretrained checkpoint that will perform well on all possible downstream tasks, and that rather than focusing on a particular downstream task, researchers should make design choices that improve performance across a wide breadth of downstream tasks.
The paper Exploring the Limits of Large Scale Pre-training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen