State-of-the-art AI models have ballooned to billions of parameters in recent years. Although the machine learning (ML) community has shown keen interest in the scaling properties of transformer-based models, relatively little research has examined how the inductive biases imposed by different model architectures affect scaling.
In the new paper Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?, a research team from Google and DeepMind posits that understanding the connections between neural network architectures and scaling laws is essential for designing and evaluating new models. The team pretrains and finetunes over 100 models to reveal useful insights on the scaling behaviours of ten diverse model architectures.
The team summarizes their main contributions as:
- For the first time, we derive scaling laws for different inductive biases and model architectures. We find that this scaling coefficient differs greatly from model to model.
- We observe that models that operate well in one compute-scale region are not necessarily the best in another compute region. Moreover, we find that certain models have difficulty scaling despite performing decently (comparably) in lower-compute regions.
- We find that when it comes to scaling different model architectures, upstream pretraining perplexity might not correlate well with the downstream transfer. Hence, the underlying architecture and inductive bias is also crucial for downstream transfer.
- We highlight the difficulties of scaling with certain architectures and show that some models do not scale (or scale with a negative trend). We also find concerning trends where linear time attention models such as Performer struggle with scaling up.
The study sets out to answer three questions: Do different model architectures scale differently? How does inductive bias affect scaling behaviour? How does scaling impact upstream and downstream model performance?
To answer these questions, the team conducted extensive experiments on a broad spectrum of models, including well-established transformer variants such as Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021); lightweight models such as Google’s ALBERT; and efficient transformers such as Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020). The study also looks at non-transformer architectures that include Lightweight Convolutions (Wu et al., 2019), Dynamic Convolutions (Wu et al., 2019), and MLP-Mixers (Tolstikhin et al., 2021).
The study reports the number of trainable parameters, FLOPs (of a single forward pass) and speed (steps per second) for different architectures, as well as validation perplexity (on upstream pretraining) and results on 17 downstream tasks.
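To make the idea of a per-architecture scaling coefficient concrete, here is a minimal sketch that fits a power law to compute and perplexity by least squares in log-log space. It is not the paper's fitting procedure: the function name `fit_power_law`, the two architectures and all the numbers are made up for illustration.

```python
import math

def fit_power_law(compute, metric):
    """Fit metric ~ a * compute**b by least squares on the logs.

    Returns (a, b); b plays the role of the scaling coefficient.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(m) for m in metric]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)   # intercept, mapped back from log space
    return a, b

# Synthetic data (assumption): FLOPs and validation perplexity for two
# hypothetical architectures, generated from exact power laws.
flops = [1e9, 1e10, 1e11, 1e12]
arch_a = [62.0 * f ** -0.05 for f in flops]  # worse at small scale, steeper slope
arch_b = [30.0 * f ** -0.02 for f in flops]  # better at small scale, flatter slope

a_a, b_a = fit_power_law(flops, arch_a)
a_b, b_b = fit_power_law(flops, arch_b)
print(f"arch_a: a={a_a:.2f}, b={b_a:.3f}")
print(f"arch_b: a={a_b:.2f}, b={b_b:.3f}")
```

In this toy data the second architecture has the lower perplexity at the smallest budget, while the first overtakes it at the largest, which is the kind of ranking change across compute regions that the contributions describe.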
The team’s analysis leads to three conclusions: architecture plays a crucial role in scaling, because scaling behaviour is intertwined with architectural choices; some models do well on upstream perplexity but fail to transfer to downstream tasks; and the relative performance of different architectures can change from one scale to another. The team also shows that introducing novel inductive biases can be risky when scaling, and suggests that ML practitioners keep this in mind before committing to expensive runs on transformer architectures that drastically modify the attention mechanism.
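One simple way to check whether upstream perplexity tracks downstream transfer is a rank correlation across models. The sketch below computes Spearman's rho in plain Python, assuming no tied values; the helper names and the five data points are hypothetical and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Synthetic data (assumption): one entry per hypothetical model.
upstream_perplexity = [2.5, 2.6, 2.7, 2.8, 2.9]      # lower is better
downstream_accuracy = [70.0, 74.0, 72.0, 65.0, 73.0]  # higher is better

# Negate perplexity so that "higher is better" holds on both axes.
upstream_quality = [-p for p in upstream_perplexity]
rho = spearman(upstream_quality, downstream_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A rho close to 1 would mean that the upstream ranking carries over to downstream tasks; values near zero or below would indicate the kind of mismatch the authors warn about.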
The paper Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? is on arXiv.
Author: Hecate He | Editor: Michael Sarazen