Transformer models are deployed in a wide array of contexts, from multi-accelerator clusters to stand-alone mobile devices. The varied inference requirements of these settings compel practitioners to train foundation models aimed at broad generalization. However, the substantial cost of training limits the number of model sizes that can be supported, leaving coverage gaps across a wide range of downstream applications.
In a new paper, MatFormer: Nested Transformer for Elastic Inference, a research team from Google Research, the University of Texas at Austin, the University of Washington and Harvard University proposes MatFormer, a Transformer architecture inherently designed for elasticity. It enables the training of a single universal model capable of yielding numerous smaller submodels without any additional training.
The team summarizes their main contributions as follows:
- We introduce MatFormer, which incorporates a nested sub-structure within the standard Transformer and jointly optimizes all g granularities to produce a single, universal elastic model.
- Employing Mix’n’Match of granularities across layers in a universal MatFormer model yields hundreds of accurate and consistent submodels without any additional training cost.
- MatFormer generalizes effectively to both decoder-only language models (MatLM) and vision encoders (MatViT), scaling as reliably and accurately as the standard Transformer, while enabling significantly faster autoregressive generation and large-scale adaptive dense retrieval.
MatFormer adheres to the concept of matryoshka representation learning by introducing nested substructures in both the attention and feedforward network (FFN) blocks of the Transformer. This nested structure is applied to the hidden representations of the FFN block, enhancing the model’s capabilities.
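The nested FFN idea can be illustrated with a minimal sketch, assuming a PyTorch-style implementation (the class and argument names below are ours, not the paper's): each granularity uses only a leading slice of a single shared weight matrix, so every smaller submodel is literally contained inside the universal model's parameters.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Hypothetical MatFormer-style FFN: g nested granularities share one
    set of weights; granularity i uses only the first hidden_sizes[i]
    hidden units, so smaller submodels are slices of the largest one."""

    def __init__(self, d_model, hidden_sizes):
        super().__init__()
        self.hidden_sizes = sorted(hidden_sizes)  # e.g. [1024, 2048, 4096]
        d_ff = self.hidden_sizes[-1]              # full (universal) width
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x, granularity=-1):
        # Select the nested sub-block for the requested granularity.
        m = self.hidden_sizes[granularity]
        h = torch.relu(x @ self.w_in.weight[:m].T + self.w_in.bias[:m])
        return h @ self.w_out.weight[:, :m].T + self.w_out.bias
```

Because all granularities read the same leading rows of `w_in` (and columns of `w_out`), jointly training them optimizes the shared parameters for every submodel at once.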
With this innovative architecture, the team can establish a substructure within the attention heads, organizing the heads from the “most” to the “least” significant. More significant heads are shared among more submodels, and this approach accelerates training by approximately 15% compared to training equivalent Transformer-based submodels independently. Additionally, it enables the extraction of numerous smaller submodels while maintaining accuracy, following the explicitly optimized submodel curve.
The researchers noted that by selecting different granularities for each MatFormer layer, they can generate a large number of accurate smaller models without additional optimization. They refer to this process as Mix’n’Match, and these additional model granularities exhibit high performance despite not being explicitly fine-tuned.
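A rough sketch of how Mix'n'Match expands the model family, assuming a simple budget-matching heuristic for choosing configurations (the function and its signature are our illustration, not the paper's API): with g granularities per layer and L layers, there are g**L candidate submodels, and one can enumerate them for small L and keep those whose total width best matches a compute budget.

```python
import itertools

def mix_n_match(hidden_sizes, num_layers, budget, top_k=3):
    """Enumerate per-layer granularity choices and return the top_k
    configurations whose total FFN width is closest to the budget."""
    candidates = itertools.product(hidden_sizes, repeat=num_layers)
    return sorted(candidates, key=lambda cfg: abs(sum(cfg) - budget))[:top_k]
```

In MatFormer, each chosen granularity indexes a nested slice of the single trained universal model, so every such configuration is a ready-to-use submodel requiring no further training.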
In their empirical study, the team demonstrates the effectiveness of MatFormer across various model categories (decoders and encoders), modalities (language and vision), and scales (up to 2.6 billion parameters). Notably, a 2.6 billion-parameter decoder-only MatFormer language model (MatLM) can extract smaller models ranging from 1.5 billion to 2.6 billion parameters, all displaying comparable validation loss and one-shot downstream performance when compared to independently trained counterparts.
In conclusion, this research highlights the capabilities of MatFormer, an inherently elastic Transformer architecture, enabling the training of a single universal model that can generate numerous precise smaller submodels without incurring additional costs.
The paper MatFormer: Nested Transformer for Elastic Inference is available on arXiv.
Author: Hecate He | Editor: Chain Zhang