Transformer-based neural networks have revolutionized natural language processing and computer vision thanks to their powerful performance, but most state-of-the-art models contain billions to trillions of parameters, which makes training them costly in both compute and time.
To address these limitations, in the new paper Composable Function-preserving Expansions for Transformer Architectures, a research team from Google DeepMind and the University of Toulouse introduces parameter expansion transformations for transformer-based neural networks that preserve the model's functionality, enabling its capacity to be expanded as needed.

The team summarizes its main contribution as six composable function-preserving transformations for Transformer-based architectures, each expanding one dimension of scale: 1) the size of the MLP internal representation, 2) the number of attention heads, 3) the size of the attention heads' output representation, 4) the size of the attention input representation, 5) the size of the transformer layers' input/output representations, and 6) the number of layers.

The MLP expansion transformation scales up the MLP by enlarging the dimension of its internal representation through transformations of its parameter matrices. The Head addition transformation adds an arbitrary number of new heads to a multi-head attention (MHA) component. The Heads expansion transformation enlarges the dimension of the representations generated by the attention heads. The Attention expansion transformation enlarges the key and query representation pairs used to generate the attention weights matrix. The Hidden dimension expansion transformation enlarges the dimension of the representations produced by the transformer layers. Finally, the Layer addition transformation inserts a new layer into the Transformer architecture.
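To make the idea concrete, the following is a minimal sketch of a function-preserving MLP width expansion, assuming a standard two-layer MLP block of the form y = act(x @ W1 + b1) @ W2 + b2; the names W1, b1, W2, b2 and expand_mlp are illustrative, not the paper's notation. New hidden units are connected to the output with zero weights, so the expanded block computes exactly the same function.

```python
import numpy as np

def expand_mlp(W1, b1, W2, new_hidden):
    """Grow the MLP's internal dimension from W1.shape[1] to new_hidden.

    New columns of W1 (and entries of b1) may be initialized arbitrarily;
    the matching new rows of W2 are zero, so the added hidden units
    contribute nothing to the output and the function is preserved exactly.
    """
    d_model, old_hidden = W1.shape
    extra = new_hidden - old_hidden
    W1_new = np.concatenate([W1, np.random.randn(d_model, extra) * 0.02], axis=1)
    b1_new = np.concatenate([b1, np.zeros(extra)])
    W2_new = np.concatenate([W2, np.zeros((extra, d_model))], axis=0)
    return W1_new, b1_new, W2_new

# Quick check that the expanded block computes the same function.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
x = rng.normal(size=(4, d_model))

gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
y_old = gelu(x @ W1 + b1) @ W2 + b2

W1e, b1e, W2e = expand_mlp(W1, b1, W2, new_hidden=64)
y_new = gelu(x @ W1e + b1e) @ W2e + b2
assert np.allclose(y_old, y_new)
```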
For each transformation, the team provides a proof of exact function preservation under minimal initialization constraints. They believe their work will enable larger and more powerful models to be trained more efficiently by progressively expanding the architecture during training.
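As an illustration of how such an initialization constraint can yield exact preservation, here is a hedged sketch of layer addition, assuming a pre-LayerNorm residual block of the form x + f(norm(x)); zero-initializing the block's output projection makes the newly inserted layer an exact identity on the residual stream. All names (residual_block, W_in, W_out) are hypothetical and simplified relative to a full Transformer layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the last axis (no learned scale/offset for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, W_in, W_out):
    # Stand-in for an attention or MLP sub-block: x + relu(norm(x) @ W_in) @ W_out
    return x + np.maximum(layer_norm(x) @ W_in, 0.0) @ W_out

rng = np.random.default_rng(1)
d = 16
x = rng.normal(size=(4, d))

# Inserted layer: W_in can be random, but W_out is zero, so the block adds
# nothing to the residual stream and the overall function is unchanged.
W_in = rng.normal(size=(d, 4 * d)) * 0.02
W_out = np.zeros((4 * d, d))
assert np.allclose(residual_block(x, W_in, W_out), x)
```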
The paper Composable Function-preserving Expansions for Transformer Architectures is available on arXiv.
Author: Hecate He | Editor: Chain Zhang

Pingback: DeepMind & Toulouse U Contribute Composable Function Preserving Transformations to Boost Transformer Training
Pingback: DeepMind & Toulouse U Contribute Composable Function Preserving Transformations to Boost Transformer Training – Ai Headlines