AI Machine Learning & Data Science Research

Google & Columbia U’s Mnemosyne: Learning to Train Transformers With Transformers

In the new paper Mnemosyne: Learning to Train Transformers with Transformers, a research team from Google and Columbia University presents Mnemosyne Optimizer, a learning-to-learn system for training entire neural network architectures without any task-specific optimizer tuning.

Training deep and complex machine learning (ML) models involves determining the best optimizer and then manually tuning its hyperparameters — a process that is both computationally intensive and time-consuming. Learning-to-learn (L2L) systems have recently emerged as a more efficient alternative to conventional human-engineered ML optimizers.
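To make the learning-to-learn idea concrete, here is a minimal, hypothetical sketch in which a tiny "update network" replaces a hand-tuned rule such as SGD or Adam. The two-layer network, the (gradient, momentum) features, and the step scale are illustrative assumptions only; Mnemosyne itself uses an attention-based memory, not a small MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the learned optimizer itself; in an L2L system these would be
# meta-trained across many tasks rather than initialized randomly as here.
W1 = rng.normal(scale=0.1, size=(2, 8))
W2 = rng.normal(scale=0.1, size=(8, 1))

def learned_update(grad, momentum):
    """Map simple per-parameter features (gradient, momentum) to an update."""
    feats = np.stack([grad, momentum], axis=-1)   # (..., 2)
    hidden = np.tanh(feats @ W1)                  # (..., 8)
    return (hidden @ W2).squeeze(-1)              # same shape as grad

# Optimizee: a toy quadratic loss f(theta) = ||theta - target||^2.
target = np.array([3.0, -2.0])
theta = np.zeros(2)
momentum = np.zeros(2)

for _ in range(100):
    grad = 2.0 * (theta - target)        # analytic gradient of the toy loss
    momentum = 0.9 * momentum + grad     # running statistic fed to the optimizer
    theta += 0.1 * learned_update(grad, momentum)

# With random (un-meta-trained) W1, W2 the trajectory is arbitrary; meta-training
# would tune them so this loop converges quickly on unseen tasks.
print(theta)
```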

A team from Google and Columbia University advances this research direction in the new paper Mnemosyne: Learning to Train Transformers with Transformers, proposing Mnemosyne Optimizer, an L2L system designed to train entire neural network architectures without any task-specific optimizer tuning.

Mnemosyne builds on the scalable low-rank implicit attention memory cells successfully employed in Performer architectures (Choromanski et al., 2021) and on methods for approximating attention via low-rank decomposition of the attention matrix. This design mitigates the quadratic complexity burden of regular attention while enabling Mnemosyne to learn to train an entire neural network architecture.
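As a rough illustration of why the low-rank route is cheaper, the sketch below contrasts regular softmax attention with a linear, feature-map-based variant that never forms the L x L attention matrix. The ReLU feature map is a simple stand-in assumed here for exposition, not Performer's actual random-feature mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 1024, 64                      # sequence length, head dimension
Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))

def softmax_attention(Q, K, V):
    # Regular attention: the explicit L x L score matrix makes the cost
    # quadratic in sequence length.
    scores = Q @ K.T / np.sqrt(d)                          # (L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def phi(X):
    # Simple positive feature map, standing in for Performer's random features.
    return np.maximum(X, 0.0) + 1e-6

def linear_attention(Q, K, V):
    # Apply the feature map and reorder the matrix products so that only
    # small (feature-dim x d) quantities are ever materialized.
    Qf, Kf = phi(Q), phi(K)                                # (L, m), m = d here
    kv = Kf.T @ V                                          # (m, d)
    normalizer = Qf @ Kf.sum(axis=0)                       # (L,)
    return (Qf @ kv) / normalizer[:, None]

full = softmax_attention(Q, K, V)    # O(L^2 * d) time, O(L^2) memory
fast = linear_attention(Q, K, V)     # O(L * d^2) time, no L x L matrix
print(full.shape, fast.shape)        # both (1024, 64)
```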

Conventional transformers can be regarded as differentiable dictionaries that apply powerful associative memory mechanisms with exponential memory capacity. Linear low-rank attention mechanisms, meanwhile, are compact variants of these mechanisms and are therefore better suited to scalable memory systems.
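The associative-memory view can be made concrete in a few lines: below is a minimal sketch of a compact memory whose entire state is a fixed-size matrix plus a normalizer, written to with outer products and read back with the same feature map. The ReLU random-feature map and the dimensions are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_key, d_val, r = 32, 16, 128

# Random positive feature map: a simple stand-in (assumed here) for the
# implicit-attention features used in Performer-style memory cells.
P = rng.normal(size=(d_key, r)) / np.sqrt(d_key)
def phi(x):
    return np.maximum(x @ P, 0.0) + 1e-6

class CompactAssociativeMemory:
    """Fixed-size key-value memory: the state never grows with the number of writes."""
    def __init__(self):
        self.M = np.zeros((r, d_val))   # outer-product store
        self.z = np.zeros(r)            # normalizer

    def write(self, key, value):
        f = phi(key)
        self.M += np.outer(f, value)
        self.z += f

    def read(self, query):
        f = phi(query)
        return (f @ self.M) / (f @ self.z)

mem = CompactAssociativeMemory()
keys = rng.normal(size=(5, d_key))
vals = rng.normal(size=(5, d_val))
for k, v in zip(keys, vals):
    mem.write(k, v)

# Query with a slightly perturbed key; the retrieved vector should be most
# similar to the value originally stored under that key (index 2).
retrieved = mem.read(keys[2] + 0.01 * rng.normal(size=d_key))
sims = vals @ retrieved / (np.linalg.norm(vals, axis=1) * np.linalg.norm(retrieved))
print("best match:", int(np.argmax(sims)))   # expected: 2
```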

The team identifies Mnemosyne’s main benefits as follows:

  1. It generalizes better than popular LSTM optimizers.
  2. It can successfully train vision transformers (ViTs) even when meta-trained only on standard multilayer perceptrons (MLPs).
  3. It can initialize optimizers for faster convergence in robotics applications.

The team also undertakes a theoretical analysis of Mnemosyne’s compact associative memory (CAM), demonstrating that, like its regular non-compact counterparts, CAM is capable of storing and restoring patterns, and that it favourably differentiates itself by doing so in an implicit manner.

In their empirical study, the researchers meta-trained Mnemosyne and evaluated it on neural network training tasks spanning various architectures and datasets. The results show that Mnemosyne can optimize MLPs with different architectures and activation functions and that it converges faster than other optimizers.

This paper introduces Mnemosyne and details its substantial meta-learning capability. The researchers add that, to the best of their knowledge, theirs is also the first approach to provide strong capacity results for CAM mechanisms. They hope their work will stimulate further research on the applications of learnable attention-based optimizers for complex ML training.

Additional Mnemosyne information and samples are available on the Google project page. The paper Mnemosyne: Learning to Train Transformers with Transformers is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
