
Google, Cambridge, DeepMind & Alan Turing Institute’s ‘Performer’ Transformer Slashes Compute Costs


It’s no coincidence that the Transformer neural network architecture is gaining popularity across so many machine learning research fields. Best known for natural language processing (NLP) tasks, Transformers not only enabled OpenAI’s 175-billion-parameter language model GPT-3 to deliver state-of-the-art (SOTA) performance; the architecture also helped DeepMind’s AlphaStar bot defeat professional StarCraft players. Researchers have now introduced a way to make Transformers more compute-efficient, scalable and accessible.

While earlier sequence models such as RNNs suffered from vanishing gradient problems, Transformers’ self-attention mechanism largely sidesteps them by connecting every pair of positions directly. As explained in the paper that introduced Transformers, Attention Is All You Need, the architecture is based on a trainable attention mechanism that identifies complex dependencies between input sequence elements.

The cost of the attention mechanism, however, scales quadratically with the number of tokens in the input sequence, making Transformers prohibitively expensive for long sequences. Even with moderately long inputs, Transformers’ gluttonous appetite for computational resources can be difficult for many researchers to satisfy.
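To make the quadratic-versus-linear contrast concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function names, the head dimension of 64 and the 256 random features are illustrative assumptions, and the counts cover only the main matrix products for a single attention head.

```python
def standard_attention_cost(seq_len, head_dim):
    """Approximate multiply-adds for one head of regular softmax attention."""
    scores = seq_len * seq_len * head_dim        # Q @ K^T
    weighted_sum = seq_len * seq_len * head_dim  # softmax(scores) @ V
    return scores + weighted_sum


def linear_attention_cost(seq_len, head_dim, num_features):
    """Approximate multiply-adds for one head of random-feature attention."""
    feature_maps = 2 * seq_len * num_features * head_dim   # project Q and K
    key_value_summary = seq_len * num_features * head_dim  # phi(K)^T @ V
    readout = seq_len * num_features * head_dim            # phi(Q) @ summary
    return feature_maps + key_value_summary + readout


# Illustrative settings only: head_dim=64 and num_features=256 are assumptions.
for seq_len in (1_024, 4_096, 16_384):
    quadratic = standard_attention_cost(seq_len, head_dim=64)
    linear = linear_attention_cost(seq_len, head_dim=64, num_features=256)
    print(f"L={seq_len:>6}: standard={quadratic:>15,} linear={linear:>15,}")
```

Under these assumptions, quadrupling the sequence length multiplies the standard count by 16 but the linear count by only 4.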

A team from Google, University of Cambridge, DeepMind, and Alan Turing Institute has proposed a new type of Transformer dubbed Performer, based on a Fast Attention Via positive Orthogonal Random features (FAVOR+) backbone mechanism. The team designed Performer to be “capable of provably accurate and practical estimation of regular (softmax) full rank attention, but of only linear space and time complexity and not relying on any priors such as sparsity or low-rankness.”


The softmax inside the attention mechanism has been a computational bottleneck for Transformers. Regular attention applies softmax to the matrix of query-key scores, which means explicitly building an attention matrix whose size grows quadratically with sequence length. The proposed FAVOR+ mechanism instead approximates the softmax and Gaussian kernels with positive orthogonal random features, giving a robust and unbiased estimate of regular softmax attention without ever forming that matrix. The research confirms that positive features make it possible to train softmax-based linear Transformers efficiently.
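The idea of estimating softmax attention with positive random features can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors’ implementation: it draws independent Gaussian projections rather than the orthogonal ones FAVOR+ uses, it handles a single head with no masking, and all function and variable names are hypothetical. It relies on the identity that, for a standard Gaussian vector w, the expectation of exp(w·q - |q|²/2) · exp(w·k - |k|²/2) equals exp(q·k).

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def positive_features(x, projections):
    """phi(x)_j = exp(w_j . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    half_sq_norm = 0.5 * dot(x, x)
    scale = 1.0 / math.sqrt(len(projections))
    return [scale * math.exp(dot(w, x) - half_sq_norm) for w in projections]


def softmax_attention(Q, K, V):
    """Reference version: computes every query-key score explicitly."""
    d = len(Q[0])
    out = []
    for q in Q:
        weights = [math.exp(dot(q, k) / math.sqrt(d)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def random_feature_attention(Q, K, V, projections):
    """Approximate softmax attention in time linear in the number of tokens."""
    d, m, d_v = len(Q[0]), len(projections), len(V[0])
    s = d ** -0.25  # splits the usual 1/sqrt(d) scaling between q and k
    kv = [[0.0] * d_v for _ in range(m)]  # running phi(K)^T V
    k_sum = [0.0] * m                     # running phi(K)^T 1
    for k, v in zip(K, V):
        phi_k = positive_features([s * x for x in k], projections)
        for j in range(m):
            k_sum[j] += phi_k[j]
            for c in range(d_v):
                kv[j][c] += phi_k[j] * v[c]
    out = []
    for q in Q:
        phi_q = positive_features([s * x for x in q], projections)
        normaliser = dot(phi_q, k_sum)
        out.append([sum(phi_q[j] * kv[j][c] for j in range(m)) / normaliser
                    for c in range(d_v)])
    return out


# Toy comparison; sizes and value ranges are arbitrary illustrative choices.
rng = random.Random(0)
L, d, m = 6, 4, 2000
Q = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(L)]
K = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(L)]
V = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(L)]
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]  # i.i.d., not orthogonal

exact = softmax_attention(Q, K, V)
approx = random_feature_attention(Q, K, V, W)
max_err = max(abs(a - e)
              for row_a, row_e in zip(approx, exact)
              for a, e in zip(row_a, row_e))
print(f"max abs error with {m} features: {max_err:.4f}")
```

The keys and values are folded into two small summaries (`kv` and `k_sum`) in a single pass, so the token-by-token attention matrix is never built. Because every feature is positive, the normaliser stays positive, which is one reason positive features are attractive for this kind of estimator.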


Backed by detailed mathematical theorems, the paper demonstrates that rather than relying solely on more computational resources to boost performance, it is also possible to develop more efficient Transformer architectures with significantly lower energy consumption. Also, because Performers use the same training hyperparameters as Transformers, the FAVOR+ mechanism can function as a simple drop-in replacement without much tuning.

The team tested Performers on a rich set of tasks ranging from pixel prediction to protein sequence modelling. In their experimental setup, a Performer replaced only the attention component of a regular Transformer with the FAVOR+ mechanism. On the challenging task of training a 36-layer model using protein sequences, the Performer-based model (Performer-ReLU) achieved better performance than the baseline Transformer models Reformer and Linformer, which showed significant drops in accuracy. On the standard ImageNet64 benchmark, a Performer with six layers matched the accuracy of a Reformer with 12 layers. After optimizations, Performer was also twice as fast as Reformer.

Because Performer-enabled scalable Transformer architectures can handle much longer sequences without constraints on attention mechanism structure while remaining accurate and robust, it is believed they could lead to breakthroughs in bioinformatics, where technologies such as language modelling for proteins have already shown strong potential.

The paper Rethinking Attention With Performers is on arXiv.


Reporter: Fangyu Cai | Editor: Michael Sarazen

