Transformers on Edge Devices? Monash U’s Energy-Saving Attention With Linear Complexity Reduces Compute Cost by 73%

While transformer architectures have achieved remarkable success in recent years thanks to their impressive representational power, their quadratic complexity entails a prohibitively high energy consumption that hinders their deployment in many real-life applications, especially on resource-constrained edge devices.

A Monash University research team addresses this issue in the new paper EcoFormer: Energy-Saving Attention with Linear Complexity, proposing an attention mechanism with linear complexity — EcoFormer — that replaces expensive multiply-accumulate operations with simple accumulations and achieves a 73 percent energy footprint reduction on ImageNet.

The team summarizes their main contributions as follows:

We propose a new binarization paradigm to better preserve the pairwise similarity in softmax attention. In particular, we present EcoFormer, an energy-saving attention with linear complexity powered by kernelized hashing to map the queries and keys into compact binary codes.
We learn the kernelized hash functions based on the ground truth Hamming affinity extracted from the attention scores in a self-supervised way.
Extensive experiments on CIFAR-100, ImageNet-1K, and Long Range Arena show that EcoFormer is able to significantly reduce energy costs while preserving accuracy.

The basic idea informing this work is to reduce attention’s high cost by applying binary quantization to the kernel embeddings to replace energy-expensive multiplications with energy-efficient bit-wise operations. The researchers note however that conventional binary quantization methods focus only on minimizing the quantization error between the full-precision and binary values, which fails to preserve the pairwise semantic similarity among attention’s tokens and thus negatively impacts performance.

To mitigate this issue, the team introduces a novel binarization method that uses kernelized hashing with a Gaussian Radial Basis Function (RBF) to map the original high-dimensional query/key pairs to low-dimensional similarity-preserving binary codes. EcoFormer effectively leverages this binarization method to maintain semantic similarity in attention while approximating the self-attention in linear time with a lower energy cost.

In their empirical study, the team compared the proposed EcoFormer with standard multi-head self-attention (MSA) on ImageNet1K. The results show that EcoFormer can reduce energy consumption by 73 percent while incurring only a 0.33 percent performance tradeoff.

Overall, the proposed EcoFormer energy-saving attention mechanism with linear complexity represents a promising approach for alleviating the cost bottleneck that has limited the deployment of transformer models. In future work, the team plans to explore binarizing transformers’ value vectors in attention, multi-layer perceptrons and non-linearities to further reduce energy cost; and to extend EcoFormer to NLP tasks such as machine translation and speech analysis.

The Code will be available on the project’s GitHub. The paper EcoFormer: Energy-Saving Attention with Linear Complexity is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

Transformers on Edge Devices? Monash U’s Energy-Saving Attention With Linear Complexity Reduces Compute Cost by 73%

Like this:

0 comments on “Transformers on Edge Devices? Monash U’s Energy-Saving Attention With Linear Complexity Reduces Compute Cost by 73%”

Leave a Reply Cancel reply

Related

Share this:

Like this:

0 comments on “Transformers on Edge Devices? Monash U’s Energy-Saving Attention With Linear Complexity Reduces Compute Cost by 73%”

Leave a Reply Cancel reply

Related