Google Replaces BERT Self-Attention with Fourier Transform: 92% Accuracy, 7 Times Faster on GPUs

Synced

5 years ago

Transformer architectures have come to dominate the natural language processing (NLP) field since their 2017 introduction. A major limitation of transformers is the computational overhead of their key component: a self-attention mechanism whose cost grows quadratically with sequence length.

New research from a Google team proposes replacing the self-attention sublayers with simple linear transformations that “mix” input tokens to significantly speed up the transformer encoder with limited accuracy cost. Even more surprisingly, the team discovers that replacing the self-attention sublayer with a standard, unparameterized Fourier Transform achieves 92 percent of the accuracy of BERT on the GLUE benchmark while training seven times faster on GPUs and twice as fast on TPUs.

Transformers’ self-attention mechanism enables inputs to be represented with higher-order units to flexibly capture diverse syntactic and semantic relationships in natural language. Researchers have long regarded the associated high complexity and memory footprint as an unavoidable trade-off for transformers’ impressive performance. But in the paper FNet: Mixing Tokens with Fourier Transforms, the Google team challenges this thinking with FNet, a novel model that strikes an excellent balance between speed, memory footprint and accuracy.

FNet is a layer-normalized ResNet architecture with multiple layers, each of which consists of a Fourier mixing sublayer followed by a feedforward sublayer. The team replaces the self-attention sublayer of each transformer encoder layer with a Fourier Transform sublayer, which applies 1D Fourier Transforms along both the sequence dimension and the hidden dimension. The result is complex-valued: each entry is the sum of a real part and an imaginary part, the latter being a real number multiplied by the imaginary unit (the number “i” in mathematics, which enables solving equations that do not have real number solutions). Only the real part of the result is kept, eliminating the need to modify the (nonlinear) feedforward sublayers or output layers to handle complex numbers.
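As a rough illustration of the mixing step described above, the following Python sketch applies a discrete Fourier transform along the hidden dimension and then along the sequence dimension of a small input, keeping only the real part. It is not the paper's implementation: the function names dft and fourier_mix are hypothetical, and a naive quadratic-time DFT is used only to keep the example free of dependencies. A practical version would more likely call an FFT routine such as numpy.fft.fft2.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1D sequence (illustrative, not fast)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fourier_mix(x):
    """Mix a (seq_len x hidden_dim) input given as a list of token rows.

    Hypothetical helper: transforms along the hidden dimension (each row),
    then along the sequence dimension (each column), and keeps the real part.
    """
    # 1D transform along the hidden dimension, one token at a time.
    rows = [dft(row) for row in x]
    # 1D transform along the sequence dimension, one hidden unit at a time.
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (seq_len x hidden_dim) and drop the imaginary part.
    return [[cols[j][i].real for j in range(len(cols))] for i in range(len(x))]
```

For the two-token input [[1, 2], [3, 4]], the first output entry is the sum of all four inputs (10). The sublayer has no learned parameters, yet every output position draws on every input token.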

The team chose the Fourier Transform, based on 19th century French mathematician Joseph Fourier’s technique for transforming a function of time into a function of frequency, because they found it a particularly effective mechanism for mixing tokens, one that gives the feedforward sublayers sufficient access to all tokens.

In their evaluations, the team compared multiple models, including BERT-Base, an FNet encoder (every self-attention sublayer is replaced with a Fourier sublayer), a Linear encoder (each self-attention sublayer is replaced with linear sublayers), a Random encoder (each self-attention sublayer is replaced with constant random matrices) and a Feed Forward-only encoder (the self-attention sublayer is removed from the Transformer layers).
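To make the difference between these baselines more concrete, here is a simplified Python sketch of two of the mixing choices: a fixed random matrix applied along the sequence dimension, and no mixing at all. The names matmul, make_random_mixer and no_mixing are hypothetical, and applying a single matrix along the sequence dimension only is an assumption made for brevity, not a description of the paper's exact setup. A Linear encoder could be pictured the same way, with the matrix learned during training instead of held constant.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def make_random_mixer(seq_len, seed=0):
    """Build a mixing function from one constant random matrix (never trained)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(seq_len)] for _ in range(seq_len)]

    def mix(x):
        # Each output token is a weighted sum of all input tokens.
        return matmul(weights, x)

    return mix


def no_mixing(x):
    """Stand-in for the Feed Forward-only case: tokens never see each other."""
    return [row[:] for row in x]
```

With no_mixing, changing one token leaves every other token's output untouched, whereas even a constant random matrix lets information flow between positions.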

The team summarized their results and FNet performance as:

By replacing the attention sublayer with standard, unparameterized Fourier Transform, FNet achieves 92 percent of the accuracy of BERT in a common classification transfer learning setup on the GLUE benchmark, but training is seven times as fast on GPUs and twice as fast on TPUs.
An FNet hybrid model containing only two self-attention sublayers achieves 97 percent of BERT accuracy on the GLUE benchmark, but trains nearly six times as fast on GPUs and twice as fast on TPUs.
FNet is competitive with all the “efficient” transformers evaluated on the Long Range Arena benchmark while having a lighter memory footprint across all sequence lengths.

The study shows that replacing a transformer’s self-attention sublayers with FNet’s Fourier sublayers retains most of BERT’s accuracy while significantly reducing training time, indicating the promising potential of using linear transformations as a replacement for attention mechanisms in text classification tasks.

The paper FNet: Mixing Tokens with Fourier Transforms is on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang
