Transformers Scale to Long Sequences With Linear Complexity Via Nyström-Based Self-Attention Approximation

Synced

In the early days of NLP research, establishing long-term dependencies brought with it the vanishing gradient problem, as early models processed input sequences one token at a time, without parallelization. More recently, transformer-based architectures and their self-attention mechanisms have enabled interactions between token pairs across full sequences, modelling arbitrary dependencies in a constant number of layers to achieve state-of-the-art performance across many NLP tasks.

These advantages, however, come at a high cost: transformer-based networks’ memory and computational requirements grow quadratically with sequence length, creating major efficiency bottlenecks when dealing with long sequences. In the new paper Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention, researchers from the University of Wisconsin-Madison, UC Berkeley, Google Brain and American Family Insurance propose Nyströmformer, an approximation of self-attention that is O(n) in both memory and time and is designed to reduce the quadratic cost associated with long input sequences.

The Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix. The researchers’ proposed method uses a Nyström approximation tailored to the softmax matrix to reduce the complexity of self-attention computation from O(n^2) to O(n).
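Written out, the approximation takes roughly the following form. The notation is assumed here rather than taken from the article: Q and K are the n × d query and key matrices, Q̃ and K̃ are m × d matrices of landmark points with m much smaller than n, and the superscript + denotes a Moore-Penrose pseudoinverse.

$$
\hat{S} \;=\;
\mathrm{softmax}\!\left(\frac{Q \tilde{K}^{\top}}{\sqrt{d}}\right)
\left(\mathrm{softmax}\!\left(\frac{\tilde{Q} \tilde{K}^{\top}}{\sqrt{d}}\right)\right)^{+}
\mathrm{softmax}\!\left(\frac{\tilde{Q} K^{\top}}{\sqrt{d}}\right)
$$

The three factors have shapes n × m, m × m and m × n. For a fixed number of landmarks m, forming them and multiplying them against the value matrix from right to left costs time and memory that grow linearly in n, and the full n × n matrix is never materialized.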

*A Nyström approximation of a softmax matrix in self-attention*

*Pipeline for Nyström approximation of softmax matrix in self-attention*

The basic idea behind the algorithm is to first define the matrix form of landmarks, then use these to form the three matrices needed for approximation. The landmarks are selected before the softmax operation to generate the approximation, which avoids calculating the full softmax matrix S. For a fixed number of landmarks, the Nyström approximation thus scales linearly (O(n) complexity) with input sequence length in terms of both memory and time.
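The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`segment_means`, `nystrom_attention`, `exact_attention`) are hypothetical, landmarks are taken as plain averages over equal-length segments, the sequence length is assumed divisible by the number of landmarks, and the middle matrix is inverted with NumPy's exact `np.linalg.pinv` rather than an approximate pseudoinverse.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def segment_means(x, num_landmarks):
    """Average consecutive rows of x (n, d) into num_landmarks landmark rows.

    Assumption for simplicity: n is divisible by num_landmarks.
    """
    n, d = x.shape
    return x.reshape(num_landmarks, n // num_landmarks, d).mean(axis=1)


def nystrom_attention(q, k, v, num_landmarks):
    """Approximate softmax(q k^T / sqrt(d)) v without forming the n x n matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    q_land = segment_means(q, num_landmarks)   # (m, d) landmark queries
    k_land = segment_means(k, num_landmarks)   # (m, d) landmark keys

    # The three matrices of the approximation; none of them is n x n.
    f = softmax(q @ k_land.T * scale)          # (n, m)
    a = softmax(q_land @ k_land.T * scale)     # (m, m)
    b = softmax(q_land @ k.T * scale)          # (m, n)

    # Exact pseudoinverse (an assumption of this sketch); the products are
    # evaluated right to left so every intermediate stays small.
    return f @ (np.linalg.pinv(a) @ (b @ v))


def exact_attention(q, k, v):
    """Standard softmax attention, quadratic in sequence length."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


rng = np.random.default_rng(0)
n, d, m = 64, 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
approx = nystrom_attention(q, k, v, num_landmarks=m)
exact = exact_attention(q, k, v)
print(approx.shape, float(np.abs(approx - exact).max()))
```

A useful sanity check on the construction: when the number of landmarks equals the sequence length, every segment contains a single token, all three matrices coincide with the full softmax matrix, and the approximation reproduces standard attention up to floating-point error.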

*Proposed architecture of efficient self-attention via Nyström approximation*

Given an input key K and query Q, the proposed Nyströmformer first uses segment-means to compute landmark points. Based on the landmark points, the architecture then calculates the Nyström approximation using an approximate Moore-Penrose pseudoinverse.
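To give a feel for what an approximate pseudoinverse can look like, here is a hedged sketch of an iterative scheme that uses only matrix products, which map well onto GPUs. The update rule and initialization below are taken to match the scheme described in the paper and should be checked against it before being relied on; the function name `iterative_pinv`, the iteration count and the small test matrix are illustrative assumptions.

```python
import numpy as np


def iterative_pinv(a, n_iter=6):
    """Approximate the (pseudo)inverse of a square matrix using only matrix products."""
    identity = np.eye(a.shape[0])
    # Scale the initial guess so the iteration starts inside its convergence region.
    norm_1 = np.abs(a).sum(axis=0).max()      # maximum absolute column sum
    norm_inf = np.abs(a).sum(axis=1).max()    # maximum absolute row sum
    z = a.T / (norm_1 * norm_inf)
    for _ in range(n_iter):
        az = a @ z
        # Z <- 1/4 * Z (13 I - AZ (15 I - AZ (7 I - AZ)))
        z = 0.25 * z @ (13 * identity - az @ (15 * identity - az @ (7 * identity - az)))
    return z


# A small row-stochastic matrix standing in for the m x m softmax block.
a = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
z = iterative_pinv(a, n_iter=10)
print(float(np.abs(z - np.linalg.pinv(a)).max()))
```

Because the matrix being inverted is only m × m, where m is the number of landmarks, this step does not depend on the sequence length.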

To evaluate the model, the researchers conducted experiments in a transfer learning setting in two stages. In the first, Nyströmformer was trained on BookCorpus and English Wikipedia data. Next, the pretrained Nyströmformer was fine-tuned for different NLP tasks on the GLUE (General Language Understanding Evaluation) benchmark datasets (SST-2, MRPC, QNLI, QQP and MNLI) and IMDB reviews. For both stages, the baseline was the popular transformer model BERT.

*Memory consumption and running time results on various input sequence lengths*

*Results on natural language understanding tasks. F1 score for MRPC and QQP and accuracy for others.*

The results show that Nyströmformer offers favourable memory and time efficiency over standard self-attention and Longformer self-attention, and performs competitively with the BERT-base model. Overall, Nyströmformer provides self-attention approximation with high efficiency, a big step towards running transformer models on very long sequences.

The paper Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

