LSTM Is Back! A Deep Implementation of the Decades-old Architecture Challenges ViTs on Long Sequence Modelling

In the less than two years since their introduction, vision transformers (ViTs) have revolutionized the computer vision field, leveraging transformer architectures’ powerful self-attention mechanisms to eliminate the need for convolutions and advance the state-of-the-art on image classification tasks. More recently, approaches such as MLP-Mixer and carefully redesigned convolutional neural networks (CNNs) have achieved ViT-comparable performance, and machine learning researchers continue to seek optimal architectural designs for computer vision tasks.

In the new paper Sequencer: Deep LSTM for Image Classification, a research team from Rikkyo University and AnyTech Co., Ltd. examines the suitability of different inductive biases for computer vision and proposes Sequencer, an architectural alternative to ViTs that uses traditional long short-term memory (LSTM) layers rather than self-attention layers. Sequencer reduces memory cost by mixing spatial information with memory-economical and parameter-saving LSTMs, and achieves ViT-competitive performance on long sequence modelling.

The Sequencer architecture employs bidirectional LSTM (BiLSTM) as a building block and, inspired by Hou et al.’s 2021 Vision Permutator (ViP), processes the vertical and horizontal axes in parallel. The researchers introduce two BiLSTMs, one for the top/bottom direction and one for the left/right direction, so that each handles a single row or column of the feature map rather than the whole flattened image. The resulting shorter sequences improve Sequencer’s accuracy and efficiency, and the arrangement yields a spatially meaningful receptive field.
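To make the sequence-length argument concrete, the short Python sketch below compares the two layouts on a toy feature map. The 14 x 14 grid size and the helper names are assumptions chosen for illustration and do not come from the paper.

```python
def sequence_layouts(height, width):
    """Compare token-sequence layouts for a height x width feature map."""
    # Flattening the whole map gives one long sequence.
    flat = {"num_sequences": 1, "length": height * width}
    # Top/bottom processing: one sequence per column, each `height` tokens long.
    vertical = {"num_sequences": width, "length": height}
    # Left/right processing: one sequence per row, each `width` tokens long.
    horizontal = {"num_sequences": height, "length": width}
    return flat, vertical, horizontal


def rows_and_columns(grid):
    """Split a 2D grid (list of rows) into its row and column sequences."""
    rows = [list(row) for row in grid]
    cols = [list(col) for col in zip(*grid)]
    return rows, cols


# Hypothetical 14 x 14 grid of patch tokens (an assumed size).
flat, vertical, horizontal = sequence_layouts(14, 14)
print(flat["length"])                            # 196
print(vertical["length"], horizontal["length"])  # 14 14

rows, cols = rows_and_columns([[1, 2, 3], [4, 5, 6]])
print(rows)  # [[1, 2, 3], [4, 5, 6]]
print(cols)  # [[1, 4], [2, 5], [3, 6]]
```

Under these assumptions, each recurrent pass steps through 14 tokens instead of 196, and the many short sequences can be processed as one batch.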

Sequencer takes non-overlapping patches as input and projects them onto a feature map. The Sequencer block has two sub-components: 1) a BiLSTM layer that mixes spatial information globally and memory-economically, and 2) a multi-layer perceptron (MLP) for channel-mixing. As with existing architectures, the output of the last block is sent to a linear classifier via a global average pooling layer.
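The pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the class names, the layer sizes, the LayerNorm and residual connections, and the linear layer that fuses the two BiLSTM outputs are all assumptions borrowed from the usual ViT-style block layout.

```python
import torch
import torch.nn as nn


class BiLSTM2D(nn.Module):
    """Mixes spatial information along columns and rows with two BiLSTMs."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.vertical = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.horizontal = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Each BiLSTM emits 2 * hidden features; both together give 4 * hidden.
        self.fuse = nn.Linear(4 * hidden, dim)  # assumed fusion layer

    def forward(self, x):  # x: (B, H, W, C)
        b, h, w, c = x.shape
        # Treat every column as a sequence of length H.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)
        v, _ = self.vertical(cols)
        v = v.reshape(b, w, h, -1).permute(0, 2, 1, 3)
        # Treat every row as a sequence of length W.
        rows = x.reshape(b * h, w, c)
        hz, _ = self.horizontal(rows)
        hz = hz.reshape(b, h, w, -1)
        return self.fuse(torch.cat([v, hz], dim=-1))


class SequencerBlock(nn.Module):
    """BiLSTM spatial mixing followed by an MLP for channel mixing."""

    def __init__(self, dim, hidden, mlp_ratio=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # normalization is an assumption
        self.mix = BiLSTM2D(dim, hidden)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        x = x + self.mix(self.norm1(x))  # residual connection (assumed)
        return x + self.mlp(self.norm2(x))


class TinySequencer(nn.Module):
    """Patch embedding, stacked blocks, global average pooling, linear head."""

    def __init__(self, num_classes=10, patch=8, dim=32, hidden=16, depth=2):
        super().__init__()
        # Kernel size equal to stride yields non-overlapping patches.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.Sequential(
            *[SequencerBlock(dim, hidden) for _ in range(depth)]
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):  # images: (B, 3, height, width)
        x = self.embed(images).permute(0, 2, 3, 1)  # (B, H, W, C)
        x = self.blocks(x)
        x = x.mean(dim=(1, 2))  # global average pooling over H and W
        return self.head(x)


model = TinySequencer()
logits = model(torch.randn(2, 3, 32, 32))
print(tuple(logits.shape))  # (2, 10)
```

With 32 x 32 inputs and 8 x 8 patches, the feature map in this toy setup is 4 x 4, so each BiLSTM sees sequences of only four tokens. The sizes are deliberately tiny and are not the configurations evaluated in the paper.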

In their empirical study, the team compared the proposed Sequencer with CNNs, ViTs, and MLP- and FFT-based model architectures with comparable numbers of parameters on the ImageNet-1K benchmark dataset, and also tested its transfer learning capabilities. Sequencer achieved an impressive 84.6 percent top-1 accuracy in the evaluations, bettering ConvNeXt-S and Swin-S by 0.3 and 0.2 percentage points, respectively, and also demonstrated good transferability and robust resolution adaptability.

The team hopes their work will provide new insights into the role of various inductive biases in computer vision and inspire further research on optimal architecture designs in this growing field.

The paper Sequencer: Deep LSTM for Image Classification is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

