In less than two years since their introduction, vision transformers (ViTs) have revolutionized the computer vision field, leveraging the transformer architecture's powerful self-attention mechanism to eliminate the need for convolutions and advance the state-of-the-art on image classification tasks. More recently, approaches such as MLP-Mixer and carefully redesigned convolutional neural networks (CNNs) have achieved ViT-comparable performance, and machine learning researchers continue to seek optimal architectural designs for computer vision tasks.
In the new paper Sequencer: Deep LSTM for Image Classification, a research team from Rikkyo University and AnyTech Co., Ltd. examines which inductive biases are best suited to computer vision and proposes Sequencer, an architectural alternative to ViTs that replaces self-attention layers with traditional long short-term memory (LSTM) layers. By mixing spatial information with memory-economical and parameter-saving LSTMs, Sequencer reduces memory cost while achieving ViT-competitive performance on long sequence modelling.

The Sequencer architecture employs bidirectional LSTMs (BiLSTMs) as building blocks and, inspired by Hou et al.'s 2021 Vision Permutator (ViP), processes the vertical and horizontal axes in parallel. The researchers introduce two BiLSTMs that handle the top/bottom and left/right directions in parallel, which reduces sequence length, improves Sequencer's accuracy and efficiency, and yields a spatially meaningful receptive field.
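To make the parallel vertical/horizontal mixing concrete, the following is a minimal PyTorch sketch of such a bidirectional-LSTM spatial-mixing layer. The module name `BiLSTM2D`, the hidden size, and the channel-merging projection here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BiLSTM2D(nn.Module):
    """Sketch of parallel vertical/horizontal BiLSTM spatial mixing.
    Names and the fusion projection are assumptions for illustration."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        # One BiLSTM scans columns (top/bottom), the other scans rows (left/right).
        self.lstm_v = nn.LSTM(dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lstm_h = nn.LSTM(dim, hidden_dim, batch_first=True, bidirectional=True)
        # Fuse the two directional outputs back to the token dimension.
        self.proj = nn.Linear(4 * hidden_dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map of patch tokens.
        B, H, W, C = x.shape
        # Vertical pass: treat each column as a sequence of length H.
        v = x.permute(0, 2, 1, 3).reshape(B * W, H, C)
        v, _ = self.lstm_v(v)                                # (B*W, H, 2*hidden)
        v = v.reshape(B, W, H, -1).permute(0, 2, 1, 3)
        # Horizontal pass: treat each row as a sequence of length W.
        h = x.reshape(B * H, W, C)
        h, _ = self.lstm_h(h)                                # (B*H, W, 2*hidden)
        h = h.reshape(B, H, W, -1)
        # Concatenate both directions and project back to C channels.
        return self.proj(torch.cat([v, h], dim=-1))
```

Because each BiLSTM only sees a sequence of length H or W rather than H×W, the per-layer sequence length stays short even for large feature maps, which is the efficiency argument made above.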

Sequencer takes non-overlapping patches as input and projects them onto a feature map. The Sequencer block has two sub-components: 1) a BiLSTM layer that mixes spatial information globally and memory-economically, and 2) a multi-layer perceptron (MLP) for channel mixing. As in existing architectures, the output of the last block is passed through a global average pooling layer and sent to a linear classifier.
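Building on the sketch above, this hedged example shows how the two sub-components of a Sequencer block could be composed; the pre-norm residual layout and MLP expansion ratio are assumptions rather than the paper's exact configuration.

```python
class SequencerBlock(nn.Module):
    """Sketch of one Sequencer block: BiLSTM2D spatial mixing + MLP channel mixing.
    The pre-norm residual layout and MLP expansion ratio are assumptions."""

    def __init__(self, dim, hidden_dim, mlp_ratio=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mix = BiLSTM2D(dim, hidden_dim)   # spatial (token) mixing
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(              # channel mixing
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        # x: (B, H, W, C) feature map built from non-overlapping patch embeddings.
        x = x + self.spatial_mix(self.norm1(x))
        x = x + self.channel_mlp(self.norm2(x))
        return x


# After the final block, the feature map is averaged over its spatial
# dimensions and fed to a linear classifier, as described above:
#   pooled = x.mean(dim=(1, 2))                        # global average pooling -> (B, C)
#   logits = nn.Linear(dim, num_classes)(pooled)       # num_classes is task-dependent
```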


In their empirical study, the team compared the proposed Sequencer against CNN, ViT, MLP-based and FFT-based architectures with comparable parameter counts on the ImageNet-1K benchmark dataset, and also tested its transfer learning capabilities. Sequencer achieved an impressive 84.6 percent top-1 accuracy in the evaluations, outperforming ConvNeXt-S and Swin-S by 0.3 and 0.2 percent, respectively, while also demonstrating good transferability and robust resolution adaptability.
The team hopes their work will provide new insights into and improve understanding of the role of various inductive biases in computer vision, and inspire further research on optimal architecture designs in this growing field.
The paper Sequencer: Deep LSTM for Image Classification is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
