New Study Suggests Self-Attention Layers Could Replace Convolutional Layers on Vision Tasks

Nowhere has AI experienced greater development or breakthroughs in recent years than in the field of natural language processing (NLP) — and “transformers” are the not-so-secret new technology behind this revolution. The key difference between transformers and traditional methods such as recurrent neural networks or convolutional neural networks is that transformers can simultaneously attend to every word of an input text. Transformers’ impressive performance across a wide range of NLP tasks is enabled by a novel attention mechanism which captures meaningful inter-dependencies between words in a sequence by calculating both positional and content-based attention scores.
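As a concrete illustration of the attention mechanism described above, here is a minimal NumPy sketch of single-head self-attention that combines content-based scores (query–key similarity) with optional positional scores. The function names and shapes are illustrative assumptions, not the API of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, pos_scores=None):
    """Single-head self-attention over a sequence X of shape (T, d).

    Attention scores mix content similarity (Q @ K.T) with optional
    positional scores, mirroring the combination of content-based and
    positional attention described in the article.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # content-based scores
    if pos_scores is not None:
        scores = scores + pos_scores          # positional scores
    A = softmax(scores, axis=-1)              # rows sum to 1
    return A @ V
```

Because every row of the score matrix covers the whole sequence, each position attends to every other position simultaneously, which is the key difference from recurrent processing noted above.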

Inspired by the performance of attention mechanisms in NLP, researchers have explored the possibility of applying them to vision tasks. Google Brain Team researcher Prajit Ramachandran proposed that self-attention layers could completely replace convolutional layers on vision tasks while achieving state-of-the-art performance. To test this hypothesis, researchers from Ecole Polytechnique Federale de Lausanne (EPFL) presented theoretical and empirical evidence that self-attention layers can indeed match the performance of convolutional layers.

From a theoretical perspective, the researchers used constructive proof to show that a multi-head self-attention layer can simulate any convolutional layer.
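The flavor of this constructive proof can be conveyed in a small sketch: give the attention layer one head per kernel tap, fix each head's attention pattern to a one-hot distribution over a single relative shift, and weight that head's output by the corresponding filter coefficient. The 1-D toy function below is an illustrative simplification of the paper's construction, not its exact parameterization.

```python
import numpy as np

def conv_as_attention(x, kernel):
    """Emulate a 1-D convolution (cross-correlation, zero-padded) with
    one attention "head" per kernel tap.

    Each head attends deterministically to a fixed relative shift via a
    one-hot attention matrix; its value projection is the corresponding
    kernel weight. Summing the heads reproduces the convolution output.
    """
    T, K = len(x), len(kernel)
    shifts = range(-(K // 2), K // 2 + 1)
    out = np.zeros(T)
    for w, s in zip(kernel, shifts):
        # One-hot attention: query position t attends only to key t + s.
        A = np.zeros((T, T))
        for t in range(T):
            if 0 <= t + s < T:
                A[t, t + s] = 1.0
        out += w * (A @ x)  # head output, scaled by the kernel weight
    return out
```

A 2-D K x K convolution follows the same idea with K^2 heads, one per pixel offset in the receptive field, which is why a multi-head self-attention layer with enough heads can simulate any convolutional layer.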

The researchers set the parameters of a multi-head self-attention layer so that it could act like a convolutional layer. To validate the applicability of the proposed theoretical construction, they then conducted a series of experiments comparing a fully attentional model comprising six multi-head self-attention layers with a standard ResNet18 on the CIFAR-10 dataset.

[Figure: Test accuracy on CIFAR-10]

In the tests, the self-attention models performed reasonably well, except in the configuration using learned embeddings with content-based attention, a shortfall mainly due to the increased number of parameters. The researchers nonetheless confirmed, with both theoretical and empirical support, that any convolutional layer can be expressed by self-attention layers, and that fully-attentional models can learn to combine local behavior and global attention based on input content.

The paper On the Relationship Between Self-Attention and Convolutional Layers is on arXiv.


Author: Hecate He | Editor: Michael Sarazen
