
Tsinghua & NKU’s Visual Attention Network Combines the Advantages of Convolution and Self-Attention, Achieves SOTA Performance on CV Tasks


The powerful self-attention mechanisms in transformer architectures have significantly improved the state-of-the-art across a wide range of natural language processing (NLP) tasks, and more recently, vision transformers (ViT) have been favourably applied to computer vision (CV) tasks. This adaptation, however, introduces three issues: 1) treating images as 1D sequences neglects their 2D structure; 2) the quadratic complexity of self-attention is too expensive for high-resolution images; and 3) self-attention captures spatial adaptability but ignores channel adaptability.

In the new paper Visual Attention Network, a research team from Tsinghua University and Nankai University addresses these issues, proposing a novel large kernel attention (LKA) module and an extremely simple and efficient Visual Attention Network (VAN) that significantly outperforms state-of-the-art ViTs and convolutional neural networks on various CV tasks.

Unlike conventional self-attention mechanisms, LKA is tailored for CV tasks. It is designed to combine the advantages of convolution and self-attention (such as local structural information, long-range dependence and adaptability), while avoiding their disadvantages (such as ignoring channel adaptability). Based on this novel LKA mechanism, the proposed VAN is able to outperform popular CNN- and transformer-based backbones by a large margin.

The team summarizes their paper’s main contributions as:

  1. We design a novel attention mechanism named LKA for computer vision, which considers the pros of both convolution and self-attention, while avoiding their cons. Based on LKA, we further introduce a simple vision backbone called VAN.
  2. We show that VANs outperform the state-of-the-art ViTs and CNNs with a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.

To leverage the pros of self-attention and large kernel convolution while avoiding their cons, the team decomposes a large kernel convolution operation into three components: a spatial local convolution (depth-wise convolution), a spatial long-range convolution (depth-wise dilated convolution), and a channel convolution (1×1 convolution). This decomposition enables the model to capture long-range relationships at a small cost in computation and parameters. Once the long-range relationships have been obtained, the importance of each point can be estimated and an attention map generated.
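
The parameter saving from such a decomposition can be checked with simple arithmetic. The sketch below is not taken from the team's code. It assumes a K×K convolution is split, for a dilation rate d, into a (2d−1)×(2d−1) depth-wise convolution, a ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution and a 1×1 convolution, and it ignores bias terms. The function names and the example values (K = 21, d = 3, 64 channels) are chosen for illustration only.

```python
import math


def standard_conv_params(kernel_size: int, channels: int) -> int:
    """Weights in a dense K x K convolution with C input and C output channels (no bias)."""
    return kernel_size * kernel_size * channels * channels


def decomposed_conv_params(kernel_size: int, dilation: int, channels: int) -> int:
    """Weights in the assumed three-part decomposition (no bias)."""
    local_k = 2 * dilation - 1                        # spatial local conv (depth-wise)
    long_range_k = math.ceil(kernel_size / dilation)  # spatial long-range conv (depth-wise, dilated)
    depthwise = local_k * local_k * channels
    depthwise_dilated = long_range_k * long_range_k * channels
    pointwise = channels * channels                   # channel conv (1x1)
    return depthwise + depthwise_dilated + pointwise


if __name__ == "__main__":
    K, d, C = 21, 3, 64  # illustrative values, not taken from this article
    print("dense 21x21 conv :", standard_conv_params(K, C))       # 1806336
    print("decomposed       :", decomposed_conv_params(K, d, C))  # 8832
```

Under these assumptions, the two depth-wise parts grow only linearly with the channel count, so the 1×1 channel convolution is the only term that scales quadratically.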

In each VAN stage, the input is first downsampled, with the stride controlling the downsample rate. Batch normalization, GELU activation, large kernel attention and a convolutional feed-forward network are then stacked in sequence to extract features. Finally, layer normalization is applied at the end of each stage.
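
As a rough illustration of how the stride sets the downsample rate, the sketch below tracks the feature-map size through a four-stage hierarchy. The strides (4, 2, 2, 2) and the 224×224 input are assumptions made for this example, not values stated in this article. The helper name is hypothetical, and it assumes each downsampling layer is padded so that the output size is the input size divided by the stride.

```python
def stage_resolutions(height: int, width: int, strides: tuple) -> list:
    """Return the (height, width) of the feature map after each stage's downsampling."""
    sizes = []
    for stride in strides:
        # Assumed: padding is chosen so the output is exactly input // stride.
        height, width = height // stride, width // stride
        sizes.append((height, width))
    return sizes


if __name__ == "__main__":
    # Assumed strides and input size, for illustration only.
    print(stage_resolutions(224, 224, (4, 2, 2, 2)))
    # [(56, 56), (28, 28), (14, 14), (7, 7)]
```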

To evaluate the effectiveness of the proposed VAN, the team conducted quantitative experiments on the ImageNet-1K image classification dataset, the COCO object detection dataset, and the ADE20K semantic segmentation dataset.

In the tests, VAN outperformed common CNNs, ViTs and MLPs with similar parameter counts and computational cost on image classification, and surpassed the CNN-based ResNet and the transformer-based PVT by a large margin under the RetinaNet 1x and Mask R-CNN 1x settings. VAN also achieved state-of-the-art performance with different detection methods such as Mask R-CNN and Sparse R-CNN, and on semantic segmentation it delivered excellent performance with fewer parameters and FLOPs than previous state-of-the-art CNN-based and Swin Transformer-based methods.

Overall, VAN achieves state-of-the-art performance on tasks such as image classification, object detection, and semantic segmentation. The researchers plan to further advance VAN via structural improvements and explore its applicability and performance in large-scale self-supervised and transfer learning and as a general model in areas such as NLP.

The code is available on the project’s GitHub. The paper Visual Attention Network is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

