Convolutional neural networks (CNNs) and self-attention are two of today’s most popular techniques for representation learning. Both approaches have achieved SOTA results across a wide range of computer vision (CV) tasks, but they are considered distinct, as they are informed by different design paradigms.
Considering the differing yet often complementary properties of convolution and self-attention techniques, an integration method that could combine the benefits of both paradigms would be highly attractive for CV researchers.
In the new paper On the Integration of Self-Attention and Convolution, a research team from Tsinghua University, Huawei Technologies Ltd. and the Beijing Academy of Artificial Intelligence proposes ACmix, a mixed model that leverages the benefits of both self-attention and convolution for CV representation tasks while incurring minimal computational overhead compared to its pure convolution or self-attention counterparts.

The team summarizes ACmix’s contributions as:
- A strong underlying relation between self-attention and convolution is revealed, providing new perspectives on the connection between the two modules and inspiration for designing new learning paradigms.
- An elegant integration of the self-attention and convolution modules, which enjoys the benefits of both worlds, is presented. Empirical evidence demonstrates that the hybrid model consistently outperforms its pure convolution or self-attention counterparts.

The team first revisits and captures the relationships between convolution and self-attention by decomposing their operations into separate stages.
Convolution is an essential component of contemporary ConvNets and can be summarized in two stages. In the first stage, the input feature map is linearly projected by the kernel weight at each kernel position; in the second stage, the projected feature maps are shifted according to their kernel positions and aggregated together.
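The minimal PyTorch sketch below (not the paper’s code) illustrates this decomposition for a 3×3 kernel: Stage I applies one 1×1 projection per kernel position, and Stage II shifts each projected map by its kernel offset and sums the results. Circular padding is used in the reference convolution so the shift-and-sum matches it exactly at the borders; the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C_in, C_out, H, W, k = 2, 4, 8, 16, 16, 3
x = torch.randn(N, C_in, H, W)
weight = torch.randn(C_out, C_in, k, k)

# Reference: an ordinary 3x3 convolution (circular padding so the shifts below are exact).
y_ref = F.conv2d(F.pad(x, (1, 1, 1, 1), mode='circular'), weight)

# Stage I + Stage II: project with 1x1 kernels, then shift by the kernel offset and aggregate.
y_dec = torch.zeros(N, C_out, H, W)
for p in range(k):
    for q in range(k):
        proj = F.conv2d(x, weight[:, :, p, q, None, None])                  # Stage I: 1x1 projection
        y_dec += torch.roll(proj, shifts=(-(p - 1), -(q - 1)), dims=(2, 3))  # Stage II: shift + sum

print(torch.allclose(y_ref, y_dec, atol=1e-4))  # True
```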
Unlike convolution, self-attention enables a model to focus on important regions within a larger context. Multi-head self-attention can also be decomposed into two stages. In the first stage, 1×1 projections map the input feature into queries, keys and values; in the second stage, attention weights are computed and used to aggregate the value matrices, gathering features from local regions.
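A single-head version of this two-stage view can be sketched as follows (again illustrative rather than the authors’ code). The paper’s setting restricts Stage II to a local window around each pixel; this sketch uses global attention for brevity.

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 8, 16, 16
x = torch.randn(N, C, H, W)
w_q, w_k, w_v = (torch.randn(C, C, 1, 1) for _ in range(3))

# Stage I: 1x1 projections -- identical in form to the convolution's Stage I.
q = F.conv2d(x, w_q).flatten(2).transpose(1, 2)   # (N, H*W, C)
k = F.conv2d(x, w_k).flatten(2).transpose(1, 2)
v = F.conv2d(x, w_v).flatten(2).transpose(1, 2)

# Stage II: compute attention weights and aggregate the values.
attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)   # (N, H*W, H*W)
out = (attn @ v).transpose(1, 2).reshape(N, C, H, W)
print(out.shape)  # torch.Size([2, 8, 16, 16])
```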

An analysis of this decomposition of self-attention and convolution modules from various perspectives reveals deeper relationships between the two, which the team summarizes as:
- Convolution and self-attention practically share the same operation of projecting the input feature maps through 1×1 convolutions, and these projections account for most of the computational overhead of both modules.
- Although crucial for capturing semantic features, the aggregation operations in Stage II are lightweight and do not require additional learnable parameters.
These observations suggest that an integration of the convolution and self-attention modules could be both natural and productive. Because the two methods share the same 1×1 convolution operations, the researchers were able to perform the projection only once and reuse the resulting intermediate feature maps for the different aggregation operations.

The proposed ACmix therefore also has two stages. In the first stage, the input feature is projected by three 1×1 convolutions and reshaped into pieces. In the second stage, the self-attention path uses the three feature pieces in each group of intermediate features as queries, keys and values, while the convolution path shifts and aggregates the generated features, processing the input in a convolutional manner and gathering information from a local receptive field. Finally, the outputs of the two paths are combined, weighted by learnable scalars.
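A simplified module-level sketch of this design is shown below. It is not the official implementation (which will be released on the project’s GitHub): the attention path is reduced to single-head global attention, the convolution path recombines the shared projections with a light 1×1 layer before the shift-and-sum, and the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACmixSketch(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Stage I: shared 1x1 projections.
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        # Convolution path: recombine the three projected maps into k*k maps per channel.
        self.fc = nn.Conv2d(3 * channels, kernel_size * kernel_size * channels, 1)
        # Learnable scalars that weight the two paths.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        N, C, H, W = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)             # Stage I (shared)

        # Stage II, self-attention path: attention weights + value aggregation.
        q_, k_, v_ = (t.flatten(2).transpose(1, 2) for t in (q, k, v))       # (N, H*W, C)
        attn = torch.softmax(q_ @ k_.transpose(1, 2) / C ** 0.5, dim=-1)
        out_att = (attn @ v_).transpose(1, 2).reshape(N, C, H, W)

        # Stage II, convolution path: shift the recombined maps by kernel offsets and sum.
        feats = self.fc(torch.cat([q, k, v], dim=1)).view(N, self.k * self.k, C, H, W)
        out_conv = torch.zeros_like(x)
        r = self.k // 2
        for idx in range(self.k * self.k):
            dy, dx = idx // self.k - r, idx % self.k - r
            out_conv += torch.roll(feats[:, idx], shifts=(-dy, -dx), dims=(2, 3))

        # Combine the two paths with the learnable scalars.
        return self.alpha * out_att + self.beta * out_conv

print(ACmixSketch(8)(torch.randn(2, 8, 16, 16)).shape)  # torch.Size([2, 8, 16, 16])
```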
A highly desirable benefit of the ACmix approach is that its computational cost is lower than that of pure convolution or self-attention methods. In the first stage, ACmix’s cost is the same as self-attention’s and lower than that of a traditional convolution, while in the second stage, its computational complexity is linear in the channel size and relatively minor compared to Stage I.
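As a rough back-of-the-envelope illustration (using the common convention that a convolution layer costs H·W·C_in·C_out·k² multiply-accumulates; the paper’s exact accounting differs in detail, so the numbers below are illustrative only):

```python
# Rough Stage I cost comparison; numbers are illustrative only.
H = W = 56      # feature-map size
C = 256         # channels
k = 3           # kernel size

conv_proj  = H * W * C * C * k * k   # projection cost of a k x k convolution
acmix_proj = 3 * H * W * C * C       # three shared 1x1 projections (ACmix / self-attention)

print(f"k x k convolution projection: {conv_proj / 1e9:.2f}e9 MACs")
print(f"ACmix / self-attention proj : {acmix_proj / 1e9:.2f}e9 MACs")
# The Stage II operations (shifts, softmax over a local window, weighted sums)
# add only terms that grow linearly with C, so Stage I dominates the total cost.
```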
In their empirical analysis, the team compared ACmix with SOTA models on ImageNet classification, semantic segmentation, and object detection tasks.


On the image classification tasks, ResNet-ACmix outperformed all baselines with comparable FLOPs or parameters. ResNet-ACmix 26 achieved the same top-1 accuracy as SASA-ResNet 50 with 80 percent of the FLOPs, and SAN-ACmix 15 outperformed SAN 19 with 80 percent of the FLOPs. PVT-ACmix-T, meanwhile, achieved performance comparable to PVT-Large with only 40 percent of the FLOPs. ACmix also consistently outperformed baselines with similar parameters or FLOPs on semantic segmentation and object detection tasks.
Overall, this work demonstrates the effectiveness and efficiency of the proposed ACmix, and the team hopes their study can provide new insights and inspire new research directions aimed at the integration of self-attention and convolution.
The code and pretrained models will be released on the project’s GitHub. The paper On the Integration of Self-Attention and Convolution is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.