In the five years since their introduction, transformer architectures have come to dominate the natural language processing research field. Recently, vision transformers (ViT) have also demonstrated their power and potential across a wide range of computer vision tasks. While the success of transformers is largely attributed to their unique self-attention (SA) mechanism, SA’s quadratic complexity over the number of visual tokens disadvantages transformers when handling high-resolution image inputs.
In the new paper Focal Modulation Networks, a Microsoft Research team proposes FocalNet (Focal Modulation Network), a simple and attention-free architecture designed to replace transformers’ SA module. FocalNets enable input-dependent token interactions for visual modelling and exhibit remarkable superiority over SA, opening a promising alternative avenue for effective and efficient visual modelling in real-world applications.

The paper first presents an analysis of window-wise attention and focal attention — current state-of-the-art SA methods — noting how the Swin Transformer’s simple window-shift strategy enables it to outperform ResNets across various vision tasks. Focal attention, meanwhile, was introduced to expand the receptive field by additionally aggregating summarized tokens from more distant locations, enabling the model to capture coarse-grained, long-range visual dependencies. Both methods, however, involve heavy aggregation between the query and a large number of spatially distributed tokens. This leads the researchers to ask: “Is there a more efficient and effective way than (hybrid) SA to model input-dependent long-range interactions?”
The team’s proposed method first aggregates contexts around each query, then modulates the query with the aggregated context. This enables input-dependent token interaction while significantly simplifying the process and making the interactions relatively lightweight. Moreover, it is possible to apply query-agnostic aggregations to generate summarized tokens at different levels of granularity. These summarized contexts can then be selectively aggregated according to the query content, and fused into the query vector.
Similar to focal attention, the proposed focal modulation method performs multiple levels of aggregation to capture fine- and coarse-grained visual contexts. But unlike focal attention, which extracts summarized tokens at target locations and then applies attention, focal modulation extracts them at each query position and replaces the attention with a simple modulation. Replacing normal SA with the proposed method results in the simpler, attention-free FocalNet architecture.
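To make the aggregate-then-modulate idea concrete, below is a minimal PyTorch-style sketch of a focal modulation layer following the description above: hierarchical contexts built with depthwise convolutions, per-level gating driven by the query content, and elementwise modulation of a projected query. The module name, projection layout and kernel-size schedule here are our own assumptions for illustration, not the official FocalNet implementation (which is available on GitHub, as noted below).

```python
import torch
import torch.nn as nn


class FocalModulationSketch(nn.Module):
    """Illustrative sketch of focal modulation (not the official implementation).

    Pipeline: project the input into query, context and gate maps; build a
    hierarchy of contexts with depthwise convolutions; gate and sum them into
    a single modulator; then modulate the query by elementwise multiplication.
    """

    def __init__(self, dim, focal_levels=3, base_kernel=3):
        super().__init__()
        # One linear projection produces the query, the context and (L + 1) gates.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Hierarchical contextualization: depthwise convs with growing kernels.
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # modulator projection
        self.proj_out = nn.Linear(dim, dim)
        self.focal_levels = focal_levels

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(
            self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)       # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)   # (B, L + 1, H, W)

        ctx_all = 0
        for l, layer in enumerate(self.context_layers):
            ctx = layer(ctx)                              # coarser context each level
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]   # query-dependent gating
        # Global context from average pooling at the top level.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]

        modulator = self.h(ctx_all).permute(0, 2, 3, 1)   # back to (B, H, W, C)
        return self.proj_out(q * modulator)               # modulate the query
```

A full FocalNet block would then wrap such a module with layer normalization and an MLP, much as a transformer block wraps its SA module.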
The paper identifies several benefits of their proposed focal modulation approach:
- It can naturally leverage the built-in convolution operation for fast and translation-invariant context encoding (or contextualization).
- It does not require window partitioning, positional embeddings, separate heads, softmax, etc., which allows fast adaptation to different resolutions and tasks.
- With a few stacked contextualization levels, it can rapidly capture a large effective receptive field, and is thus more efficient for high-resolution image encoding than SA and focal attention models (see the receptive-field sketch after this list).
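As a rough illustration of the receptive-field point above, the snippet below computes the effective receptive field of a stack of stride-1 depthwise convolutions; the kernel sizes 3, 5 and 7 are an assumed schedule matching the sketch above, not values quoted from the paper.

```python
def effective_receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convolutions:
    each layer with kernel size k widens the field by k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


# Three focal levels with kernels 3, 5 and 7 already cover a 13x13 region;
# a final global-pooling level then extends coverage to the whole feature map.
print(effective_receptive_field([3, 5, 7]))  # 13
```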
To validate FocalNet’s effectiveness, the team performed experiments on tasks such as image classification, object detection and segmentation.
In the evaluations, tiny- and base-model FocalNets achieved 82.3 percent and 83.9 percent top-1 accuracy respectively, with throughput comparable to the Swin Transformer and double that of the Focal Transformer. The proposed models also recorded top-1 accuracies of 86.5 percent and 87.3 percent at 224×224 and 384×384 resolutions when pretrained on ImageNet-22K, at a cost similar to the Swin Transformer. In object detection on the COCO dataset, FocalNets achieved 46.1 and 49.0 box mean average precision (mAP) with Mask R-CNN under a 1x schedule, surpassing Swin trained with a 3x schedule (46.0 and 48.5 box mAP). FocalNets also outperformed Swin on semantic segmentation tasks and demonstrated superior performance across model sizes when applied to monolithic ViT-style architectures.
Overall, the study shows that the proposed FocalNets can consistently and significantly outperform their state-of-the-art SA counterparts on a wide range of tasks and with comparable costs, validating focal modulation as a strong and promising alternative architecture for effective and efficient visual modelling.
The FocalNet code is available on the project’s GitHub. The paper Focal Modulation Networks is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
