Object detection is a fundamental computer vision task, typically performed by detectors comprising a task-agnostic backbone and independently developed necks and heads that incorporate detection-specific prior knowledge. Due to the de facto design of convolutional networks (ConvNets), the most commonly used backbones have been multi-scale, hierarchical architectures.
Recently introduced vision transformers (ViTs) have shown their potential as backbones for visual recognition tasks. The original ViT, however, is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout, making it less effective than ConvNets at handling multi-scale objects and high-resolution images in detection. This raises a natural question for computer vision researchers: Is a plain ViT too inefficient for high-resolution detection tasks, and should hierarchical designs be re-introduced into the backbone?
The new Meta AI paper Exploring Plain Vision Transformer Backbones for Object Detection makes the case for using the plain, non-hierarchical ViT as a backbone network for object detection. It proposes a design that enables the original ViT to be fine-tuned for detection without redesigning a hierarchical backbone for pretraining. The paper notes that this decoupling of pretraining design from fine-tuning demands preserves the independence of upstream and downstream tasks, as has been the case in ConvNet-based research.

The researchers show that with minimal adaptations for fine-tuning, their plain-backbone ViT Detector (ViTDet) can achieve performance competitive with detectors based on traditional hierarchical backbones.

The proposed ViTDet builds a simple feature pyramid from only the last feature map of a plain ViT backbone and uses simple non-overlapping window attention to efficiently extract features from high-resolution images. A small number of cross-window blocks, which may use either global attention or convolutions, are also adopted to propagate information across windows. These adaptations are made only during fine-tuning and so do not affect pretraining.
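To make the idea concrete, here is a minimal, hedged PyTorch sketch of how a simple feature pyramid might be built from the single stride-16 feature map of a plain ViT. The module name, channel widths, and the choice of strides {4, 8, 16, 32} are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Illustrative sketch (not the official ViTDet code): build multi-scale
    feature maps from one stride-16 ViT feature map by up/down-sampling it,
    instead of tapping intermediate stages of a hierarchical backbone."""

    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.scale_ops = nn.ModuleList([
            # stride 4: upsample x4 with two transposed convolutions
            nn.Sequential(
                nn.ConvTranspose2d(in_dim, in_dim // 2, kernel_size=2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(in_dim // 2, in_dim // 4, kernel_size=2, stride=2),
            ),
            # stride 8: upsample x2
            nn.ConvTranspose2d(in_dim, in_dim // 2, kernel_size=2, stride=2),
            # stride 16: keep the map as-is
            nn.Identity(),
            # stride 32: downsample x2
            nn.MaxPool2d(kernel_size=2, stride=2),
        ])
        out_channels = [in_dim // 4, in_dim // 2, in_dim, in_dim]
        # project every scale to a common width, as detection heads usually expect
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_dim, kernel_size=1) for c in out_channels
        )

    def forward(self, x):
        # x: (B, C, H/16, W/16), the output of the last ViT block
        return [lat(op(x)) for op, lat in zip(self.scale_ops, self.lateral)]

# With a 1024x1024 input and 16x16 patches, the last ViT map is 64x64, so this
# sketch would yield maps of 256x256, 128x128, 64x64, and 32x32.
pyramid = SimpleFeaturePyramid()(torch.randn(1, 768, 64, 64))
```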
An empirical study reveals that the ViTDet’s simple design achieves surprisingly good results, with the researchers concluding:
- It is sufficient to build a simple feature pyramid from a single-scale feature map, without the common feature pyramid network (FPN) design.
- It is sufficient to use window attention (without shifting) aided by a very small number of cross-window propagation blocks, as illustrated in the sketch below.
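As an illustration of that second point, the following sketch shows non-overlapping window attention inside a transformer block, with a few blocks switched to global attention so that information can propagate across windows. The block structure, window size, and the indices of the global blocks are assumptions made for this example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split (B, H, W, C) into non-overlapping ws x ws windows: (B*nW, ws*ws, C).
    Assumes H and W are divisible by ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_unpartition(windows, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class Block(nn.Module):
    """One ViT block whose attention is either windowed (local) or global."""

    def __init__(self, dim=768, heads=12, window_size=14, use_global=False):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window_size = window_size
        self.use_global = use_global

    def forward(self, x):                # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm(x)
        if self.use_global:
            # full global attention: every token attends to every other token
            seq = x.reshape(B, H * W, C)
            seq, _ = self.attn(seq, seq, seq)
            x = seq.reshape(B, H, W, C)
        else:
            # attention restricted to each non-overlapping window
            win = window_partition(x, self.window_size)
            win, _ = self.attn(win, win, win)
            x = window_unpartition(win, self.window_size, H, W)
        return shortcut + x              # MLP sub-layer omitted for brevity

# Example: 12 blocks, with only 4 evenly spaced blocks using global attention
blocks = nn.ModuleList(
    Block(use_global=(i in {2, 5, 8, 11})) for i in range(12)
)
```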


Even more surprisingly, the team finds that ViTDet can compete with leading hierarchical-backbone detectors such as Swin and Multiscale Vision Transformers (MViT). Leveraging Masked Autoencoder (MAE) pretraining, ViTDet outperforms these hierarchical counterparts, reaching up to 61.3 AP on bounding-box object detection on the COCO dataset using only ImageNet-1K pretraining.
Overall, this work demonstrates that plain-backbone detection has significant potential in object detection tasks. The proposed approach largely maintains the independence of strong general-purpose backbones and downstream task-specific designs, a decoupling of pretraining from fine-tuning that the team hopes may also benefit and consolidate research efforts in the computer vision and natural language processing fields.
The paper Exploring Plain Vision Transformer Backbones for Object Detection is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.