End-to-end transformer-based object detectors (DETRs) play a crucial role in applications such as object tracking, video surveillance and autonomous driving. Although DETRs have made significant progress in accuracy and do not need non-maximum suppression (NMS) post-processing, their high computational cost has limited their practical use, while existing real-time detectors suffer inference delays caused by NMS.
In the new paper DETRs Beat YOLOs on Real-Time Object Detection, a Baidu Inc. research team presents Real-Time Detection Transformer (RT-DETR), a real-time end-to-end object detector that uses a hybrid encoder and a novel IoU-aware query selection scheme to cut computational cost while avoiding NMS-related delays. RT-DETR outperforms YOLO object detectors of the same scale in both accuracy and speed.

The team summarizes their main contributions as follows:
- We propose the first real-time end-to-end object detector, which not only outperforms current state-of-the-art real-time detectors in terms of accuracy and speed, but also requires no post-processing, so the inference speed is not delayed and remains stable.
- We analyze the influence of NMS on real-time detectors in detail and draw a conclusion about CNN-based real-time detectors from a post-processing perspective.
- Our proposed IoU-aware query selection shows excellent performance improvement in our model, which sheds new light on improving the initialization scheme of object queries.
- Our work provides a feasible solution for the real-time implementation of end-to-end detectors, and the proposed detector can flexibly adjust the model size and the inference speed by using different decoder layers without the need for retraining.
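
For context on the post-processing step these contributions refer to, a minimal sketch of greedy NMS is given below. It is a generic textbook version, not code from the paper; the function names, the box format `(x1, y1, x2, y2)` and the default thresholds are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Return indices of kept boxes, highest score first (thresholds are illustrative)."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]
```

In this sketch the amount of work grows with the number of boxes that survive the score threshold, which gives an intuition for why NMS can make inference time vary from image to image. An end-to-end detector skips this step entirely.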

The proposed RT-DETR pipeline comprises a ResNet/HGNetv2 backbone, a hybrid encoder, and a transformer decoder with auxiliary prediction heads. The features of the last three stages of the backbone are fed into the hybrid encoder, which transforms these multi-stage features into a sequence of image features using intra-scale feature interaction and cross-scale fusion. A novel IoU-aware query selection technique is then employed to select a fixed number of image features from the encoder output sequence, which serve as the initial object queries for the decoder. Finally, the decoder iteratively optimizes the object queries to produce bounding boxes and confidence scores.
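
The query selection step described above can be pictured as a top-K pick over the encoder output. The sketch below is a simplified illustration, not the authors' implementation: it assumes each encoder feature comes with per-class scores and that the highest-scoring features become the initial queries. The name `select_queries` and the toy data are hypothetical, and how the scores are made IoU-aware during training is outside the scope of the sketch.

```python
def select_queries(features, class_scores, num_queries):
    """Pick the num_queries encoder features with the highest class score.

    features: list of feature vectors, one per encoder output position.
    class_scores: list of per-class score lists, same length as features.
    Returns (selected indices, selected features), best score first.
    """
    if len(features) != len(class_scores):
        raise ValueError("features and class_scores must have the same length")
    # Rank positions by their best class score, in descending order.
    best = [max(s) for s in class_scores]
    top = sorted(range(len(features)), key=lambda i: best[i], reverse=True)
    top = top[:num_queries]
    return top, [features[i] for i in top]


# Toy encoder output: four positions, two-dimensional features, two classes.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
class_scores = [[0.2, 0.1], [0.9, 0.3], [0.05, 0.6], [0.4, 0.4]]

indices, queries = select_queries(features, class_scores, num_queries=2)
print(indices)  # [1, 2]
print(queries)  # [[0.3, 0.4], [0.5, 0.6]]
```

Because the number of selected features is fixed, the decoder always receives the same number of queries regardless of how many objects an image contains.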
By applying their hybrid design in place of the original transformer encoder, the team reduces computational redundancy and enables RT-DETR to efficiently process features at different scales. The IoU-aware query selection, meanwhile, provides higher-quality initial object queries to the decoder to further boost model performance. The detector also allows inference speed to be adjusted by using a different number of decoder layers (without any retraining), which facilitates its practical real-time application.
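
One way to picture the speed adjustment is a decoder in which every layer has its own prediction head, so inference can stop after any layer. The sketch below is a toy illustration under that assumption, not the paper's code: `run_decoder`, the six stand-in layers and the stand-in heads are all hypothetical.

```python
def run_decoder(queries, layers, heads, num_layers=None):
    """Refine queries with the first num_layers layers, then apply that layer's head."""
    n = len(layers) if num_layers is None else num_layers
    if not 1 <= n <= len(layers):
        raise ValueError("num_layers must be between 1 and len(layers)")
    for layer in layers[:n]:
        queries = [layer(q) for q in queries]
    # Each layer has its own prediction head, so stopping early still yields outputs.
    return [heads[n - 1](q) for q in queries]


# Toy stand-ins: each "layer" adds 1.0; each "head" tags its output with the layer count.
layers = [lambda q: q + 1.0 for _ in range(6)]
heads = [lambda q, k=k: (k + 1, q) for k in range(6)]

full = run_decoder([0.0, 10.0], layers, heads)
fast = run_decoder([0.0, 10.0], layers, heads, num_layers=3)
print(full)  # [(6, 6.0), (6, 16.0)]
print(fast)  # [(3, 3.0), (3, 13.0)]
```

The same trained weights serve both calls; the shorter run simply does less refinement, trading some accuracy for speed.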

In their empirical study, the team compared RT-DETR with baseline real-time and end-to-end object detectors such as the YOLO series, PP-YOLOE and Efficient-DETR. In the experiments, RT-DETR-L achieved 53.0% AP at 114 FPS and RT-DETR-X achieved 54.8% AP at 74 FPS, surpassing all YOLO detectors of the same scale in both speed and accuracy.
Overall, this work verifies the proposed RT-DETR’s suitability as a real-time end-to-end detector that avoids NMS-related inference delays and achieves state-of-the-art performance in both speed and accuracy compared to YOLOs.
Source code and pretrained models will be available on the PaddleDetection GitHub. The paper DETRs Beat YOLOs on Real-Time Object Detection is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
