A new study by researchers from South China University of Technology and Tencent WeChat AI is the latest fruitful attempt to apply transformer architectures to object detection. The team proposes a pretext task they call random query patch detection to pretrain DETR (DEtection TRansformer) for object detection without supervision. The resulting Unsupervised Pre-training DETR (UP-DETR) significantly improves DETR performance, with faster convergence and higher precision on the popular object detection datasets PASCAL VOC and COCO.
Introduced this May by Facebook AI Research, the DETR framework treats object detection as a direct set prediction problem solved with a transformer encoder-decoder architecture. It has reached performance competitive with strong established methods such as the highly optimized Faster R-CNN baseline.
"However, DETR comes with training and optimization challenges, which need large-scale training data and an extremely long training schedule," the team notes. Such drawbacks have been holding back further DETR performance improvements. Delving into the DETR structure, the researchers observed that while the CNN backbone had been pretrained to extract good visual representations, the transformer module had not been pretrained at all.
Could this be the key to better performance? Unsupervised visual representation learning has seen remarkable progress with well-designed pretext tasks, with models such as MoCo and SwAV standing out. But existing pretext tasks cannot be directly applied to pretrain DETR, which mainly requires spatial localization learning rather than the image instance-based or cluster-based contrastive learning those tasks target.
Generally, unsupervised learning pipelines in computer vision comprise a pretext task and a real downstream task, such as classification or detection with limited annotated data. The pretext task learns visual representations that are then transferred to the downstream task.
The team set out to design a novel pretext task for pretraining transformers based on the DETR architecture for object detection, developing a random query patch detection method to pretrain an UP-DETR detector without any human annotations. After randomly cropping multiple query patches from the input images, they pretrained the transformer for detection, predicting bounding boxes of query patches in the given image. This approach solved two critical issues:
- Multi-task learning: preventing query patch detection from destroying the classification features learned during backbone pretraining.
- Multi-query localization: different object queries specialize in different positions and box sizes. For multiple query patches, the researchers developed object query shuffle and attention mask approaches to solve the assignment problem between query patches and object queries.
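The two core mechanics described above can be sketched in a few lines. The snippet below is an illustrative simplification, not the authors' implementation: the function names, the uniform patch-size sampling scheme, and the additive `-inf` mask convention are assumptions made for this sketch.

```python
import numpy as np

def random_query_patches(image, num_patches=10, rng=None):
    """Crop random patches from `image` (H, W, C) and return them with
    DETR-style normalized (cx, cy, w, h) box targets, in the spirit of
    UP-DETR's random query patch detection pretext task (sketch only)."""
    rng = np.random.default_rng(rng)
    H, W = image.shape[:2]
    patches, boxes = [], []
    for _ in range(num_patches):
        # Sample a patch height/width and a valid top-left corner uniformly.
        h = int(rng.integers(1, H + 1))
        w = int(rng.integers(1, W + 1))
        y = int(rng.integers(0, H - h + 1))
        x = int(rng.integers(0, W - w + 1))
        patches.append(image[y:y + h, x:x + w])
        # Normalized center-x, center-y, width, height, as DETR predicts.
        boxes.append(((x + w / 2) / W, (y + h / 2) / H, w / W, h / H))
    return patches, np.array(boxes)

def group_attention_mask(num_queries, num_groups):
    """Block-diagonal additive attention mask: object queries attend only
    within their own group, decoupling the query patches assigned to
    different groups (a sketch of UP-DETR's attention mask idea; assumes
    num_queries divides evenly into num_groups)."""
    mask = np.full((num_queries, num_queries), -np.inf)
    per_group = num_queries // num_groups
    for g in range(num_groups):
        s = g * per_group
        mask[s:s + per_group, s:s + per_group] = 0.0  # allow attention
    return mask
```

During pretraining, each cropped patch would be encoded by the frozen CNN backbone and added to a group of object queries, and the transformer's predicted boxes would be matched against the normalized targets; the mask keeps queries from peeking at patches assigned to other groups.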
In evaluations, UP-DETR outperformed DETR by a large margin with higher precision and much faster convergence. On the challenging COCO dataset, UP-DETR delivered 42.8 AP (Average Precision) with a ResNet50 backbone, outperforming DETR in both convergence speed and precision.
The researchers say they hope future studies can integrate CNN and transformer pretraining into a unified end-to-end framework and apply UP-DETR to additional downstream tasks such as few-shot object detection and object tracking.
The paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers is on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen