Open-vocabulary object detection plays a crucial role in numerous real-world computer vision tasks. However, due to the scarcity of detection training data and the fragility of pre-trained representations, the performance of trained models often falls short, revealing a lack of scaling potential.
Although the shortage of detection data can be addressed by using Web image-text pairs as a form of weak supervision, this approach has not yet been applied at scales comparable to image-level pretraining.
In response to this issue, the DeepMind research team introduces the OWLv2 model and the OWL-ST self-training recipe in their latest paper, Scaling Open-Vocabulary Object Detection. The optimized OWLv2 architecture improves training efficiency, and applying the OWL-ST self-training recipe to it substantially improves detection performance, achieving state-of-the-art results on the open-vocabulary detection task.
The goal of this work is to optimize the choice of label space, annotation filtering, and training efficiency for the open-vocabulary detection self-training approach, yielding strong and scalable open-vocabulary performance with little labeled data.
The proposed simple self-training approach consists of three steps: 1) the team first uses an existing open-vocabulary detector, OWL-ViT CLIP-L/14, to annotate all images in WebLI, a large Web image-text dataset, with bounding box pseudo-annotations; 2) they then self-train a new detector on these pseudo-annotations; 3) in the last step, they fine-tune the self-trained model on human-annotated detection data, which further improves detection performance.
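The pseudo-annotation stage of this recipe can be sketched roughly as follows. This is a minimal illustration rather than the team's implementation: the `toy_detector` stand-in, the word-level `queries_from_caption` label space, and the `min_score` threshold of 0.3 are all assumptions made for the example, and the later training and fine-tuning stages are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


@dataclass
class PseudoBox:
    box: Box
    label: str
    score: float


# A detector is any callable mapping (image id, text queries) -> scored boxes.
Detector = Callable[[str, List[str]], List[PseudoBox]]


def queries_from_caption(caption: str) -> List[str]:
    """Hypothetical label space: each distinct lowercase word of the caption."""
    seen: List[str] = []
    for word in caption.lower().split():
        word = word.strip(".,!?")
        if word and word not in seen:
            seen.append(word)
    return seen


def pseudo_annotate(pairs, detector: Detector, min_score: float = 0.3):
    """Annotate image-text pairs and keep only confident pseudo-boxes."""
    dataset = []
    for image_id, caption in pairs:
        boxes = detector(image_id, queries_from_caption(caption))
        kept = [b for b in boxes if b.score >= min_score]
        if kept:  # drop images that have no confident pseudo-box
            dataset.append((image_id, kept))
    return dataset


def toy_detector(image_id: str, queries: List[str]) -> List[PseudoBox]:
    """Stand-in for a real open-vocabulary detector; the scores are made up."""
    return [
        PseudoBox((0.1, 0.1, 0.5, 0.5), q, 0.9 if q == "dog" else 0.05)
        for q in queries
    ]


pairs = [("img_0", "A dog on a beach."), ("img_1", "Sunset over hills.")]
pseudo_dataset = pseudo_annotate(pairs, toy_detector)
```

Here only `img_0` survives filtering, with a single pseudo-box labeled `dog`. In a real pipeline the filtered pseudo-annotations would then serve as training targets for the new detector.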
In particular, the researchers use a variant of the OWL-ViT architecture to train better detectors. In this architecture, they use contrastively trained image-text models to initialize the image and text encoders, while the detection heads are initialized randomly.
In the training stage, they use the same losses as OWL-ViT and, as in OWL-ViT, augment queries with “pseudo-negatives”. They also optimize training efficiency to maximize the number of images seen for a given amount of compute, adopting previously proposed practices for large-scale Transformer training. Together, these changes mean the resulting OWLv2 model reduces training FLOPs by approximately 50% and speeds up training throughput by 2× compared to the original OWL-ViT.
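One way to picture the query augmentation with pseudo-negatives is the sketch below. It is an illustrative assumption, not the paper's exact procedure: the function name `add_pseudo_negatives`, the uniform sampling, and the fixed `total_queries` budget are all hypothetical choices for the example.

```python
import random
from typing import List


def add_pseudo_negatives(
    positives: List[str],
    query_pool: List[str],
    total_queries: int,
    rng: random.Random,
) -> List[str]:
    """Pad an image's queries with labels sampled from a pool of other queries."""
    positive_set = set(positives)
    # Candidate negatives are pool entries that are not positives for this image.
    candidates = sorted(set(query_pool) - positive_set)
    n_needed = max(0, total_queries - len(positives))
    negatives = rng.sample(candidates, min(n_needed, len(candidates)))
    return list(positives) + negatives


rng = random.Random(0)
pool = ["dog", "cat", "car", "tree", "boat", "lamp"]
queries = add_pseudo_negatives(["dog", "tree"], pool, total_queries=5, rng=rng)
```

The returned list keeps the image's own labels first and fills the remaining slots with labels that should not match anything in the image, giving the detector explicit negatives to score against.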
In their empirical study, the team compared their proposed approach with previous state-of-the-art open-vocabulary detectors. OWL-ST improves AP on LVIS rare classes from 31.2% to 44.6%, and combining the OWL-ST recipe with the OWLv2 architecture achieves new state-of-the-art performance.
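To put the reported LVIS rare-class numbers in perspective, the short calculation below derives the absolute and relative gains from the two AP values quoted above. The variable names are only illustrative.

```python
baseline_ap = 31.2  # AP on LVIS rare classes before OWL-ST (percent)
owl_st_ap = 44.6    # AP on LVIS rare classes with OWL-ST (percent)

absolute_gain = owl_st_ap - baseline_ap      # gain in percentage points
relative_gain = absolute_gain / baseline_ap  # gain as a fraction of the baseline

print(f"{absolute_gain:.1f} points, {relative_gain:.0%} relative")
# prints "13.4 points, 43% relative"
```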
Overall, the proposed OWL-ST recipe delivers significant improvements in detection performance using weak supervision from large-scale Web data, unlocking Web-scale training for open-world localization.
The paper Scaling Open-Vocabulary Object Detection is on arXiv.
Author: Hecate He | Editor: Chain Zhang