
DeepMind Unlocks Web-Scale Training for Open-World Detection

Open-vocabulary object detection plays a crucial role in numerous real-world computer vision tasks. However, the scarcity of detection training data and the fragility of pre-trained representations often leave trained models short of their scaling potential.

Although the shortage of detection data could potentially be addressed by using Web image-text pairs as a form of weak supervision, such supervision has yet to be exploited for detection training at a scale comparable to image-level training.

In response to this issue, a DeepMind research team introduces the OWLv2 model in their latest paper, Scaling Open-Vocabulary Object Detection. The optimized architecture improves training efficiency, and applying the proposed OWL-ST self-training recipe to OWLv2 substantially improves detection performance, achieving state-of-the-art results on the open-vocabulary detection task.

The goal of this work is to optimize the label space, annotation filtering, and training efficiency of the open-vocabulary detection self-training approach, so as to yield strong and scalable open-vocabulary performance from little labeled data.

The proposed simple self-training approach consists of three steps: 1) The team first uses an existing open-vocabulary detector, OWL-ViT CLIP-L/14, to generate bounding-box pseudo-annotations for all images in WebLI, a large Web image-text dataset; 2) They then self-train a new detector on these pseudo-annotations; 3) Finally, they fine-tune the self-trained model on human-annotated detection data, which further improves detection performance.
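The recipe can be summarized with a minimal sketch. The function names, the confidence threshold, the caption-based label space, and the detector/training interfaces below are hypothetical stand-ins for illustration, not the paper's actual code; they only show the pseudo-annotate, self-train, then fine-tune flow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    # Hypothetical pseudo-annotation record; field names are illustrative only.
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float


def pseudo_annotate(detector, image, queries: List[str], min_score: float = 0.3) -> List[Box]:
    """Step 1: run an existing open-vocabulary detector (OWL-ViT CLIP-L/14 in the paper)
    on a Web image with text queries derived from its caption, keeping only
    sufficiently confident boxes as pseudo-annotations."""
    boxes = detector.detect(image, queries)  # assumed detector interface
    return [b for b in boxes if b.score >= min_score]


def owl_st(existing_detector, new_detector, webli_pairs, human_annotated_data):
    # Step 1: pseudo-annotate the large Web image-text corpus (WebLI).
    pseudo_data = []
    for image, caption in webli_pairs:
        queries = caption.split()  # naive label-space choice, for illustration only
        pseudo_data.append((image, pseudo_annotate(existing_detector, image, queries)))

    # Step 2: self-train a new detector on the pseudo-annotations.
    new_detector.fit(pseudo_data)  # assumed training interface

    # Step 3 (optional): fine-tune on human-annotated detection data.
    new_detector.fit(human_annotated_data)
    return new_detector
```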

In particular, the researchers use a variant of the OWL-ViT architecture to train better detectors. In this architecture, they leverage contrastively trained image-text models to initialize the image and text encoders, while the detection heads are randomly initialized.
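A minimal PyTorch-style sketch of that initialization pattern is shown below. The module sizes, layer counts, and head shapes are made up for illustration; the "pretrained" encoders are created in-memory as stand-ins for weights that would in practice come from a contrastively trained image-text model, while the box and class heads keep their default random initialization.

```python
import torch.nn as nn

EMBED_DIM = 512  # illustrative size, not the actual OWL-ViT/CLIP choice


class OpenVocabDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoders: intended to be initialized from a contrastive image-text model.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True),
            num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True),
            num_layers=2)
        # Detection heads: left with their default random initialization.
        self.box_head = nn.Linear(EMBED_DIM, 4)            # per-token box coordinates
        self.class_head = nn.Linear(EMBED_DIM, EMBED_DIM)  # projection into the query embedding space


detector = OpenVocabDetector()

# Stand-ins for the contrastively pre-trained encoders; in practice these weights
# would be loaded from the pre-trained image-text checkpoint.
pretrained = OpenVocabDetector()
detector.image_encoder.load_state_dict(pretrained.image_encoder.state_dict())
detector.text_encoder.load_state_dict(pretrained.text_encoder.state_dict())
# box_head and class_head remain randomly initialized and are trained from scratch.
```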

In the training stage, they use the same losses as OWL-ViT and augment the queries with "pseudo-negatives", while optimizing training efficiency to maximize the number of images seen for a given amount of compute. They also adopt previously proposed practices for large-scale Transformer training. Together, these changes let the resulting OWLv2 model reduce training FLOPs by approximately 50% and improve training throughput by 2× compared to the original OWL-ViT.
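The "pseudo-negatives" idea can be illustrated with a small sketch, assuming a hypothetical global pool of label strings: each image's positive queries are padded with randomly drawn labels that do not appear among its annotations, so the detector also sees queries for objects that are not present. The pool, query count, and sampling scheme below are illustrative, not the paper's exact settings.

```python
import random
from typing import List, Set


def augment_with_pseudo_negatives(
    positive_queries: List[str],
    global_label_pool: Set[str],
    num_queries: int = 16,
    seed: int = 0,
) -> List[str]:
    """Pad an image's positive text queries with randomly sampled labels that are
    not among its positives ("pseudo-negatives")."""
    rng = random.Random(seed)
    negatives_pool = list(global_label_pool - set(positive_queries))
    num_negatives = max(0, num_queries - len(positive_queries))
    pseudo_negatives = rng.sample(negatives_pool, min(num_negatives, len(negatives_pool)))
    return positive_queries + pseudo_negatives


# Example: an image whose pseudo-annotations mention only "dog" and "ball".
queries = augment_with_pseudo_negatives(
    ["dog", "ball"],
    {"dog", "ball", "car", "tree", "person", "bicycle", "cat", "chair"},
)
print(queries)  # e.g. ['dog', 'ball', 'cat', 'tree', 'person', 'car']
```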

In their empirical study, the team compared the proposed approach against previous state-of-the-art open-vocabulary detectors: OWL-ST improves AP on LVIS rare classes from 31.2% to 44.6%, and combining the OWL-ST recipe with the OWLv2 architecture sets a new state of the art.

Overall, the proposed OWL-ST recipe delivers significant improvements in detection performance using weak supervision from large-scale Web data, unlocking Web-scale training for open-world localization.

The paper Scaling Open-Vocabulary Object Detection is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


