It’s often said that “a picture is worth a thousand words.” Most object detectors used in contemporary multimodal understanding systems, however, can only identify a fixed vocabulary of objects and attributes in an input image. These independently pretrained object detectors are essentially black boxes, with perceptive capability restricted to the detected objects rather than the entire image. Moreover, such systems preclude co-training with other modalities as context, resulting in an inability to recognize novel combinations of concepts expressed in free-form text.
To address these issues, a research team from NYU and Facebook has proposed MDETR, an end-to-end modulated detector that identifies objects in an image conditioned on a raw text query and is able to capture a long tail of visual concepts expressed in free-form text.
Based on the DETR detection system introduced by Facebook in 2020, MDETR performs object detection with natural language understanding, enabling end-to-end multimodal reasoning. It relies solely on text and aligned boxes as supervision for concepts in an image and can detect nuanced concepts from free-form text.
The researchers summarize their study’s contributions as:
- Introduce an end-to-end text-modulated detection system derived from the DETR detector.
- Demonstrate that the modulated detection approach can be applied seamlessly to solve tasks such as phrase grounding and referring expression comprehension, setting new state-of-the-art performance on both these tasks using datasets having synthetic as well as real images.
- Show that good modulated detection performance naturally translates to downstream task performance, for instance achieving competitive performance on visual question answering, referring expression segmentation, and on few-shot long-tailed object detection.
In the MDETR architecture, images are encoded by a convolutional backbone and texts are encoded by a pretrained transformer language model such as RoBERTa. With visual and text features at hand, a modality-dependent linear projection then projects both to a shared embedding space. The resulting feature vectors are concatenated and fed to a transformer encoder-decoder that predicts the represented objects’ bounding boxes and corresponding text.
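The fusion step described above can be sketched roughly as follows. This is a minimal illustration only: the dimensions, the random stand-ins for the backbone and RoBERTa outputs, and the plain matrix projections are assumptions for clarity, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative assumptions, not the paper's values.
d_img, d_txt, d_model = 256, 768, 64
num_img_tokens, num_txt_tokens = 49, 12  # e.g. a 7x7 feature map, 12 text tokens

# Stand-ins for the convolutional backbone and RoBERTa encoder outputs.
img_feats = rng.standard_normal((num_img_tokens, d_img))
txt_feats = rng.standard_normal((num_txt_tokens, d_txt))

# Modality-dependent linear projections into the shared embedding space.
W_img = rng.standard_normal((d_img, d_model)) / np.sqrt(d_img)
W_txt = rng.standard_normal((d_txt, d_model)) / np.sqrt(d_txt)

img_proj = img_feats @ W_img
txt_proj = txt_feats @ W_txt

# Concatenate along the sequence axis; in MDETR this joint sequence is then
# fed to a transformer encoder-decoder that predicts boxes and aligned text.
joint_seq = np.concatenate([img_proj, txt_proj], axis=0)
print(joint_seq.shape)  # -> (61, 64)
```

The key design point is that both modalities end up as tokens of the same width in one sequence, so the downstream transformer can attend freely across image regions and words.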
The researchers conducted experiments on the CLEVR dataset to evaluate MDETR’s performance. Their setup employed a ResNet-18 model pretrained on ImageNet as the convolutional backbone, a pretrained DistilRoBERTa as the text encoder, and a final transformer identical to DETR’s.
In a zero-shot setting, MDETR demonstrated substantially better generalization than the best competing model. Notably, MDETR’s accuracy on the CLEVR-Ref+ dataset reached 100 percent, greatly outperforming other approaches.
The researchers also evaluated the proposed model on four downstream tasks — referring expression comprehension and segmentation, visual question answering and phrase grounding — where it achieved state-of-the-art results on popular benchmarks.
The results validate the proposed approach’s strong performance on multimodal understanding tasks as well as its potential in downstream applications. The researchers believe this work can contribute to the development of fully integrated multimodal architectures that do not rely on black-box object detectors.
The paper MDETR – Modulated Detection for End-to-End Multi-Modal Understanding is on arXiv.
Author: Hecate He | Editor: Michael Sarazen