A new paper from the Institute of Automation, CAS, and Microsoft Research Asia presents a novel attention-based decoder module designed to better integrate different computer vision (CV) object representations.
Today’s CV object detection frameworks have improved greatly, but most target and deliver strong performance on only one aspect of an object’s structure. Because different representations extract features in heterogeneous ways (rectangle box, center point, point set, etc.), it has been difficult to integrate them into one framework. The researchers propose a general module, BVR (bridging visual representations), which combines these visual representations and their different strengths in a single framework.
The researchers applied an attention-based decoder module, similar to that in Transformer architectures, to model dependencies between the heterogeneous features. It takes an object detector’s main representations as the query input, while other visual representations enhance the query features with regard to appearance and geometric relationships.
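To make the query/key idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `bridge_attention`, the `geo_scale` parameter, and the use of negative squared distance as the geometric term are all assumptions chosen for illustration, and the learned projection layers a real Transformer decoder would use are omitted. It only shows how features from a main representation (queries) could be enhanced by features from another representation (keys), using a score that mixes appearance similarity with a geometric term.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bridge_attention(query_feats, query_pos, key_feats, key_pos, geo_scale=1.0):
    """Enhance each query feature with an attention-weighted sum of key features.

    query_feats / key_feats: lists of feature vectors of the same dimension.
    query_pos / key_pos: lists of (x, y) locations for those features.
    geo_scale: hypothetical weight of the illustrative geometric term.
    """
    dim = len(query_feats[0])
    enhanced = []
    for f_q, p_q in zip(query_feats, query_pos):
        scores = []
        for f_k, p_k in zip(key_feats, key_pos):
            # Appearance term: scaled dot-product similarity.
            appearance = dot(f_q, f_k) / math.sqrt(dim)
            # Geometric term (assumption): nearer keys get higher scores.
            dist2 = sum((a - b) ** 2 for a, b in zip(p_q, p_k))
            scores.append(appearance - geo_scale * dist2)
        weights = softmax(scores)
        # Weighted sum of key features, one value per feature dimension.
        attended = [
            sum(w * f_k[d] for w, f_k in zip(weights, key_feats))
            for d in range(dim)
        ]
        # Residual connection: the query feature is enhanced, not replaced.
        enhanced.append([a + b for a, b in zip(f_q, attended)])
    return enhanced
```

In this sketch the queries might be box-level features and the keys center-point or corner-point features; the residual addition keeps the original detector feature and adds to it whatever the other representation contributes.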
The team says it addressed the module's heavy computation and memory costs by incorporating two novel techniques, a key sampling approach and a shared location embedding approach, into the BVR module. The module is general, works in-place, and is convenient to use.
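The article does not spell out how key sampling works, so the following is only one plausible reading, sketched under the assumption that every candidate key carries a confidence score. The helper `sample_top_k_keys` and its parameters are hypothetical. The point it illustrates is that attending to a small, fixed number of high-scoring keys, rather than every location in a feature map, bounds the cost of the attention step.

```python
def sample_top_k_keys(key_feats, key_pos, key_scores, k):
    """Keep only the k highest-scoring keys (hypothetical helper).

    key_feats: list of key feature vectors.
    key_pos: list of key locations, aligned with key_feats.
    key_scores: list of per-key confidence scores (an assumption).
    Returns the selected features and positions, highest score first.
    """
    order = sorted(
        range(len(key_scores)),
        key=lambda i: key_scores[i],
        reverse=True,
    )[:k]
    return [key_feats[i] for i in order], [key_pos[i] for i in order]
```

With `k` fixed, the number of query-key pairs grows with the number of queries only, not with the full resolution of the feature map.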
In experiments, BVR proved effective with the prevalent object detectors RetinaNet, Faster R-CNN, FCOS and ATSS. The researchers applied BVR to each detector and compared the results with state-of-the-art methods, with BVR delivering improvements of about 1.5 to 3.0 AP (average precision).
The team hopes the work will help researchers build better object detection algorithms and benefit object-oriented visual applications. They also caution, as we increasingly see in the CV domain, that care should be taken to avoid issues with biased training data or improper deployment.
The paper RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder has been accepted by the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) and is on arXiv. The code will soon be made available on GitHub.
Analyst: Reina Qi Wan | Editor: Michael Sarazen; Yuan Yuan