The release of the YOLO (You Only Look Once) real-time object detector at CVPR 2016 revolutionized the field of computer vision. YOLO delivered unprecedented speed and accuracy on a fundamental task with applications in autonomous driving, robotics, security, medical image analysis and more. Various techniques and tricks (multi-scale predictions, a better backbone classifier, etc.) have since been introduced to improve YOLO training and boost performance.
A research team from Taiwan’s Institute of Information Science, Academia Sinica furthers YOLO development in their new paper YOLOv7: Trainable Bag-Of-Freebies Sets New State-Of-The-Art for Real-Time Object Detectors. This latest YOLO version introduces novel “extend” and “compound scaling” methods that effectively utilize parameters and computation, enabling it to surpass all known real-time object detectors in both speed and accuracy.
The team summarizes their main contributions as:
- We design several trainable bag-of-freebies methods, so that real-time object detectors can greatly improve their accuracy without increasing the inference cost.
- In the evolution of object detection methods, we identified two new issues: how a re-parameterized module should replace the original module, and how a dynamic label assignment strategy should handle assignment to different output layers. We also propose methods to address the difficulties arising from these issues.
- We propose “extend” and “compound scaling” methods for the real-time object detector that can effectively utilize parameters and computation.
- The proposed method can reduce the parameter count of state-of-the-art real-time object detectors by about 40 percent and their computation by about 50 percent, while achieving faster inference speed and higher detection accuracy.
The team starts by building an efficient architecture. Their extended efficient layer aggregation network (Extended-ELAN, or E-ELAN) uses expand, shuffle, and merge cardinality to continuously enhance the network’s learning ability without changing the original gradient path, i.e. it changes only the computational block and leaves the transition layer untouched.
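The expand-shuffle-merge-cardinality idea can be illustrated with a minimal sketch. This is not the authors' code: it models feature channels as plain Python lists and the three stages as list operations, purely to show how channels from different cardinality groups get mixed while the grouping structure is preserved; all function and label names are hypothetical.

```python
# Conceptual sketch (not the paper's implementation) of E-ELAN's
# expand, shuffle, and merge-cardinality stages, using channel labels
# instead of real feature maps.

def expand(channels, cardinality):
    """Expand: replicate the block's channels into `cardinality`
    parallel groups (as group convolution would)."""
    return [[f"grp{g}:{ch}" for ch in channels] for g in range(cardinality)]

def shuffle(groups):
    """Shuffle: standard channel shuffle — view the channels as a
    (groups x n) grid, transpose, and regroup, so each resulting
    group contains channels from every original group."""
    g, n = len(groups), len(groups[0])
    flat = [ch for grp in groups for ch in grp]
    shuffled = [flat[i * n + j] for j in range(n) for i in range(g)]
    return [shuffled[k * n:(k + 1) * n] for k in range(g)]

def merge(groups):
    """Merge cardinality: concatenate the shuffled groups."""
    return [ch for grp in groups for ch in grp]

out = merge(shuffle(expand(["c0", "c1"], cardinality=2)))
# out = ['grp0:c0', 'grp1:c0', 'grp0:c1', 'grp1:c1'] — channels
# from both groups are interleaved before merging.
```

Because only the computational block is rewired this way, the surrounding transition layers (and hence the gradient path) are left unchanged, matching the design goal stated above.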
For model scaling, the researchers propose a compound scaling method for concatenation-based architectures: when the depth factor of a computational block is scaled, the resulting change in the block’s output channels is calculated, and the transition layers are width-scaled by the same amount. The proposed compound scaling method can thus maintain the properties of the original model design and its optimal structure.
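The arithmetic behind this coupling can be sketched as follows. This is a hypothetical illustration, not the paper's scaling formula: the layer counts and channel widths are made up, and the sketch only shows the key point that depth scaling of a concatenation-based block implies a width change that must be propagated to the transition layer.

```python
# Hypothetical sketch of compound scaling for a concatenation-based block.
# In such a block, layer outputs are concatenated, so adding layers
# (depth scaling) also widens the block's output; the transition layer
# must be width-scaled by the same ratio to keep the design's proportions.

def compound_scale(block_layers, block_out_channels, transition_channels,
                   depth_factor):
    scaled_layers = max(1, round(block_layers * depth_factor))
    width_ratio = scaled_layers / block_layers   # width change implied by depth
    scaled_block_out = round(block_out_channels * width_ratio)
    scaled_transition = round(transition_channels * width_ratio)
    return scaled_layers, scaled_block_out, scaled_transition

compound_scale(2, 256, 128, depth_factor=1.5)
# → (3, 384, 192): 3 layers, and both the block output and the
# transition layer widened by the same 1.5x ratio.
```

Scaling depth alone, without this coordinated width adjustment, would leave the block and its transition layer mismatched, which is the failure mode the compound method avoids.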
In their empirical study, the researchers compared the proposed YOLOv7 with state-of-the-art object detectors. YOLOv7 achieved 1.5 percent higher AP than YOLOv4 despite having 75 percent fewer parameters and requiring 36 percent less computation. Trained only on the MS COCO dataset, without any pretrained weights, YOLOv7 beat all other popular detectors (YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B) in the evaluations.
Author: Hecate He | Editor: Michael Sarazen