Object detection is a fundamental task in computer vision, and YOLO (You Only Look Once) one-stage object detectors have set the performance standard since the YOLOv1 debut in 2015. The YOLO series has undergone significant network and structure changes over the years, with the latest version, YOLOX, achieving an optimal balance of speed and accuracy with 50.1 mAP at 68.9 FPS on the NVIDIA Tesla V100 Tensor Core GPU.
Inspired by YOLOX, Baidu researchers have optimized their previous PP-YOLOv2 model to introduce PP-YOLOE, a state-of-the-art industrial object detector that outperforms YOLOv5 and YOLOX in the trade-off between speed and accuracy. The team’s PP-YOLOE-l variant surpasses PP-YOLOv2 by 1.9 percent AP and YOLOX-l by 1.3 percent AP on the COCO dataset.

The PP-YOLOv2 baseline model architecture has three components: 1) a ResNet50-vd backbone with deformable convolution, 2) a PAN (path aggregation network) neck with an SPP layer and DropBlock, and 3) a lightweight IoU-aware head. Similar to YOLOv3, PP-YOLOv2 assigns only one anchor box to each ground truth object. This anchor-based mechanism, however, requires a number of extra hyperparameters and relies heavily on hand-crafted design that may not generalize well when trained on other datasets.

To address this issue, the Baidu researchers introduce an anchor-free method to PP-YOLOv2 that tiles one anchor point on each pixel and sets upper and lower bounds for the detection heads, so that each ground truth is assigned to the corresponding feature map. The centre of each bounding box is then calculated, and the closest pixel is selected as the positive sample. A 4D vector is also predicted for regression. These modifications make the model slightly faster at the cost of a small drop in precision.
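
The assignment rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three strides, the size bounds per detection head, the use of the longer box side as the size measure, and the names Level, Assignment, anchor_point and assign_ground_truth are all hypothetical and are not taken from the PP-YOLOE code.

```python
from typing import NamedTuple, Tuple


class Level(NamedTuple):
    stride: int    # downsampling factor of this feature map
    lower: float   # smallest box size handled by this head (assumed value)
    upper: float   # largest box size handled by this head (assumed value)


class Assignment(NamedTuple):
    stride: int
    row: int
    col: int
    target: Tuple[float, float, float, float]  # 4D regression target (cx, cy, w, h)


# Hypothetical bounds for three detection heads; the real values differ.
LEVELS = [Level(8, 0, 64), Level(16, 64, 128), Level(32, 128, float("inf"))]


def anchor_point(stride: int, row: int, col: int) -> Tuple[float, float]:
    """Image-space location of the anchor point tiled on one feature-map pixel."""
    return ((col + 0.5) * stride, (row + 0.5) * stride)


def assign_ground_truth(box, levels=LEVELS) -> Assignment:
    """Pick a feature map by box size, then the pixel closest to the box centre."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    size = max(w, h)  # assumption: the longer side decides the level
    for level in levels:
        if level.lower <= size < level.upper:
            # The cell containing the centre holds the nearest anchor point.
            col = int(cx // level.stride)
            row = int(cy // level.stride)
            return Assignment(level.stride, row, col, (cx, cy, w, h))
    raise ValueError("box size is outside every level's bounds")


if __name__ == "__main__":
    a = assign_ground_truth((100, 120, 180, 200))
    print(a, anchor_point(a.stride, a.row, a.col))
```

Because exactly one pixel is selected per ground truth, no anchor sizes or aspect ratios need to be tuned; the per-head bounds are the only scale-related settings in this sketch.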

The team gains backbone and neck improvements by using a novel RepResBlock to build a CSPRepResNet backbone with one stem composed of three convolution layers and four subsequent stages stacked by RepResBlock. In each stage, cross-stage partial connections are used to reduce parameters and computational burden. Following PP-YOLOv2, the team also builds a neck with RepResBlock and CSPRepResStage.
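
To see why cross-stage partial connections reduce parameters, the following sketch counts convolution weights for a plain stack of 3 x 3 convolutions and for a generic CSP-style stage. The CSP layout used here (two 1 x 1 convolutions that halve the channels, blocks on one branch only, and a 1 x 1 fusion convolution) is an assumed textbook arrangement, not the exact CSPRepResStage design, and biases and normalization parameters are ignored.

```python
def conv_weights(c_in: int, c_out: int, kernel: int) -> int:
    """Number of weights in one convolution layer (no bias, no normalization)."""
    return c_in * c_out * kernel * kernel


def plain_stage(channels: int, num_blocks: int) -> int:
    """A stack of 3x3 convolutions operating on the full channel width."""
    return num_blocks * conv_weights(channels, channels, 3)


def csp_stage(channels: int, num_blocks: int) -> int:
    """Assumed CSP layout: split into two half-width branches, process one, fuse."""
    half = channels // 2
    split = 2 * conv_weights(channels, half, 1)            # two 1x1 branch convs
    blocks = num_blocks * conv_weights(half, half, 3)      # 3x3 convs on one branch
    fuse = conv_weights(2 * half, channels, 1)             # 1x1 conv after concat
    return split + blocks + fuse


if __name__ == "__main__":
    c, n = 256, 3
    plain, csp = plain_stage(c, n), csp_stage(c, n)
    print(f"plain: {plain:,}  csp: {csp:,}  ratio: {csp / plain:.2f}")
```

In this toy setting the 3 x 3 convolutions run on half the channels, so their cost drops to a quarter, and the added 1 x 1 convolutions are cheap by comparison.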

Width and depth multipliers are employed to jointly scale the basic backbone and neck (as in YOLOv5) and obtain a series of detection networks (s/m/l/x) with different parameter counts and compute costs. The backbone and neck changes boost AP to 49.5 percent, a 0.7 percent improvement. The researchers also use task alignment learning (TAL), proposed in TOOD by Feng et al. in 2021, to replace the previous label assignment strategy, which achieves a 0.9 percent AP improvement and pushes overall AP to 50.4 percent.
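
Joint width and depth scaling can be illustrated as follows. The base channel widths, base block counts, multiplier values and rounding rules below are assumptions chosen for illustration in the style of YOLOv5, and the function names are hypothetical; the values used by the released models may differ.

```python
import math

# Assumed base configuration (corresponds to the "l" variant in this sketch).
BASE_CHANNELS = [64, 128, 256, 512, 1024]
BASE_DEPTHS = [3, 6, 6, 3]

# Assumed (depth_mult, width_mult) pairs for each variant.
VARIANTS = {
    "s": (0.33, 0.50),
    "m": (0.67, 0.75),
    "l": (1.00, 1.00),
    "x": (1.33, 1.25),
}


def scale_channels(channels: int, width_mult: float, divisor: int = 8) -> int:
    """Scale a channel count and round up to a multiple of `divisor`."""
    return max(divisor, math.ceil(channels * width_mult / divisor) * divisor)


def scale_depth(blocks: int, depth_mult: float) -> int:
    """Scale a block count, keeping at least one block per stage."""
    return max(1, round(blocks * depth_mult))


def scale_variant(name: str):
    """Return (channels per stage, blocks per stage) for one variant."""
    depth_mult, width_mult = VARIANTS[name]
    channels = [scale_channels(c, width_mult) for c in BASE_CHANNELS]
    depths = [scale_depth(n, depth_mult) for n in BASE_DEPTHS]
    return channels, depths


if __name__ == "__main__":
    for name in VARIANTS:
        print(name, *scale_variant(name))
```

A single pair of multipliers per variant keeps the family consistent: every stage grows or shrinks by the same factor, so only two numbers distinguish one model size from another.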
Finally, to resolve the task conflict between classification and localization in object detection, the researchers make three changes to the head of conventional task-aligned one-stage object detection (TOOD): they use effective squeeze and extraction (ESE) to replace the layer attention, simplify the alignment of the classification branch to a shortcut, and replace the alignment of the regression branch with a distribution focal loss (DFL) layer. Additional modifications to the resulting Efficient Task-aligned Head (ET-head) further improve performance.
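
The idea behind a DFL layer is that each box side is predicted as a discrete probability distribution over candidate distances rather than as a single number, and the distribution's expected value is used as the regressed distance. The sketch below shows only that decoding step, assuming 17 bins per side, distances measured in units of the feature-map stride, and a (left, top, right, bottom) layout around an anchor point; these choices and the function names are illustrative assumptions, not the ET-head implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_distance(logits: Sequence[float]) -> float:
    """Expected bin index under the predicted distribution."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def decode_box(
    anchor_xy: Tuple[float, float],
    side_logits: Sequence[Sequence[float]],
    stride: int,
) -> Tuple[float, float, float, float]:
    """Turn four per-side distributions into (x1, y1, x2, y2) in image pixels."""
    left, top, right, bottom = (expected_distance(s) * stride for s in side_logits)
    ax, ay = anchor_xy
    return (ax - left, ay - top, ax + right, ay + bottom)


if __name__ == "__main__":
    num_bins = 17  # assumed number of bins per side

    def peaked(index: int) -> List[float]:
        # Logits that put almost all probability mass on one bin.
        return [20.0 if i == index else 0.0 for i in range(num_bins)]

    sides = [peaked(2), peaked(3), peaked(2), peaked(3)]
    print(decode_box((136.0, 168.0), sides, stride=16))
```

Predicting a distribution lets the head express uncertainty about an ambiguous box edge: a flat distribution yields a mid-range estimate, while a sharp peak pins the edge to one distance.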
The researchers trained their models on the MS COCO-2017 training set and compared the proposed PP-YOLOE with state-of-the-art object detectors such as YOLOX, YOLOv5 and EfficientDet.


In the tests, the PP-YOLOE-l variant scored 51.4 percent AP at 640 × 640 resolution with a speed of 78.1 FPS, a 1.9 percent AP improvement and 13.35 percent speedup compared to PP-YOLOv2, and a 1.3 percent AP improvement and 24.96 percent speedup compared to YOLOX-l. The proposed model’s inference speed reached 149.2 FPS with TensorRT and FP16 precision.
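
The reported gains can be turned back into the baseline figures they imply, which is a quick way to check that the numbers are mutually consistent. The short calculation below uses only the values quoted in this article and assumes that "speedup" means the relative increase in FPS and that the AP gains are absolute differences; the helper names are hypothetical.

```python
def implied_baseline_fps(new_fps: float, speedup_percent: float) -> float:
    """Baseline FPS such that new_fps is `speedup_percent` faster than it."""
    return new_fps / (1.0 + speedup_percent / 100.0)


def implied_baseline_ap(new_ap: float, gain: float) -> float:
    """Baseline AP such that new_ap exceeds it by `gain` (absolute difference)."""
    return new_ap - gain


PP_YOLOE_L_AP = 51.4
PP_YOLOE_L_FPS = 78.1

# (AP gain, speedup in percent) reported against each baseline.
REPORTED = {
    "PP-YOLOv2": (1.9, 13.35),
    "YOLOX-l": (1.3, 24.96),
}

if __name__ == "__main__":
    for name, (ap_gain, speedup) in REPORTED.items():
        ap = implied_baseline_ap(PP_YOLOE_L_AP, ap_gain)
        fps = implied_baseline_fps(PP_YOLOE_L_FPS, speedup)
        print(f"{name}: implied AP {ap:.1f}, implied FPS {fps:.1f}")
```

The implied PP-YOLOv2 accuracy matches the 49.5 percent AP mentioned earlier in the article, which suggests the AP gains are indeed absolute differences while the speedups are relative.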
Overall, PP-YOLOE is shown to be a high-performance object detector. The model series can also smoothly transition to deployment thanks to support from the PaddlePaddle deep learning framework. The team hopes their design updates and encouraging results can inspire developers and researchers working in object detection.
The code is available on the project’s GitHub. The paper PP-YOLOE: An Evolved Version of YOLO is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
