Cow? Horse? Camel?
At 320 x 320, YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. It also runs almost four times faster than RetinaNet at similar accuracy, achieving 57.9 AP50 in 51 ms on a Pascal Titan X.
The first generation of YOLO was published on arXiv in June 2015. The model framed object detection as a regression problem, predicting spatially separated bounding boxes and their associated class probabilities. The base YOLO model could process images in real time at 45 frames per second, and Fast YOLO could process 155 frames per second while still outperforming other real-time detectors.
In 2016 Redmon and Farhadi developed YOLO9000, which could detect more than 9,000 object categories using the improved YOLOv2 model. At 67 frames per second, the detector scored 76.8 mAP on the PASCAL Visual Object Classes (VOC) 2007 challenge, beating methods such as Faster R-CNN. The model was also trained to detect object classes that have no labelled detection data.
The new YOLOv3 follows YOLO9000’s methodology and predicts bounding boxes using dimension clusters as anchor boxes. It then predicts an objectness score for each bounding box using logistic regression. The model predicts boxes at three different scales, extracting features from these scales using a concept similar to feature pyramid networks. For feature extraction, Redmon uses a hybrid approach that builds on Darknet-19, the network used in YOLOv2, and on residual networks. The new network, Darknet-53, is significantly larger and has 53 convolutional layers.
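The box prediction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the anchor-relative decoding used in the YOLO9000 and YOLOv3 papers (a sigmoid on the centre offsets, an exponential on the anchor width and height) and output strides of 32, 16 and 8 for the three scales. The function names `decode_box` and `grid_sizes`, the raw output values and the anchor sizes are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used for centre offsets and the objectness score."""
    return 1.0 / (1.0 + math.exp(-x))


def grid_sizes(input_size, strides=(32, 16, 8)):
    """Grid width at each of the three prediction scales (strides are an assumption)."""
    return [input_size // s for s in strides]


def decode_box(raw, cell_xy, anchor_wh, grid_size):
    """Turn one raw prediction into a box in normalised image coordinates.

    raw       -- (tx, ty, tw, th, to): hypothetical raw network outputs
    cell_xy   -- (cx, cy): column and row of the grid cell making the prediction
    anchor_wh -- (pw, ph): anchor (dimension cluster) size, as a fraction of the image
    grid_size -- number of cells along one side of the grid
    """
    tx, ty, tw, th, to = raw
    cx, cy = cell_xy
    pw, ph = anchor_wh
    bx = (sigmoid(tx) + cx) / grid_size  # box centre x, kept inside the cell
    by = (sigmoid(ty) + cy) / grid_size  # box centre y, kept inside the cell
    bw = pw * math.exp(tw)               # anchor width, rescaled
    bh = ph * math.exp(th)               # anchor height, rescaled
    objectness = sigmoid(to)             # logistic objectness score in (0, 1)
    return bx, by, bw, bh, objectness


# A 320 x 320 input gives three grids under the assumed strides.
scales = grid_sizes(320)

# All-zero raw outputs place the box at the centre of its cell with the anchor's size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6),
                 anchor_wh=(0.2, 0.3), grid_size=13)
```

Because the centre offsets pass through a sigmoid, each prediction stays inside the grid cell that produced it, while the anchor only sets a starting width and height that the network rescales.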
When the duo ran YOLOv3 on Microsoft’s COCO dataset, it performed almost on par with RetinaNet and well above SSD variants at an IOU threshold of 0.5, indicating the model’s strength at fitting boxes to objects. However, as the IOU threshold increases, the model struggles to align boxes perfectly with objects. Redmon and Farhadi say the model does not do as well on COCO’s AP metric averaged over IOU thresholds from 0.5 to 0.95, but performs very well at the single threshold of 0.5 IOU. Relative to other detectors, it also performs better on small objects than on medium and large ones.
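To see why a detector can look strong at 0.5 IOU and weaker on the averaged metric, consider a single predicted box. The sketch below is illustrative only: the helper names `iou` and `thresholds_passed` and the two boxes are made up for this example, and it counts threshold hits for one box rather than computing a full AP score.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred, truth, thresholds):
    """Return the IOU thresholds at which pred counts as a correct detection."""
    overlap = iou(pred, truth)
    return [t for t in thresholds if overlap >= t]


# Thresholds 0.50, 0.55, ..., 0.95, the range the averaged metric covers.
coco_thresholds = [0.5 + 0.05 * i for i in range(10)]

truth = (0.0, 0.0, 10.0, 10.0)   # hypothetical ground-truth box
pred = (2.0, 0.0, 12.0, 10.0)    # same size, shifted 2 units to the right

overlap = iou(pred, truth)                               # 80 / 120, about 0.67
passed = thresholds_passed(pred, truth, coco_thresholds)
```

A box that is shifted by a fifth of its width still counts as correct at 0.5 IOU, but it fails most of the stricter thresholds, so small localisation errors pull the averaged score down much more than they affect AP50.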
On a side note, it’s worth mentioning that Redmon and Farhadi’s paper is not only a step forward in object detection, it’s also peppered with humour. Andrej Karpathy retweeted that the paper “reads like good stand up comedy.”
Ali Farhadi is an Associate Professor of Computer Science and Engineering at the University of Washington. He also leads Project Plato, which uses computer vision to extract visual knowledge, at the Allen Institute for Artificial Intelligence. His student Joseph Redmon is the YOLO paper’s first author. Redmon’s personal website is called Survival Strategies for the Robot Rebellion.
Journalist: Meghan Han | Editor: Michael Sarazen