Deep Learning dominates Computer Vision, and we are all riding this wave. Since AlexNet swept ILSVRC 2012, beating the runner-up by roughly ten percentage points, Deep Learning has never stopped outperforming on image- and video-related tasks.
After years of development, current algorithms are fairly advanced: elaborate models, fancy learning paradigms, and clever combinations with ideas from traditional methods. However, algorithms need high-quality data; everybody knows Deep Learning hardly succeeds with insufficient or poor data.
Where do those data come from? ImageNet is the first choice for many engineers and researchers, but it took almost three years to build and many follow-up efforts to keep it updated. Moreover, it mainly supports object recognition problems; if you want to perform image segmentation, there is little chance of finding a suitable high-quality dataset. I have seen many research groups, and even companies, publish their data-labeling requests on Amazon Mechanical Turk, which provides an on-demand, scalable workforce. Sadly, human-labeled data often costs much more time (and money) than machine labeling, and the error rate is still quite high. It is therefore important to give that workforce a tool that automatically generates a rough segmentation and lets humans adjust it easily and precisely.
And this is exactly what Polygon-RNN does.
This work creatively formulates the object annotation problem as polygon prediction rather than traditional pixel labeling. Obtaining data quickly is critical now that dataset size has become the bottleneck of Deep Learning, and this work gives researchers a flexible annotation method for quickly obtaining high-quality ground-truth labels.
In this work, the authors call their method semi-automatic annotation of object instances. Roughly speaking, the user first provides a bounding box around the target object, and Polygon-RNN automatically generates a first draft of the segmentation; if the user is not satisfied with it, he or she can adjust the outline at any time. What makes the adjustment easy is exactly the idea of the polygon: vertices let users quickly locate the incorrect part and correct it.
Their model could be divided into two parts:
- a modified VGG architecture for capturing semantic information
- a two-layer convolutional LSTM for generating the vertices of the polygon
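To make the second component concrete, here is a minimal NumPy sketch of a single ConvLSTM cell, where every gate is computed by a convolution over the concatenated input and hidden feature maps. This is a generic illustration, not the authors' implementation; the naive convolution, weight initialization, and channel sizes are all assumptions for clarity.

```python
import numpy as np

def conv_same(x, w):
    """Naive 'same'-padded 2D convolution.
    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Minimal ConvLSTM cell: the four gates (i, f, o, g) are slices of
    one convolution over [x; h], so spatial structure is preserved."""
    def __init__(self, c_in, c_hidden, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.1, (4 * c_hidden, c_in + c_hidden, k, k))
        self.c_hidden = c_hidden

    def step(self, x, h, c):
        z = conv_same(np.concatenate([x, h], axis=0), self.w)
        ch = self.c_hidden
        i = sigmoid(z[0 * ch:1 * ch])   # input gate
        f = sigmoid(z[1 * ch:2 * ch])   # forget gate
        o = sigmoid(z[2 * ch:3 * ch])   # output gate
        g = np.tanh(z[3 * ch:4 * ch])   # candidate cell state
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new
```

Because the gates are convolutions rather than dense layers, the parameter count depends only on the kernel size and channel counts, not on the spatial resolution of the feature maps, which is the efficiency argument discussed below.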
The model is trained end to end, which lets the CNN learn to adjust its features accordingly. In particular, a Convolutional LSTM is adopted for two reasons: (1) it receives 2D feature maps from the VGG backbone, so a ConvLSTM operating in 2D is a natural match; (2) the nature of convolution keeps the number of parameters as small as possible. The novelty is that vertex prediction is formulated as a classification problem, taking as input the image feature representation learned by the CNN, the two vertices immediately preceding the current prediction, and one-hot encodings of previously predicted vertices.
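The "vertex prediction as classification" idea can be sketched as follows: each vertex is one class in a flattened spatial grid, with one extra class serving as the end-of-polygon token. The grid size of 28 here is illustrative (the paper predicts on a downsampled grid); the function names are mine, not the authors'.

```python
import numpy as np

GRID = 28  # illustrative grid resolution for the classification output

def vertex_to_onehot(x, y, grid=GRID):
    """Encode a vertex (x, y) as a one-hot vector over grid*grid cells
    plus one extra class for the end-of-polygon token."""
    v = np.zeros(grid * grid + 1)
    v[y * grid + x] = 1.0
    return v

def end_token(grid=GRID):
    """One-hot vector for the 'polygon is closed' class."""
    v = np.zeros(grid * grid + 1)
    v[-1] = 1.0
    return v

def decode(logits, grid=GRID):
    """Argmax over the classification output.
    Returns (x, y) grid coordinates, or None for the end token."""
    idx = int(np.argmax(logits))
    if idx == grid * grid:
        return None
    return idx % grid, idx // grid
```

Framing the output this way means a standard cross-entropy loss can be used for training, and decoding a polygon is just repeated argmax until the end token appears.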
In prediction mode, the model generates a structurally coherent polygon around the target object without any user corrections; it only needs a proper bounding box of the target instance. Giving a bounding box is always easier and less error-prone than drawing an outline, right? In the following annotation mode, users closely examine whether the predicted polygon is precise enough. If not, they can easily adjust the vertices, and the model re-predicts new vertices from the original polygon and the users' inputs until the result meets the desired quality. By repeating this process, often with just a few clicks, a high-quality annotation is beautifully done.
The evaluation was done on the Cityscapes dataset, reporting the standard IoU measure.
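For readers unfamiliar with the metric, IoU (intersection over union) between a predicted mask and a ground-truth mask is straightforward to compute; a minimal version:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union between two binary masks of equal shape."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(a, b).sum() / union
```

A rasterized polygon can be scored against the ground-truth segmentation mask with exactly this quantity.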
This paper received an honorable mention for the Best Paper Award at CVPR 2017. During the conference, we saw that many new lines of research run into the problem of having no suitable (or sufficient) training data. Data has become the bottleneck of advanced algorithms, and it is inevitable that researchers will build larger general-purpose datasets. The quality of the data matters just as much, since low-quality data can severely lower performance even when the model itself is good. This paper contributes not only a useful tool but also a new concept: modeling a complex annotation as a simple polygon. We think better methods for acquiring data are on the way, and this work is definitely worth scrutinizing.