A team of researchers from Google Brain in Zürich and DeepMind London believe one of the world's most popular image databases may need a makeover. ImageNet is an unparalleled computer vision reference point with more than 14 million labelled images. It was designed for visual object recognition research and is organized according to the WordNet hierarchy. Each node of the hierarchy is depicted by hundreds to thousands of images, with a current average of over 500 images per node.
In a paper published last year, the Google Brain Zürich team proposed Big Transfer (BiT), whose largest variant, BiT-L, became a SOTA ImageNet model. Looking at what were considered "mistakes" made by BiT-L, Google Brain researcher Lucas Beyer suggested most of these could in fact be label noise rather than genuine model errors.
To quantify this idea, Beyer and his Google Brain colleagues joined DeepMind researchers in a recent study to determine “whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure.”
The idea of building up ImageNet was first proposed by Stanford professor and renowned AI researcher Fei-Fei Li in 2006. According to Wired, Li began working on the idea at a time when most AI research focused on models and algorithms, and she wanted to expand and improve the data available to train AI algorithms.
Li assembled a team of researchers for the ImageNet project and they introduced the database as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida.
In 2010, the ImageNet project launched an annual challenge — the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) — that saw software programs compete to correctly classify and detect objects and scenes. ImageNet as the ILSVRC image classification benchmark has since become a key testbed for research in artificial perception.
The scale and difficulty of ImageNet helped push some landmark achievements in machine learning, such as the breakthrough AlexNet in 2012 and 2015’s ResNet. More importantly, success on ImageNet has proven to generalize well — the SOTA techniques on ImageNet have often found success across other tasks and domains.
The ImageNet project has traditionally crowdsourced its annotation process via Amazon Mechanical Turk to help with the classification of images. The Google and DeepMind researchers propose a more robust procedure for collecting human annotations to curate an ImageNet validation set using their newly generated Reassessed Labels (ReaL) to reevaluate the accuracy of recently proposed ImageNet classifiers.
“I did expect us to find some problems with ImageNet labels,” Beyer told Synced in an email. “However, I had no idea what the likely conclusion would be, which made this project especially interesting.”
The researchers compared ReaL with the original ImageNet labels and found the original labels were no longer the best predictors of their newly collected annotations. Their experiments indicate the ReaL procedure corrected more than half of the original ImageNet labelling mistakes, which they say implies the ReaL labels provide a superior estimate of model accuracy.
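The key difference between the two metrics is that the original benchmark scores a prediction against a single label, while ReaL counts a prediction as correct if it matches any of the reassessed labels for that image. The sketch below illustrates this comparison; the class indices and label sets are invented for illustration and are not drawn from the paper.

```python
# Minimal sketch of original top-1 accuracy vs a ReaL-style multi-label
# accuracy. All class indices below are hypothetical examples.

def top1_accuracy(predictions, original_labels):
    """Original ImageNet metric: exact match against the single label."""
    correct = sum(p == y for p, y in zip(predictions, original_labels))
    return correct / len(predictions)

def real_accuracy(predictions, real_label_sets):
    """ReaL-style metric: a prediction counts as correct if it appears
    in the set of reassessed labels for that image. Images whose
    reassessed label set is empty are excluded from the score."""
    scored = [(p, s) for p, s in zip(predictions, real_label_sets) if s]
    correct = sum(p in s for p, s in scored)
    return correct / len(scored)

# Toy data: image 2 contains two valid objects (classes 3 and 7),
# but the original annotation recorded only class 3.
preds = [3, 7, 1]
orig = [3, 3, 1]            # single original labels
real = [{3}, {3, 7}, {1}]   # reassessed multi-label sets

print(top1_accuracy(preds, orig))  # penalizes the valid prediction 7
print(real_accuracy(preds, real))  # accepts any label in the set
```

Under the original metric the model is "wrong" on the second image even though its prediction describes a real object in the scene; the ReaL-style metric removes that penalty.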
Analyzing discrepancies between ImageNet and ReaL accuracy, they found that some of the “progress” on the original metric was due to overfitting to the idiosyncrasies of the labelling pipeline. They also identified a number of the ImageNet labelling pipeline’s limitations in their paper.
Firstly, real-world images often contain multiple objects of interest. ImageNet annotations, however, are limited to assigning a single label to each image, which can lead to a gross underrepresentation of image content. ImageNet classes also contain a handful of essentially duplicate pairs, which draw arbitrary distinctions between semantically and visually indistinguishable groups of images.
In addition, the ImageNet annotation pipeline queries the Internet for images of a given class, then asks human annotators whether that class is indeed present in the image. While this procedure yields reasonable descriptions, it can also produce overly restrictive label proposals: considered in isolation, a proposed label may be a plausible description of an image, yet other ImageNet classes may describe it more accurately.
Motivated to re-annotate the ImageNet validation set to capture the diversity of image content in real-world scenes, the researchers designed a labelling procedure that allows human annotators to consider and contrast a wide variety of potential labels so as to select the most accurate description(s) while keeping the number of proposals sufficiently small to enable robust annotations.
The researchers also proposed two modifications to the canonical supervised training setup aimed at addressing the identified shortcomings of ImageNet labels. They say that taken together, these modifications yield relatively large empirical gains and indicate that label noise could have been a limiting factor for longer training schedules.
Based on the overall findings, Beyer concludes that although the original set of labels may be nearing the end of their useful life, “with our new labels, we can likely still use ImageNet for a few more years.” Also, ImageNet’s role as a useful benchmark can now be more effectively monitored by using both the new and original labels.
Beyer adds that the team’s investigations in the paper concern only the test function of ImageNet when used to evaluate models, so the community will still be able to use ImageNet for model (pre-)training purposes even if it’s no longer an accurate benchmark.
The paper Are We Done With ImageNet? is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen