Image classification is the task of assigning a semantic label from a predefined set of classes to an image. One of the open questions in computer vision (CV) is whether automatic image classification can be achieved without the use of ground-truth annotations.
In a recent paper, researchers from Katholieke Universiteit Leuven in Belgium and ETH Zürich propose a two-step approach for unsupervised classification. Experimental evaluation shows the method outperforming prior work by large margins across multiple datasets, according to the researchers.
Automatic image classification without labels echoes a shift of focus in the CV research community from supervised learning methods based on convolutional neural networks to new self-supervised and unsupervised methods. Recent approaches have also tried to deal with a lack of labels by using end-to-end learning pipelines that combine feature learning with clustering.
The researchers propose a two-step method that decouples feature learning and clustering to leverage the advantages of both representation and end-to-end learning approaches while also addressing the shortcomings of each.
The model first learns feature representations through a pretext task; the nearest neighbours of each image are then mined based on feature similarity. Based on their empirical finding that these nearest neighbours tend to belong to the same semantic class in most cases, the researchers show that neighbours mined from a pretext task can be used as a prior for semantic clustering.
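The neighbour-mining step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `mine_nearest_neighbours` is hypothetical, and cosine similarity over L2-normalised features is assumed as the similarity measure.

```python
import numpy as np

def mine_nearest_neighbours(features, k=5):
    """Return the indices of each sample's k nearest neighbours by
    cosine similarity (hypothetical helper; in the paper, neighbours
    are mined in the embedding space of the pretext-task model)."""
    # L2-normalise rows so the dot product equals cosine similarity
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude each sample itself
    # sort by descending similarity and keep the top-k indices
    return np.argsort(-sims, axis=1)[:, :k]
```

If the pretext features are semantically meaningful, the returned neighbours of an image will mostly share its (unknown) class, which is what makes them usable as a clustering prior.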
The second step integrates the semantically meaningful nearest neighbours as a prior into a learnable approach. A loss function maximizes the dot product between the softmax predictions of each image and its mined neighbours, pushing the network to produce predictions that are both consistent and discriminative, so that each image and its neighbours are classified together.
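A rough sketch of such a clustering objective is shown below, assuming two terms: a consistency term that maximizes the dot product between an image's softmax prediction and its neighbour's, and an entropy term that spreads predictions over clusters to avoid collapse. The function name and the entropy weight are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scan_style_loss(logits, neighbour_logits, entropy_weight=5.0):
    """Sketch of a neighbour-consistency clustering loss (assumed form):
    maximise the dot product of paired softmax predictions, while an
    entropy term keeps the mean prediction spread over all clusters."""
    p = softmax(logits)
    q = softmax(neighbour_logits)
    # consistency: -log of the dot product between prediction pairs
    consistency = -np.log((p * q).sum(axis=1) + 1e-8).mean()
    # entropy of the mean prediction, maximised to avoid one-cluster collapse
    mean_p = p.mean(axis=0)
    entropy = -(mean_p * np.log(mean_p + 1e-8)).sum()
    return consistency - entropy_weight * entropy
```

The dot product is largest when both predictions are confident and agree on the same cluster, which is why minimizing the consistency term yields predictions that are simultaneously consistent and discriminative.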
Unlike in end-to-end approaches, these learned clusters depend more on meaningful features than on network architecture. This helps prevent the clustering process from latching onto low-level features such as colour at the beginning of training, the researchers explain.
Experimental evaluations were performed on CIFAR10, CIFAR100-20, STL10, and ImageNet, with results compared to state-of-the-art (SOTA) methods on the first three benchmarks using clustering accuracy, normalized mutual information, and adjusted rand index. The proposed method outperforms prior work on all three metrics, achieving a 26.9 percent increase on CIFAR10 and a 21.5 percent increase on CIFAR100-20 in terms of accuracy.
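Because the model's cluster ids carry no inherent class names, clustering accuracy is computed under the best one-to-one mapping between clusters and ground-truth classes. A small sketch of that metric, using brute-force permutation matching (the helper name is hypothetical; practical evaluations use Hungarian matching instead, which scales to more classes):

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(true_labels, cluster_ids, n_classes):
    """Accuracy under the best one-to-one mapping of cluster ids to
    classes. Brute force over permutations; fine for small n_classes."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    best = 0.0
    for perm in permutations(range(n_classes)):
        # relabel each cluster id according to this candidate mapping
        mapped = np.array(perm)[cluster_ids]
        best = max(best, float((mapped == true_labels).mean()))
    return best
```

For example, a clustering that simply swaps two label names still scores 100 percent, since the metric is invariant to how clusters are numbered.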
Moreover, the encouraging results on ImageNet demonstrate that semantic clustering can be applied to large-scale datasets, supporting the researchers' view that decoupling the learning of semantically meaningful features from clustering outperforms recent end-to-end approaches.
The paper Learning To Classify Images Without Labels is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen