The success of deep learning (DL) in real-world applications has inspired many researchers to apply DL to computer vision tasks such as object recognition and 3D reconstruction. Humans know that perspective is critical to understanding our 3D world, and in computer vision, accurate estimation of an object's viewpoint (azimuth, elevation, tilt angles, etc.) is likewise a key link between 2D images and their 3D geometric understanding.
Current approaches for estimating viewpoints using neural networks require large amounts of annotated training data, and large-scale human-annotated datasets are difficult to obtain given the complexity of viewpoint annotation. As a result, DL techniques for 3D reconstruction are still in an early phase, although they show immense potential.
A related and exciting research area is self-supervised learning, which can take advantage of the vast amount of image data on the Internet. Unlike supervised learning, which requires annotators to manually label the data, self-supervised learning automatically generates labels by extracting weak annotation information from the input data and predicting the rest.
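To make the idea of automatically generated labels concrete, here is a minimal sketch of a generic self-supervised pretext task: each unlabelled image is rotated by a random multiple of 90 degrees, and the rotation index becomes the label a network would learn to predict. This is a well-known illustrative pretext task, not the method used by the NVIDIA and Heidelberg team; the function names `rotate90` and `make_rotation_examples` are hypothetical, and images are assumed to be plain 2D lists.

```python
import random


def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_examples(images, seed=0):
    """Build (rotated_image, label) pairs from unlabelled images.

    The label is the number of 90-degree clockwise turns applied, so it
    comes from the data transformation itself rather than from a human.
    """
    rng = random.Random(seed)
    examples = []
    for image in images:
        k = rng.randrange(4)  # 0, 1, 2 or 3 quarter turns
        rotated = [list(row) for row in image]
        for _ in range(k):
            rotated = rotate90(rotated)
        examples.append((rotated, k))
    return examples


unlabelled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
training_pairs = make_rotation_examples(unlabelled)
```

Each pair in `training_pairs` can be fed to a classifier exactly like a human-labelled example, which is what lets such methods scale to large unlabelled collections.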
A team of researchers from NVIDIA and Heidelberg University recently introduced an open-source self-supervised learning technique for viewpoint estimation of general objects that draws on such freely available Internet images: “We seek to answer the research question of whether such unlabelled collections of in-the-wild images can be successfully utilized to train viewpoint estimation networks for general object categories purely via self-supervision.”
The researchers proposed a novel analysis-by-synthesis framework that leverages a viewpoint-aware image synthesis network to train the viewpoint estimation network. They coupled the viewpoint estimation network (analysis) with a viewpoint-aware generative network (synthesis) to form a cycle in which both are trained together. The team used generative consistency, symmetry, and discriminator losses to supervise the viewpoint network. Inspired by the popular analysis-by-synthesis learning paradigm, generative consistency forms the core of the self-supervised constraints used to train the viewpoint network: a synthesis function models the image generation process, while an analysis function infers the parameters that best explain how the input image was formed.
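The following toy sketch illustrates how generative consistency and symmetry can act as label-free training signals. It is not the authors' implementation: the networks are replaced by closed-form functions, the viewpoint is reduced to a single azimuth angle, the "image" is a ring of eight intensities, and the discriminator loss is omitted. All names (`synthesize`, `analyze`, `bias`, and so on) are hypothetical, and `bias` stands in for the error of an imperfect estimator.

```python
import math

N_PIXELS = 8
# Pixel directions spaced evenly around a circle and symmetric about zero,
# so reversing the pixel order mirrors the toy image.
PIXEL_ANGLES = [-math.pi + 2 * math.pi * (i + 0.5) / N_PIXELS
                for i in range(N_PIXELS)]


def wrap(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def synthesize(azimuth):
    """Synthesis (toy renderer): map an azimuth to a 1D 'image'."""
    return [math.cos(phi - azimuth) for phi in PIXEL_ANGLES]


def analyze(image, bias=0.0):
    """Analysis (toy estimator): recover the azimuth from an image.

    A non-zero `bias` simulates an imperfect viewpoint estimator.
    """
    s = sum(p * math.sin(phi) for p, phi in zip(image, PIXEL_ANGLES))
    c = sum(p * math.cos(phi) for p, phi in zip(image, PIXEL_ANGLES))
    return wrap(math.atan2(s, c) + bias)


def flip(image):
    """Mirror the toy image, which corresponds to negating the azimuth."""
    return image[::-1]


def generative_consistency_loss(sampled_azimuths, bias=0.0):
    """Synthesize from known viewpoints, re-estimate them, penalize the gap."""
    errors = [wrap(analyze(synthesize(a), bias) - a) ** 2
              for a in sampled_azimuths]
    return sum(errors) / len(errors)


def symmetry_loss(images, bias=0.0):
    """An image and its mirror should yield opposite azimuth estimates."""
    errors = [wrap(analyze(flip(img), bias) + analyze(img, bias)) ** 2
              for img in images]
    return sum(errors) / len(errors)


# Unlabelled "in-the-wild" images: their true azimuths are never read below.
wild_images = [synthesize(a) for a in (0.3, -1.2, 2.0)]

for bias in (0.0, 0.1):
    gc = generative_consistency_loss([0.5, -0.7, 1.9], bias)
    sym = symmetry_loss(wild_images, bias)
    print(f"bias={bias}: consistency={gc:.4f}, symmetry={sym:.4f}")
```

In this toy setting both losses vanish for the unbiased estimator and grow once a bias is introduced, even though no viewpoint labels for the "wild" images are used. In a real system the two functions would be neural networks trained jointly by gradient descent on such losses.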
The team says theirs is the first self-supervised viewpoint learning framework they know of that learns the 3D viewpoint of general objects from in-the-wild images. They evaluated their approach on a human head pose estimation task, where the results showed the effectiveness of the framework’s self-supervised constraints, namely generative image consistency and symmetry.
More importantly, the system’s viewpoint estimation on practical object classes such as cars, buses, and trains from the challenging Pascal 3D+ dataset demonstrated accuracy similar to that of fully-supervised approaches.
The researchers hope their study can serve as a strong baseline for further research in self-supervised viewpoint learning. Such work could help unlock the huge potential of unlabelled, in-the-wild images for training viewpoint estimation networks and more.
The paper Self-Supervised Viewpoint Learning From Image Collections is on arXiv, and the open-source code is available at GitHub.
Journalist: Fangyu Cai | Editor: Michael Sarazen