In an ancient Indian parable, six blind men encounter an elephant but struggle to discern what it is. One taps the beast’s side, guessing it could be a wall. Gripping a leg, another imagines a tree trunk, while a third touches the pointy tusks and fears a threatening spear. A similar challenge persists in contemporary computer vision research, where combining data from different viewpoints to reconstruct a 3D structure is known as Structure from Motion (SfM).
The new paper Image Matching across Wide Baselines: From Paper to Practice proposes a comprehensive benchmark for such systems’ local features and robust estimation algorithms, based on 3D reconstruction accuracy.
Many existing studies use images captured under controlled conditions that do not reflect complex real-world conditions. Because vast amounts of unstructured images are freely available on the internet, researchers have sought ways to use them to expand the diversity of sites captured under different viewpoints, lighting, and other conditions.
Google Research collaborated with a team of researchers from UVIC, CTU, and EPFL on the new benchmark for wide-baseline image matching, which includes a 30k image dataset with depth maps and accurate pose information. The entire project is open source.
The researchers propose the benchmark as a way to evaluate the performance of existing methods and the shortcomings of current local feature algorithms for SfM. Until now, such comparisons had been hindered by the lack of ground truth data.
Typically, reconstructing 3D scenes starts with identifying image sections that depict the same physical points in a scene, for example the distinctive corners of Rome’s Trevi Fountain. Salient local image features that can be reliably identified across different views enable researchers to obtain likely correspondences between pixel coordinates across two or more images and reconstruct an object’s 3D form. A Google AI post explains that “doing this over many images and points allows one to obtain very detailed reconstructions.”
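The core idea of matching local features across views can be sketched in a few lines. The toy function below (an illustrative sketch, not code from the benchmark) performs nearest-neighbour descriptor matching with Lowe's classic ratio test, which keeps a match only when its best candidate is clearly better than the runner-up:

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    desc1, desc2: (N, D) arrays of local feature descriptors,
    one row per keypoint. Returns (i, j) index pairs deemed
    likely correspondences between the two images.
    """
    # Pairwise Euclidean distances between all descriptor pairs.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    matches = []
    for i in range(d.shape[0]):
        order = np.argsort(d[i])
        best, second = order[0], order[1]
        # Ratio test: accept only if the best distance is clearly
        # smaller than the second best, rejecting ambiguous matches.
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, best))
    return matches
```

In practice, pipelines such as the one benchmarked here use learned or hand-crafted descriptors (e.g. SIFT) and far faster approximate nearest-neighbour search, but the filtering logic is the same.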
It’s also important to create a strong dataset with images captured under various imaging conditions and on a variety of devices. This enables researchers to build robust models that perform well across a wide range of situations. Photo-tourism images of famous landmarks meet this requirement well, reflecting complex reality through the vast number of publicly available images with different viewpoints, lighting, occlusions, etc.
The team sourced photo-tourism images from the public Yahoo Flickr Creative Commons 100M Dataset and created The Phototourism Dataset, comprising images augmented with depth maps and accurate ground truth pose information including location and orientation.
The benchmark pipeline is modular, incorporating classical and state-of-the-art methods for feature extraction, image matching and pose estimation. The pipeline takes images of a scene as input, extracts features, then computes matches for all image pairs. The matches can then be processed by an outlier pre-filtering module before they are fed to downstream tasks for evaluation.
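The modular flow described above can be summarized as a skeleton: extract features per image, match every image pair, optionally pre-filter outliers, and hand the results to downstream evaluation. The sketch below is purely illustrative (the actual implementation lives in the open-source benchmark); the `extract`, `match`, and `prefilter` callables are hypothetical stand-ins for whichever methods are plugged into the pipeline:

```python
from itertools import combinations

def run_pipeline(images, extract, match, prefilter=None):
    """Illustrative skeleton of a modular matching pipeline.

    extract:   image -> local features (keypoints + descriptors)
    match:     (features_a, features_b) -> candidate correspondences
    prefilter: optional outlier pre-filtering applied to the matches
    """
    feats = [extract(img) for img in images]          # per-image features
    pairwise = {}
    for a, b in combinations(range(len(images)), 2):  # all image pairs
        m = match(feats[a], feats[b])
        if prefilter is not None:
            m = prefilter(m)                          # drop likely outliers
        pairwise[(a, b)] = m
    return pairwise  # fed to downstream tasks (e.g. pose estimation)
```

Because each stage is a swappable callable, classical and learned methods can be compared under identical downstream conditions, which is the point of the benchmark's modular design.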
The researchers note that many new methods for SfM sub-problems are studied only in isolation and with intermediate metrics. Although this simplifies evaluation, it also limits insight into how each component contributes to the final application. Instead of intermediate performance metrics, the researchers chose downstream reconstructed camera pose accuracy as the primary metric: “this is particularly crucial now, with deep networks seemingly outperforming algorithmic solutions on classical problems.”
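One standard way to score camera pose accuracy against ground truth is the angular error between the estimated and reference rotations. The snippet below is a common formulation offered as an illustration, not the benchmark's exact scoring code:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angular difference in degrees between two 3x3 rotation matrices.

    The trace of the relative rotation R_est^T @ R_gt encodes the
    rotation angle via trace(R) = 1 + 2*cos(angle).
    """
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against numeric drift
    return np.degrees(np.arccos(cos_angle))
```

A pose-accuracy metric then typically reports the fraction of image pairs whose rotation (and translation) error falls below a set of angular thresholds.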
The team will also hold the 2020 Image Matching Challenge at CVPR 2020.
The paper Image Matching across Wide Baselines: From Paper to Practice is on arXiv.
Author: Fangyu Cai | Editor: Michael Sarazen