How machines “see” the physical world is rapidly evolving. The advent of self-driving technologies in particular has kickstarted explorations into real-time perception and scene understanding. Although many industrial solutions currently rely on powerful sensors such as lidar and precision GPS to gather rich input data, it is worth asking what a machine can do when given only a monocular image.
Recently, researchers from the Robotics Research Center at IIIT Hyderabad, IIT Kharagpur, Mila, and Université de Montréal addressed this challenge with MonoLayout, a practical deep neural architecture that takes just a single image of a road scene as input and outputs an amodal scene layout — one that estimates all regions of the scene, even those occluded by other objects — in bird’s-eye view.
The researchers say MonoLayout is the first approach to amodally reason about both static and dynamic objects in a scene. Given an image of a vehicle that is, for example, partially blocked by pedestrians, MonoLayout can infer the shape of the occluded parts of the vehicle. It does this by reasoning about the underlying geometry of objects such as humans in the street, as well as the geometry of the image elements they obscure.
MonoLayout captures both static scene content such as roads and dynamic content such as vehicles and pedestrians using a shared context, which enables it to outperform methods trained for each task separately.
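The shared-context idea can be illustrated with a minimal sketch: a single encoder produces one context representation, and two task-specific decoders — one for the static layout, one for the dynamic layout — read that same context. All dimensions, weights, and function names below are hypothetical placeholders for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for this sketch.
FEAT = 128   # size of the shared context vector
GRID = 8     # side length of the bird's-eye-view occupancy grid

# One shared encoder weight matrix, two task-specific decoder matrices.
W_enc = rng.standard_normal((FEAT, FEAT)) * 0.01
W_static = rng.standard_normal((FEAT, GRID * GRID)) * 0.01   # roads, sidewalks
W_dynamic = rng.standard_normal((FEAT, GRID * GRID)) * 0.01  # vehicles, pedestrians

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_layouts(image_features):
    """Return (static, dynamic) occupancy grids in bird's-eye view."""
    # Both decoders consume the SAME context vector — the shared context.
    context = np.tanh(image_features @ W_enc)
    static = sigmoid(context @ W_static).reshape(GRID, GRID)
    dynamic = sigmoid(context @ W_dynamic).reshape(GRID, GRID)
    return static, dynamic

static, dynamic = predict_layouts(rng.standard_normal(FEAT))
```

Because both decoders are trained against the same encoder, supervision for one task shapes features that benefit the other — the intuition behind the shared context.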
The researchers evaluated MonoLayout’s performance against existing state-of-the-art (SOTA) methods on the task of amodal scene layout estimation. MonoLayout topped all evaluation benchmarks on several subsets of the KITTI (vision benchmark) and Argoverse (3D tracking) datasets. It also achieved SOTA performance on object detection in bird’s-eye view without any thresholding (the simplest image segmentation method) or post-processing.
The researchers also demonstrated that adversarial learning can improve layout estimation performance, especially when large parts of the scene are occluded or missing, and that MonoLayout can be effectively trained on datasets that do not include lidar (laser) scans.
The researchers not only presented successful test cases but also documented experiments that did not proceed as expected, in the hope that such records may save time for other researchers and accelerate progress in this fast-moving field.
The paper MonoLayout: Amodal Scene Layout from A Single Image is on arXiv.
Author: Herin Zhao | Editor: Michael Sarazen