In a recent paper, a team of researchers from Stanford University and MIT introduce a novel Physical Scene Graphs (PSG) approach designed to obtain a better structured understanding of visual scenes.
In the vibrant field of computer vision (CV), convolutional neural networks (CNNs) have excelled at learning representations for visual object classification. However, their lack of physical understanding — such as predicting object dynamics — has limited CNN performance on many other computer vision tasks.
To overcome these limitations the team proposed PSGNet, a self-supervised neural network architecture that learns to estimate PSGs from visual inputs. The neural network takes in RGB movies of any length and outputs RGB reconstructions of the same length along with estimates of depth and normals maps for each frame, the (self-supervised) object segment map for each frame, and the (self-supervised) map of the estimated next-frame RGB deltas at each spatial location in each frame, explains the paper Learning Physical Graph Representations from Visual Scenes.
The researchers propose that achieving human-level visual scene understanding in CV models will be difficult if CNNs don’t explicitly represent scene structures. Such structures can include discrete objects, part-whole relationships, or physical properties. “A hierarchical graph is natural for representing such structure,” Daniel Bear, first author of the paper and a Postdoctoral Research Fellow at Stanford University, tweeted. “Nodes correspond to discrete scene elements (objects and parts), edges to physical linkage (grouping parts together), and node data to visual or physical properties. We call this a “Physical Scene Graph” (PSG).“
PSGNet uses a three-stage process to build PSGs: feature extraction, graph construction, and graph rendering. In the first stage, input video passes through a convolutional recurrent neural network (ConvRNN) so that high- and low-level visual information can be efficiently combined. The ConvRNN features are then used as the base tensor for the next stage, where a spatiotemporal PSG is constructed. In this stage, PSGNets can “address two problems that ConvNets aren’t cut out for. (A) Grouping scene elements together w/o supervision and (B) aggregating information over spatial regions that vary from scene to scene,” Bear tweeted.
The team employs a pair of learnable modules, Graph Pooling and Graph Vectorization, to tackle these challenges. In the last stage, the PSGs pass through a decoder that renders them back into feature maps.
In their experiments, the team says PSGNet substantially outperformed alternative unsupervised approaches to scene description at segmentation tasks, especially for real-world images. In the future, the researchers intend to look at applying PSGs for physical tasks that ConvNets are currently unable to perform well, such as predicting the dynamics of colliding objects from visual input.
The paper Learning Physical Graph Representations from Visual Scenes is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen
We know you don’t want to miss any story. Subscribe to our popular Synced Global AI Weekly to get weekly AI updates.