
‘Beyond the ConvNet’: Stanford & MIT Neural Network Learns Physical Graph Representations from Video

Researchers introduce a novel Physical Scene Graphs (PSG) approach designed to obtain a better-structured understanding of visual scenes.

In a recent paper, a team of researchers from Stanford University and MIT introduces a novel Physical Scene Graphs (PSG) approach designed to obtain a better-structured understanding of visual scenes.

In the vibrant field of computer vision (CV), convolutional neural networks (CNNs) have excelled at learning representations for visual object classification. However, their lack of physical understanding — such as predicting object dynamics — has limited CNN performance on many other computer vision tasks.

To overcome these limitations, the team proposed PSGNet, a self-supervised neural network architecture that learns to estimate PSGs from visual inputs. “The neural network takes in RGB movies of any length and outputs RGB reconstructions of the same length along with estimates of depth and normals maps for each frame, the (self-supervised) object segment map for each frame, and the (self-supervised) map of the estimated next-frame RGB deltas at each spatial location in each frame,” explains the paper Learning Physical Graph Representations from Visual Scenes.
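To make that input-output description concrete, here is a rough Python sketch of the interface, with all function names and tensor shapes assumed for illustration rather than taken from the released PSGNet code:

```python
# A rough sketch (not the authors' code) of the interface described above:
# the network consumes an RGB movie of any length and returns per-frame maps.
# All names and shapes here are illustrative assumptions.
import numpy as np

def psgnet_outputs(rgb_movie: np.ndarray) -> dict:
    """rgb_movie: (T, H, W, 3) frames in [0, 1]; returns per-frame estimates."""
    T, H, W, _ = rgb_movie.shape
    return {
        "rgb_reconstruction": np.zeros((T, H, W, 3)),          # same length as the input movie
        "depth": np.zeros((T, H, W, 1)),                        # per-frame depth map
        "normals": np.zeros((T, H, W, 3)),                      # per-frame surface-normal map
        "segments": np.zeros((T, H, W), dtype=int),             # self-supervised object segments
        "next_frame_delta": np.zeros((T, H, W, 3)),             # estimated next-frame RGB change
    }

outputs = psgnet_outputs(np.random.rand(4, 64, 64, 3))
print({k: v.shape for k, v in outputs.items()})
```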

The researchers propose that achieving human-level visual scene understanding in CV models will be difficult if CNNs don’t explicitly represent scene structures. Such structures can include discrete objects, part-whole relationships, or physical properties. “A hierarchical graph is natural for representing such structure,” Daniel Bear, first author of the paper and a Postdoctoral Research Fellow at Stanford University, tweeted. “Nodes correspond to discrete scene elements (objects and parts), edges to physical linkage (grouping parts together), and node data to visual or physical properties. We call this a ‘Physical Scene Graph’ (PSG).”
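The hierarchical structure Bear describes can be pictured as a small data structure. Below is a minimal Python sketch of a PSG node hierarchy, with every class and attribute name assumed purely for illustration; it is not the paper’s implementation.

```python
# Minimal sketch of the PSG idea: nodes are scene elements carrying visual or
# physical attributes, and child edges group parts under the objects they form.
# All names and example attributes here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class PSGNode:
    """A scene element (object or part) with its estimated attributes."""
    node_id: int
    attributes: dict = field(default_factory=dict)   # e.g. color, depth, surface normal
    children: list = field(default_factory=list)     # part-whole edges to finer-grained nodes

# Toy example: one "object" node grouping two "part" nodes.
wheel = PSGNode(0, {"color": "black", "mean_depth": 2.1})
body = PSGNode(1, {"color": "red", "mean_depth": 2.0})
car = PSGNode(2, {"hint": "vehicle"}, children=[wheel, body])

def describe(node: PSGNode, indent: int = 0) -> None:
    """Print the hierarchy: parent objects above, their parts indented below."""
    print(" " * indent + f"node {node.node_id}: {node.attributes}")
    for child in node.children:
        describe(child, indent + 2)

describe(car)
```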


PSGNet uses a three-stage process to build PSGs: feature extraction, graph construction, and graph rendering. In the first stage, input video passes through a convolutional recurrent neural network (ConvRNN) so that high- and low-level visual information can be efficiently combined. The ConvRNN features are then used as the base tensor for the next stage, where a spatiotemporal PSG is constructed. In this stage, PSGNets can “address two problems that ConvNets aren’t cut out for. (A) Grouping scene elements together w/o supervision and (B) aggregating information over spatial regions that vary from scene to scene,” Bear tweeted.

The team employs a pair of learnable modules, Graph Pooling and Graph Vectorization, to tackle these challenges. In the last stage, the PSGs pass through a decoder that renders them back into feature maps.
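Taken together, the three stages can be sketched as a simple pipeline. The stand-in functions below only mimic the roles of the ConvRNN extractor, Graph Pooling, Graph Vectorization, and the graph decoder; they are assumptions for illustration, not the actual PSGNet modules.

```python
# High-level sketch of the three-stage flow: (1) extract per-pixel features,
# (2) group pixels into nodes and aggregate features into node attributes,
# (3) render the node attributes back into feature maps. Every function here
# is a stand-in, not the paper's implementation.
import numpy as np

def convrnn_features(frames: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: per-pixel features combining high- and low-level cues."""
    T, H, W, _ = frames.shape
    return np.random.rand(T, H, W, 32)

def graph_pooling(features: np.ndarray) -> np.ndarray:
    """Stage 2a stand-in: group spatial locations into scene elements (a label map)."""
    T, H, W, _ = features.shape
    return np.random.randint(0, 8, size=(T, H, W))

def graph_vectorization(features: np.ndarray, labels: np.ndarray) -> dict:
    """Stage 2b stand-in: aggregate features within each group into node attributes."""
    return {int(k): features[labels == k].mean(axis=0) for k in np.unique(labels)}

def graph_render(nodes: dict, labels: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: paint each node's attribute vector back onto its region."""
    dim = next(iter(nodes.values())).shape[0]
    out = np.zeros(labels.shape + (dim,))
    for k, attr in nodes.items():
        out[labels == k] = attr
    return out

frames = np.random.rand(2, 32, 32, 3)        # a short RGB movie
feats = convrnn_features(frames)             # stage 1: feature extraction
labels = graph_pooling(feats)                # stage 2: grouping into scene elements
nodes = graph_vectorization(feats, labels)   # stage 2: per-node attribute vectors
maps = graph_render(nodes, labels)           # stage 3: decode back to feature maps
print(maps.shape)                            # (2, 32, 32, 32)
```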


In their experiments, the team says PSGNet substantially outperformed alternative unsupervised approaches to scene description at segmentation tasks, especially for real-world images. In the future, the researchers intend to look at applying PSGs for physical tasks that ConvNets are currently unable to perform well, such as predicting the dynamics of colliding objects from visual input.

The paper Learning Physical Graph Representations from Visual Scenes is on arXiv.


Journalist: Fangyu Cai | Editor: Michael Sarazen

