In a new paper, a research team led by Geoffrey Hinton combines the strengths of five advances in neural networks (Transformers, Neural Fields, Contrastive Representation Learning, Distillation and Capsules) to propose an imagined vision system, "GLOM," that would enable a neural network with a fixed architecture to parse each image into a part-whole hierarchy with a different structure.
Psychological evidence shows that humans parse visual scenes into part-whole hierarchies and model the viewpoint-invariant spatial relationship between a part and a whole. Although neural networks can in principle represent part-whole hierarchies, it is difficult to make them do so, because each image has a different parse tree and neural networks cannot dynamically allocate neurons to represent nodes in a parse tree: what a neuron does is determined by the slowly changing weights on its connections.
GLOM, derived from the slang term "glom together," is proposed to solve this issue and enable static neural nets to represent dynamic parse trees. Consider an intuitive example: one image patch contains parts of objects from classes A and B, while another patch contains parts of objects from classes A and C. Traditional neural nets struggle to represent such an image, but GLOM could discover the spatial coherence and represent its part-whole hierarchy.
The GLOM architecture comprises a large number of columns, where each column is a stack of spatially local autoencoders that learn multiple levels of representation. Each autoencoder transforms the embedding at one level into the embedding at an adjacent level using a multilayer bottom-up encoder and a multilayer top-down decoder. The embedding vectors at a given location correspond to the different levels of the part-whole hierarchy at that location.
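The column structure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding width, number of levels, MLP shapes and the helper names (`mlp`, `bottom_up`, `top_down`, `column`) are all hypothetical choices made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """A tiny fixed-random-weight MLP, returned as a closure (illustration only)."""
    weights = [rng.standard_normal((m, n)) * 0.1
               for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for w in weights[:-1]:
            x = np.tanh(x @ w)
        return x @ weights[-1]
    return forward

D, LEVELS = 16, 5  # embedding width and number of levels (hypothetical values)

# One column: a stack of per-level autoencoders.
bottom_up = [mlp([D, 32, D]) for _ in range(LEVELS - 1)]  # level l   -> level l+1
top_down  = [mlp([D, 32, D]) for _ in range(LEVELS - 1)]  # level l+1 -> level l

# A column's state: one embedding vector per level at a single image location.
column = [rng.standard_normal(D) for _ in range(LEVELS)]

# The bottom-up net predicts the level-1 embedding from level 0, and the
# top-down net predicts level 0 back from level 1.
up_pred   = bottom_up[0](column[0])
down_pred = top_down[0](column[1])
```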
At each level there are islands of agreement, which represent the parse tree. The paper proposes computing the level-L embedding at a location as an average of four contributions:
- the prediction produced by the bottom-up neural net acting on the embedding at the level below at the previous time
- the prediction produced by the top-down neural net acting on the embedding at the level above at the previous time
- the embedding vector at the previous time step
- the attention-weighted average of the embeddings at the same level in nearby columns at the previous time
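The four-way update above can be sketched as a single function. This is a hedged illustration under assumptions not fixed by the article: the attention weights are taken to be a softmax of dot-product similarity to nearby columns, and the four contributions are averaged with equal weights, both of which are choices made here rather than details from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_embedding(prev, bottom_up_pred, top_down_pred, neighbors):
    """One GLOM-style update of a level-L embedding at one location.

    prev           : this location's level-L embedding at the previous time step
    bottom_up_pred : prediction from the level below (via the bottom-up net)
    top_down_pred  : prediction from the level above (via the top-down net)
    neighbors      : level-L embeddings of nearby columns at the previous time
    """
    # Attention-weighted average of nearby columns: weights are a softmax of
    # dot-product similarity to this column's previous embedding (an assumption).
    sims = np.array([prev @ n for n in neighbors])
    attn = softmax(sims) @ np.stack(neighbors)
    # Equal weighting of the four contributions is also an assumption here;
    # the paper treats the exact weighting as a design decision.
    return (bottom_up_pred + top_down_pred + prev + attn) / 4.0
```

When a column already agrees with its neighbours and with both predictions, the update is a fixed point, which is how islands of agreement can persist once formed.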
The team also discusses GLOM’s design decisions and details, answering questions such as: How many levels are there? How fine-grained are the locations? Does the bottom-up net look at nearby locations? How does the attention work? What are the visual inputs? and so on.
Finally, the researchers analyze how GLOM compares to other neural network models, including capsule models, transformer models and convolutional neural networks, and explain where its design improves on them.
The reaction on social media has been swift, with some finding amusement in the abstract’s disclaimer, “This paper does not describe a working system,” and Dutch AI entrepreneur Tarry Singh applauding Hinton’s dedication: “A True researcher – Always loved Geoff for this.” Hinton’s tweet announcing the paper picked up over 1,700 Likes in just a few hours.
Hinton is joined on the GLOM project by researchers from the Vector Institute and the University of Toronto Department of Computer Science. The paper How to Represent Part-Whole Hierarchies in a Neural Network is on arXiv.
Author: Hecate He | Editor: Michael Sarazen