This is an updated version.
The Godfathers of AI and 2018 ACM Turing Award winners Geoffrey Hinton, Yann LeCun, and Yoshua Bengio shared a stage in New York on Sunday night at an event organized by the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020). The trio of researchers have made deep neural networks a critical component of computing, and in individual talks and a panel discussion they discussed their views on current challenges facing deep learning and where it should be heading.
Introduced in the mid 1980s, deep learning gained traction in the AI community in the early 2000s. The year 2012 saw the publication of the CVPR paper Multi-column Deep Neural Networks for Image Classification, which showed how max-pooling CNNs on GPUs could dramatically improve performance on many vision benchmarks. A similar system introduced months later by Hinton and a University of Toronto team won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. These events are regarded by many as the beginning of a deep learning revolution that has transformed AI.
Deep learning has been applied to speech recognition, image classification, content understanding, self-driving, and much more. And according to LeCun — who is now Chief AI Scientist at Facebook — the current services offered by Facebook, Instagram, Google, and YouTube are all built around deep learning.
Deep learning does, however, have its detractors. Alan Yuille, a Johns Hopkins University professor and one of the pioneers of computer vision, warned last year that deep learning’s potential in computer vision has hit a bottleneck.
“We read a lot about the limitations of deep learning today, but most of those are actually limitations of supervised learning,” LeCun explained in his talk. Supervised learning typically refers to learning with labelled data. LeCun told the New York audience that unsupervised learning without labels — or “self-supervised learning” as he prefers to call it — may be a game changer that ushers in AI’s next revolution.
“This is an argument that Geoff [Hinton] has been making for decades. I was skeptical for a long time but changed my mind,” said LeCun.
Hinton: Move on From CNN and Look at Capsule Autoencoders
There are two approaches to object recognition. There’s the good old-fashioned parts-based approach, with sensible modular representations, but this typically requires a lot of hand engineering. And then there are convolutional neural nets (CNNs), which learn everything end to end. CNNs get a huge win by wiring in the fact that if a feature is good in one place, it’s good somewhere else. But their approach to object recognition is very different from human perception.
This informed the first part of Hinton’s talk, which he personally directed at LeCun: “It’s about the problems with CNNs and why they’re rubbish.”
CNNs are designed to cope with translations, but they’re not so good at dealing with other effects of changing viewpoints such as rotation and scaling. One obvious approach is to use 4D or 6D maps instead of 2D maps, but that is very expensive. As a result, CNNs are typically trained on many different viewpoints so that they can generalize across viewpoints. “That’s not very efficient,” Hinton explained. “We’d like neural nets to generalize to new viewpoints effortlessly. If it learned to recognize something, then you make it 10 times as big and you rotate it 60 degrees, it shouldn’t cause them any problem at all. We know computer graphics is like that and we’d like to make neural nets more like that.”
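The gap between translation and rotation can be seen in a toy example. The sketch below is not from Hinton’s talk: the 7x7 image, the hand-written vertical-bar kernel, and all function names are assumptions chosen for illustration. It applies the same filter to a shifted and to a rotated copy of an image. The shifted input gives a shifted copy of the original feature map, while the rotated input gives a much weaker response.

```python
# Toy illustration only: function names and the 7x7 image are hypothetical.

def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def shift_right(grid, k):
    """Shift each row k columns to the right, padding with zeros."""
    return [[0] * k + row[:len(row) - k] for row in grid]


def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def peak(grid):
    """Largest response in a feature map."""
    return max(max(row) for row in grid)


# A vertical bar (rows 2-4, column 2) on a 7x7 background of zeros.
image = [[0] * 7 for _ in range(7)]
for r in (2, 3, 4):
    image[r][2] = 1

# A hand-written vertical-bar detector (a learned filter would play this role).
kernel = [[-1, 2, -1]] * 3

base = correlate2d(image, kernel)
shifted = correlate2d(shift_right(image, 2), kernel)
rotated = correlate2d(rotate90(image), kernel)

# Shifting the input shifts the feature map by the same amount.
translation_equivariant = shifted == shift_right(base, 2)

print(translation_equivariant, peak(base), peak(shifted), peak(rotated))
```

In this toy setting the filter’s peak response survives the shift but collapses under rotation, which is why a plain CNN would need to see rotated training examples, or learn additional filters, to recognize the same bar lying on its side.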
Hinton believes the answer is capsules. A capsule is a group of neurons that learns to represent a familiar shape or part. Hinton says the idea is to build more structure into neural networks and hope that the extra structure helps them generalize better. Capsules are an attempt to correct the things that are wrong with CNNs.
The capsules Hinton introduced are Stacked Capsule Auto-encoders, which first appeared at NeurIPS 2019 and are very different in many ways from previous capsule versions from ICLR 2018 and NIPS 2017. These had used discriminative learning. Hinton said even at the time he knew this was a bad idea: “I always knew unsupervised learning was the right thing to do — so it was bad faith to do the previous models.” The 2019 capsules use unsupervised learning.
LeCun: I Was Wrong; Dump Supervised Learning Now and Try Self-Supervised Learning
LeCun noted that although supervised learning has proven successful in, for example, speech recognition and content understanding, it still requires a large number of labelled samples. Reinforcement learning works great for games and in simulations, but because it requires too many trials it is not really applicable in the real world.
The first challenge LeCun discussed was how models can be expected to learn more with fewer labels, fewer samples or fewer trials.
LeCun now supports the unsupervised learning (self-supervised learning) solution Hinton first proposed some 15 years ago. “Basically it’s the idea of learning to represent the world before learning a task, and this is what babies do,” LeCun explained. He suggested that figuring out how humans learn so quickly and efficiently may be the key that unlocks self-supervised learning’s full potential going forward.
Self-supervised learning is largely responsible for the success of natural language processing (NLP) over the last year and a half or so. The idea is to show a system a piece of text, image, or video input, and train a model to predict the piece that’s missing — for example to predict missing words in a text, which is what transformers and BERT-like language systems were built to do.
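The “predict the missing piece” idea can be sketched in a few lines. The example below is a minimal illustration, not code from any of the systems mentioned: the function name `make_masked_example` and its parameters are hypothetical, and the `[MASK]` placeholder and the 15 percent masking rate follow the convention used by BERT. The point is that the training targets come from the text itself, so no human labels are needed.

```python
import random

MASK = "[MASK]"  # placeholder token; the name follows BERT's convention


def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and return (inputs, targets).

    `targets` maps each masked position to the original token, which is
    what a model would be trained to predict from the surrounding context.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets


sentence = "the cat sat on the mat because it was warm".split()
inputs, targets = make_masked_example(sentence, mask_prob=0.3)
print(inputs)
print(targets)
```

A real system would feed `inputs` to a model and score its predictions at the masked positions against `targets`; that model and its training loop are omitted here.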
But the success of Transformers, BERT, and similar models has not transferred to the image domain. It turns out to be much more difficult to represent uncertainty in predictions on images or video than on text, because images and video are not discrete. It’s practical to produce distributions over all the words in a dictionary, but it’s hard to represent distributions over all possible video frames. And this is, in LeCun’s view, “the main technical problem we have to solve if we want to apply self-supervised learning to a wider variety of modalities like videos.”
LeCun proposed one solution may be in latent variable energy-based models: “An energy-based model is kind of like a probabilistic model except you don’t normalize. And one way to train the energy-based model is to give low energy to samples that you observe and high energy to samples you do not observe.”
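The training rule LeCun describes can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not LeCun’s method: it uses a one-parameter quadratic energy with no latent variable, a hinge-style margin, and made-up data. It lowers the energy of observed samples and raises the energy of unobserved ones until they clear the margin, and it never normalizes the energies into probabilities.

```python
# Minimal sketch; the quadratic energy, the margin, and the data are assumptions.

def energy(x, mu):
    """Unnormalized score: low near mu, high far from it."""
    return (x - mu) ** 2


def train(observed, unobserved, mu=0.0, lr=0.05, margin=1.0, epochs=200):
    """Fit mu with plain gradient steps on a contrastive loss."""
    for _ in range(epochs):
        for x in observed:
            # Push energy down on observed samples.
            # d/dmu (x - mu)^2 = -2 * (x - mu)
            mu -= lr * (-2.0 * (x - mu))
        for x in unobserved:
            # Push energy up on unobserved samples, but only while it is
            # still below the margin (hinge loss: max(0, margin - energy)).
            if energy(x, mu) < margin:
                mu -= lr * (2.0 * (x - mu))
    return mu


observed = [1.8, 2.0, 2.2]           # samples we saw
unobserved = [-1.0, 0.0, 4.0, 5.0]   # contrastive samples we did not see

mu = train(observed, unobserved)
print(round(mu, 2))
print([round(energy(x, mu), 2) for x in observed])
print([round(energy(x, mu), 2) for x in unobserved])
```

After training, every observed sample has lower energy than every unobserved one, and the energies only need to rank samples correctly rather than sum to one.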
In his talk, LeCun touched on two other challenges:
- How to make reasoning compatible with gradient-based learning
- How to learn to plan complex action sequences — decomposing a complex task into sub-tasks
LeCun opined that nobody currently seems to have a good answer to either of these two challenges, and said he remains open to and looks forward to any possible ideas.
Bengio: It’s Time to Explore Consciousness
Yoshua Bengio, meanwhile, has shifted his focus to consciousness. Following advances in cognitive neuroscience, he believes the time is ripe for ML to explore consciousness, which he says could bring “new priors to help systematic and good generalization.” Ultimately, Bengio hopes such a research direction could allow DL to expand from “System 1 to System 2,” referring to a dichotomy introduced by Daniel Kahneman in his book Thinking, Fast and Slow. System 1 represents what current deep learning is very good at: processing that is intuitive, fast, automatic, and anchored in sensory perception. System 2 represents processing that is rational, sequential, slow, logical, conscious, and expressible with language.
Before he dived into the valuable lessons that can be learned from consciousness, Bengio briefed the audience on cognitive neuroscience. “It used to be seen in the previous century that working on consciousness was kind of taboo in many sciences for all kinds of reasons. But fortunately, this has changed and particularly in cognitive neuroscience. In particular, the Global Workspace Theory by Baars and the recent work in this century based on Dehaene, which really established these theories to explain a lot of the objective neuroscience observations.”
Bengio likened conscious processing to a bottleneck and asked “Why would this (bottleneck) be meaningful? Why is it that the brain would have this kind of bottleneck where information has to go through this bottleneck, just a few elements to be broadcast to the rest of the brain? Why would we have a short term memory that only contains like six or seven elements? It doesn’t make sense.”
Bengio said “the bottom line is get the magic out of consciousness” and proposed the consciousness prior, a new prior for learning representations of high-level concepts of the kind human beings manipulate with language. The consciousness prior is inspired by cognitive neuroscience theories of consciousness. “This prior can be combined with other priors in order to help in disentangling abstract factors from each other. What this is saying is that at that level of representation, our knowledge is represented in this very sparse graph where each of the dependencies, these factors involve two, three, four or five entities and that’s it.”
Consciousness can also provide inspiration on how to build models. Bengio explained “Agents are at the particular time at a particular place and they do something and they have an effect. And eventually that effect could have constant consequences all over the universe, but it takes time. And so if we can build models of the world where we have the right abstractions, where we can pin down those changes to just one or a few variables, then we will be able to adapt to those changes because we don’t need as much data, as much observation in order to figure out what has changed.”
So what’s required if deep learning is going to reach human-level intelligence? Bengio referenced his previous suggestions, that missing pieces of the puzzle include:
- Generalize faster from fewer examples
- Generalize out-of-distribution, better transfer learning, domain adaptation, reduce catastrophic forgetting in continual learning
- Additional compositionality from reasoning and consciousness
- Discover causal structures and exploit them
- Better models of the world, including common sense
- Exploit the agent perspective from RL, unsupervised exploration
AI Minds Converge
In a panel discussion, Hinton, LeCun and Bengio were asked how they reconcile their research approaches with colleagues committed to more traditional methods. Hinton had been conspicuously absent from some AAAI conferences, and hinted at why in responding: “The last time I submitted a paper to AAAI, I got the worst review I ever got. And it was mean. It said ‘Hinton has been working on this idea for seven years [vector representations] and nobody’s interested. Time to move on.’”
Hinton spoke of his efforts to find a common ground and move on: “Right now we’re in a position where we should just say, let’s forget the past and let’s see if we can take the idea of doing gradient descent in great big system parameters. And let’s see if we can take that idea, because that’s really all we’ve discovered so far. That really works. The fact that that works is amazing. And let’s see if we can learn to do reasoning like that.”
Author: Fangyu Cai & Yuan Yuan | Editor: Michael Sarazen