Those outside academia may know Tomaso Poggio through his students, DeepMind Founder Demis Hassabis and Mobileye Founder Amnon Shashua. The former built the celebrated AI Go champion AlphaGo, while the latter has installed copilot systems in more than 15 million vehicles worldwide, and produced the world’s first L2 autonomous driving system in a car.

While Poggio the teacher has taught some extraordinary leaders in AI, Poggio the scientist is renowned for his theory of deep learning, presented in papers with self-explanatory names: Theory of Deep Learning I, II and III.

He is a Professor in the Department of Brain and Cognitive Sciences, an investigator at the McGovern Institute for Brain Research, a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and Director of the Center for Biological and Computational Learning at MIT and the Center for Brains, Minds, and Machines.

Poggio’s research focuses on three deep learning problems: 1) Representation: Why are deep neural networks better than shallow ones? 2) Optimization: Why is SGD (Stochastic Gradient Descent) good at finding minima and what are good minima? 3) Generalization: Why is it that we don’t have to worry about overfitting despite overparameterization?

Poggio uses mathematics to explain each problem before inductively working out the theory.

## Why Are Deep Neural Networks Better Than Shallow Ones?

Poggio and mathematician Steve Smale co-authored a 2002 paper that summarized classical learning theories on neural networks with one hidden layer. “Classical theory tells us to use one-layer networks, while we find that the brain uses many layers,” recalls Poggio.

Both deep and single-layer networks can approximate continuous functions. This was one reason why research in the 80s focused on simpler single-layer networks.

The problem lies in the size of single-layer networks. To represent a complicated function, a single-layer network can require more units than there are atoms in the universe. Mathematically, this is called “the curse of dimensionality”: the number of parameters grows exponentially with the dimensionality of the function.

Mathematicians make assumptions about function smoothness in order to escape the curse of dimensionality. Deep learning offers a different approach that relies on compositional functions. The number of units a deep network needs to approximate a compositional function grows only linearly with the function’s dimensionality.

Deep learning works beautifully for datasets that are compositional in nature, such as images and voice samples. Images can be broken down into related snippets of detail, while voice samples can be converted into meaningful phonemes. For an image classification task, there’s no need to look at pixels that are far apart; the model simply observes each small patch and combines the results. The neural network escapes the curse of dimensionality by using a very small number of parameters.

If the target is a function made up of functions with a smaller number of variables, then a deep network can approximate it with a number of units that is linear in dimensionality no matter how big the function is.
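The gap between exponential and linear growth is easy to underestimate, so here is a rough sketch of it. It assumes the order-of-magnitude bounds reported in the review paper listed in the references (reference 3): about ε^(-n/m) units for a shallow network and about (n-1)·ε^(-2/m) units for a deep network shaped like a binary tree, where n is the number of input variables, ε the target accuracy and m the smoothness. Constants are dropped, and the function names are hypothetical.

```python
def shallow_units(n: int, eps: float, m: int = 1) -> float:
    """Order-of-magnitude unit count for a shallow network: eps^(-n/m)."""
    return eps ** (-n / m)


def deep_units(n: int, eps: float, m: int = 1) -> float:
    """Order-of-magnitude unit count for a binary-tree deep network
    approximating a compositional function: (n - 1) * eps^(-2/m)."""
    return (n - 1) * eps ** (-2 / m)


# Assumed setting for illustration: accuracy eps = 0.1, smoothness m = 1.
for n in (2, 8, 32, 128):
    print(f"n={n:4d}  shallow ~ {shallow_units(n, 0.1):.1e}  "
          f"deep ~ {deep_units(n, 0.1):.1e}")
```

Under these assumptions the shallow count passes 10^80, a common estimate for the number of atoms in the observable universe, well before n reaches 128, while the deep count stays in the thousands.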

Knowing that compositional functions work well with deep networks is far from enough. “For a computer scientist or a mathematician, can we say something more about compositional functions beyond the fact that they’re compositional? Can we characterize them to get a better understanding of neural networks? This is an interesting open question,” says Poggio.

## Overparameterization and Stochastic Gradient Descent (SGD) Make Optimization Great

Deep networks have far more parameters than the number of examples in the training set.

The CIFAR dataset has 60,000 examples, and we use networks with millions of weights to process it. This is a typical case of overparameterization. We can make a hypothesis to simplify the matter: if one replaces the nonlinear neurons in the deep network with univariate polynomials, then getting zero training error on CIFAR means solving 60,000 polynomial equations. Because the unknowns (the weights) far outnumber the equations, Bézout’s theorem suggests an infinite set of solutions, and these solutions are the global minima of the training loss.
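A toy version of this counting argument, not Poggio’s actual construction: fit a quadratic (three coefficients) to two data points. The two equations cannot pin down three unknowns, so there is a whole one-parameter family of polynomials with exactly zero training error. The data points and the helper names below are made up for illustration.

```python
def interpolating_line(points):
    """The unique straight line through two points."""
    (x0, y0), (x1, y1) = points
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)


def family_member(points, t):
    """A quadratic that passes through both points for ANY value of t.

    The added term t * (x - x0) * (x - x1) vanishes at the data points,
    so it changes the model without changing the training error.
    """
    base = interpolating_line(points)
    (x0, _), (x1, _) = points
    return lambda x: base(x) + t * (x - x0) * (x - x1)


def training_error(model, points):
    """Sum of squared errors on the training points."""
    return sum((model(x) - y) ** 2 for x, y in points)


points = [(0.0, 1.0), (2.0, 5.0)]          # hypothetical training set
models = [family_member(points, t) for t in (-3.0, 0.0, 0.5, 10.0)]
errors = [training_error(m, points) for m in models]
off_data = [m(1.0) for m in models]        # predictions away from the data

print(errors)     # every member fits the training set
print(off_data)   # yet the members disagree elsewhere
```

Every value of `t` is a global minimum of the training loss, which is the degenerate, valley-like structure the argument relies on.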

Thus overparameterization guarantees lots of global minima that form flat valleys in the loss space. As SGD is known for its preference for flat valleys, there is a high probability that SGD will find the global minima for neural networks.

Poggio’s work showed that a combination of overparameterization and SGD simplifies the optimization of neural networks.

## Connecting Classification Tasks With Cross Entropy: A Promising Generalization Despite Overfitting

Overparameterization is good news for optimization, but a nightmare when it comes to generalization. In classical machine learning, test error goes down and then back up again, which is called “overfitting.” Yet in deep learning, overfitting is not reported: the test error rate goes down and stays there.

Why is this the case? Poggio likens this to a “chemical reaction” that occurs when classification tasks are mixed with a specific type of loss function called cross entropy.

Although we can use 0-1 loss to evaluate error rates, we need a different loss function for training. Take handwritten digit classifiers as an example: the last step for the neural network is to turn a softmax into a “hardmax”, in other words, a class. Thus, even a poor model that is only 30% sure that the “1” we show it is a “1” will classify the image correctly, as long as 30% is the highest among the 10 possibilities. Of course no one would be satisfied with a model that is only 30% confident. The model needs further optimization, which can’t be done using a 0-1 loss.
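The 30% example can be written out in a few lines. This is a minimal sketch with made-up probability vectors; the helper names are hypothetical. It shows that the 0-1 loss is already zero for the hesitant model, so it offers nothing left to improve, while the cross entropy still separates the hesitant model from a confident one.

```python
import math


def hardmax(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


def zero_one_loss(probs, label):
    """0 if the predicted class is correct, 1 otherwise."""
    return 0 if hardmax(probs) == label else 1


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


# Hesitant model: 30% on digit "1", the other 70% spread evenly over 9 digits.
hesitant = [0.7 / 9] * 10
hesitant[1] = 0.3

# Confident model: 99% on digit "1".
confident = [0.01 / 9] * 10
confident[1] = 0.99

for name, probs in (("hesitant", hesitant), ("confident", confident)):
    print(name, zero_one_loss(probs, 1), round(cross_entropy(probs, 1), 4))
```

Both models score 0 on the 0-1 loss, but the cross entropy of the hesitant model is about 1.2 against roughly 0.01 for the confident one.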

With cross entropy, as long as the model is not 100% certain, one can continue to optimize it by calculating another gradient and using backpropagation for fine-tuning. On a side note, the favourable property of using cross entropy as the loss function and 0-1 loss as the error metric is that, even when cross entropy is overfitted, the 0-1 loss will work just fine. A few months ago, University of Chicago researcher Nathan Srebro and his colleagues proved this for a special case of linear networks with separable datasets.
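To see why there is always “another gradient”, recall that for a softmax output the gradient of the cross entropy with respect to the logits is the predicted probabilities minus the one-hot label. It only vanishes when the model puts probability 1 on the true class. The sketch below runs plain gradient descent directly on three logits, which is a simplification (a real network updates weights, not logits); the starting values and the learning rate are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, label):
    """Gradient of -log(softmax(logits)[label]) w.r.t. the logits: p - onehot."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


label = 1
logits = [0.0, 0.5, 0.0]   # class 1 already wins the argmax
history = []               # probability assigned to the true class
decisions = []             # predicted class at each step

for _ in range(50):
    probs = softmax(logits)
    history.append(probs[label])
    decisions.append(max(range(len(probs)), key=probs.__getitem__))
    grad = cross_entropy_grad(logits, label)
    logits = [z - 1.0 * g for z, g in zip(logits, grad)]

print(history[0], history[-1])
print(set(decisions))
```

The predicted class never changes, so the 0-1 error is constant, yet the confidence in the true class keeps rising without reaching 1.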

“On top of [Srebro’s work], we’ve shown that using differential equations from dynamical system theory will make a deep network behave like a linear network near a global minimum. We can use the Srebro result to say the same thing about deep learning: even if a deep neural network classifier has an overfitting cross entropy, the classifier itself wouldn’t overfit,” says Poggio.

This property of cross entropy is shared with loss functions such as exponential loss, but not with simpler ones like the least-squares error. Why is this the case? Poggio says this remains an unsolved question.

## Do Flat Minima Tell Us Anything About Generalization? A Change in Opinion

Poggio says his opinion on the shape of minima and their corresponding generalization capabilities has changed recently: “People said in papers that flatness is good for generalization. I also said something like this a year ago, but I don’t think this is true anymore.”

“I don’t see a direct relation between flatness and generalization. Generalization relies on properties like choosing classification as the task and choosing cross entropy as the loss function, but not on flatness. There’s a paper of which two out of four authors are Bengios, which proves that even sharp minima can generalize, because you can change the weights in different layers to make a minimum sharp without changing the input-output relation of the network.”
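The reparameterization Poggio refers to can be illustrated with a tiny ReLU network. This is a minimal sketch with arbitrary weights, not the construction from the paper: multiplying the first layer by a positive factor and dividing the second layer by the same factor gives different weights but the same function, because ReLU satisfies relu(αz) = α·relu(z) for α > 0.

```python
def relu(z):
    return max(0.0, z)


def two_layer_net(x, w1, w2):
    """Scalar-input network: sum_j w2[j] * relu(w1[j] * x)."""
    return sum(b * relu(a * x) for a, b in zip(w1, w2))


def rescale(w1, w2, alpha):
    """Scale the first layer by alpha > 0 and the second layer by 1 / alpha."""
    return [a * alpha for a in w1], [b / alpha for b in w2]


w1, w2 = [1.5, -0.5, 2.0], [0.5, 1.0, -0.25]   # arbitrary example weights
w1_s, w2_s = rescale(w1, w2, alpha=4.0)

inputs = (-2.0, 0.0, 0.5, 3.0)
original = [two_layer_net(x, w1, w2) for x in inputs]
rescaled = [two_layer_net(x, w1_s, w2_s) for x in inputs]

print(original)
print(rescaled)
```

Since the two weight settings sit at different points in weight space but compute the same input-output map, any measure of sharpness that changes under such rescaling cannot by itself determine how well the network generalizes.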

Poggio also doesn’t think it’s possible for a flat minimum to exist, at least not for neural networks that are polynomial in nature.

## Neural Network Applications: Be Careful of Overfitting

Learning deep learning theory can be enlightening, but engineers working on applications ask the question: How can theoretical research work help me train my model?

The No Free Lunch Theorem tells us that any two learning algorithms are equivalent when no prior information about the data distribution is available. For any algorithms A and B, there are as many distributions on which A outperforms B as distributions on which B outperforms A.

Poggio uses the theorem to argue that in machine learning, no algorithm works best for every problem. “Theory tells you about the average case and the worst case, what you should or should not do to avoid bad things. But it can’t advise you on the best thing to do for any particular case.”

Poggio suggests that engineers who employ deep learning models be careful of overfitting: “One lesson to learn from the past few decades of machine learning is that when you don’t have enough data, then after many trials, the state-of-the-art method is usually overfitting. It’s not because people have peeked at the validation set, it’s just that the community of researchers has tried too many different algorithms.”
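A toy enumeration makes the statement tangible. The sketch below takes a handful of unseen inputs, lists every possible binary labeling of them, and averages the accuracy of two very different predictors over all those labelings. It is only an illustration of the averaging argument under a uniform weighting of targets, not a proof of the theorem, and the two predictors are made up.

```python
from itertools import product


def average_accuracy(predict, test_inputs):
    """Mean accuracy of `predict` over every binary labeling of test_inputs."""
    labelings = list(product((0, 1), repeat=len(test_inputs)))
    total = 0.0
    for labels in labelings:
        correct = sum(predict(x) == y for x, y in zip(test_inputs, labels))
        total += correct / len(test_inputs)
    return total / len(labelings)


def always_zero(x):
    """Hypothetical algorithm A: predicts class 0 for everything."""
    return 0


def parity(x):
    """Hypothetical algorithm B: predicts the parity of the input."""
    return x % 2


test_inputs = [3, 4, 5, 6]   # points not seen during training
acc_a = average_accuracy(always_zero, test_inputs)
acc_b = average_accuracy(parity, test_inputs)
print(acc_a, acc_b)
```

Both predictors average to chance level; one can only beat the other once some labelings are considered more likely than others, which is exactly what prior information provides.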

“I’m a physicist by origin. When I was in school, the rule of thumb was that if you have a model or a set of equations with n parameters, you need at least 2n data points. If you want to do something statistical, the recommendation was to have 10n data points. Nowadays people use 300,000 parameters for datasets of any size. The arguments we make, like ‘deep learning models tend not to overfit,’ are only true for classification tasks with nice datasets, so people should be more careful about that.”
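Poggio’s rule of thumb is simple enough to turn into a quick sanity check. The helper below is a hypothetical illustration, applied to the 300,000-parameter figure from the quote and the 60,000-example CIFAR figure mentioned earlier.

```python
def min_data_points(n_params: int, statistical: bool = False) -> int:
    """Rule of thumb from the quote: 2n points to fit, 10n for statistics."""
    return (10 if statistical else 2) * n_params


n_params = 300_000
dataset_size = 60_000   # CIFAR, as cited above

needed_fit = min_data_points(n_params)
needed_stats = min_data_points(n_params, statistical=True)

print(needed_fit, needed_stats)
print(dataset_size >= needed_fit)
```

By this classical yardstick a 300,000-parameter model would call for 600,000 to 3,000,000 data points, far more than the 60,000 available.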

## What Does Theoretical Research Tell Us About Priors?

Humans don’t need millions of pieces of labeled data to learn, thanks to prior knowledge carried in our genes. “There is not a simple answer for how many priors we need for a model. There are only situations where we know the minimum priors needed to make predictions.”

Poggio uses regression as an example: “If I want to reconstruct a curve from points, I can’t do anything unless I have all the points. Continuity is essential but not enough; the least I need is something like smoothness. In the end, it’s a tradeoff between how strong the priors are and how much data you need.”

## What Can Neural Networks Learn From the Human Brain?

MIT has a tradition of knitting together deep learning and neuroscience, so what is Poggio’s view on learning from the human brain?

“I think it’s unlikely, but not impossible, that things like backpropagation can be done biologically, given what we know about neurons and signal processing. What I think is impossible is labeling everything.”

How our brain gets around labeling is an interesting question. Poggio suggests that our visual system, for example, is pre-trained to “colour-fill” an image. It receives the colour information but only gives black, grey, and white signals to the visual cortex. You do not need an oracle to tell you what the real colour is: your brain hides this part of the information, so that “colours are measured but not given to the [brain] network,” explains Poggio.
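In machine learning terms this is a label-free training setup: the grey version of an image is the input and the colour that was measured but withheld is the target. The sketch below builds such input-target pairs from raw RGB pixels. It is only an illustration of the idea, not a model of the visual system; the luminance weights are the standard Rec. 601 values, chosen here as an assumption, and the tiny image is made up.

```python
def to_gray(pixel):
    """Luminance of an (r, g, b) pixel using Rec. 601 weights (an assumption)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def make_colourization_pairs(image):
    """Turn an RGB image (rows of (r, g, b) tuples) into (input, target) pairs.

    The input is the grey value; the target is the original colour,
    so no human-provided label is needed.
    """
    return [(to_gray(pixel), pixel) for row in image for pixel in row]


# Hypothetical 2x2 image: red, green, blue and white pixels.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

pairs = make_colourization_pairs(image)
for gray, colour in pairs:
    print(round(gray, 1), colour)
```

A network trained on such pairs gets its supervision from the data itself, which is the property Poggio points to when asking how the brain avoids labeling.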

“The hope is that if you train a network to predict colour or the next image, can this network do other things easier? Can it learn to recognize objects with less data?” asks Poggio. “These are open questions that, once we get the answers, the whole deep learning community would benefit from.”

**References:**

1. Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1), 1-49.

2. Neyshabur, B., Tomioka, R., Salakhutdinov, R., & Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.

3. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5), 503-519.

4. Liao, Q., & Poggio, T. (2017). Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning. arXiv preprint arXiv:1703.09833.

5. Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., & Poggio, T. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. arXiv preprint arXiv:1801.02254.

6. Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., … & Mhaskar, H. (2017). Theory of Deep Learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173.

7. Zhang, C., Liao, Q., Rakhlin, A., Sridharan, K., Miranda, B., Golowich, N., & Poggio, T. (2017). Theory of deep learning iii: Generalization properties of SGD. Center for Brains, Minds and Machines (CBMM).

8. Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933.

9. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural computation, 8(7), 1341-1390.

10. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1), 67-82.

**Author:** Luna Qiu | **Editor:** Meghan Han, Michael Sarazen
