Google Brain Simplifies Network Learning Dynamics Characterization Under Gradient Descent

Machine learning models based on deep neural networks have achieved unprecedented performance on many tasks. These models are generally considered complex systems that are difficult to analyze theoretically. Moreover, because the optimization process is usually governed by a high-dimensional non-convex loss surface, it is very challenging to describe the gradient-based dynamics of these models during training. A new Google Brain paper shows that characterizing these dynamics becomes relatively easy for wide neural networks trained with gradient descent.

“In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.” (arXiv)
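To make the "first-order Taylor expansion of the network around its initial parameters" concrete, here is a minimal sketch of how such a linearized model can be built: f_lin(x; θ) = f(x; θ0) + ∇θ f(x; θ0) · (θ − θ0). The tiny one-hidden-layer tanh network, the 1/sqrt(width) output scaling, and all function names below are illustrative assumptions, not the architecture or code used in the paper.

```python
import math
import random


def init_params(width, seed=0):
    """Flat parameter vector [w_1..w_n, a_1..a_n] drawn from N(0, 1) (assumed init)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input-to-hidden weights
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]  # hidden-to-output weights
    return w + a


def f(params, x):
    """Hypothetical network: f(x) = (1/sqrt(n)) * sum_i a_i * tanh(w_i * x)."""
    n = len(params) // 2
    w, a = params[:n], params[n:]
    return sum(a_i * math.tanh(w_i * x) for w_i, a_i in zip(w, a)) / math.sqrt(n)


def grad_f(params, x):
    """Analytic gradient of f with respect to the flat parameter vector."""
    n = len(params) // 2
    w, a = params[:n], params[n:]
    scale = 1.0 / math.sqrt(n)
    dw = [scale * a_i * x * (1.0 - math.tanh(w_i * x) ** 2) for w_i, a_i in zip(w, a)]
    da = [scale * math.tanh(w_i * x) for w_i in w]
    return dw + da


def f_lin(params, params0, x):
    """First-order Taylor expansion of f around params0, evaluated at params."""
    g = grad_f(params0, x)
    return f(params0, x) + sum(g_i * (p - p0) for g_i, p, p0 in zip(g, params, params0))


# Usage: compare the network with its linearization after a small,
# hand-made parameter change (a stand-in for a few training steps).
theta0 = init_params(width=512)
theta = [p + 0.01 * math.sin(i) for i, p in enumerate(theta0)]
x = 0.7
gap = abs(f(theta, x) - f_lin(theta, theta0, x))
print("network:", f(theta, x), "linearized:", f_lin(theta, theta0, x), "gap:", gap)
```

The linearized model is linear in the parameters but still nonlinear in the input x, which is why its training dynamics are simple to analyze. This sketch only shows the construction; it does not reproduce the paper's claim that trained wide networks stay close to their linearization.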

Synced invited Yi Wu, a 5th year PhD candidate at UC Berkeley under the supervision of Professor Stuart Russell, to share his thoughts on Wide Neural Networks.

How would you describe Wide Neural Networks?

“Wide neural networks” is a general term referring to neural networks with a large number of hidden units per layer, as opposed to “deep neural networks”, which typically have a lot of layers but a small number of parameters in each layer.

Why does this research matter?

First, why does this work in particular matter? Despite the huge success of neural networks in a variety of applications, it is still extremely important to understand why and how they work. Such a fundamental problem not only has scientific value for the research community but could also provide insights for the development of new algorithms and architectures. This work is a nice step in that direction, so it matters to the community.

Second, why do wide neural networks matter? Directly analyzing state-of-the-art giant neural networks is hard, so we must start from some simplifications. Wide neural networks are a necessary step before we can thoroughly analyze deep neural networks and finally reach the goal of fully (and theoretically) understanding general deep learning.

What impact might this research bring to the AI community?

At a high level, this paper provides a very nice summary of and introduction to recent advances in the theoretical understanding of wide neural networks, including the work of Jacot et al., Du et al., and Allen-Zhu et al. Moreover, it presents very comprehensive experimental studies on real data and neural networks, which support the theoretical results. For the general community, the paper provides clear insights and can be viewed as a nice tutorial on recent advances in understanding deep learning. For the learning community, the positive experimental results validate the efforts and progress in this field and push the frontier of theoretical analysis of neural networks.

Can you identify any bottlenecks in the research?

Although the theoretical analysis is very clear and thorough, the main contribution and novelty of this work is the experimental study. The theoretical part does not provide any new theorems and reads more like a (very nice) summary. Regarding the experiments, although most of the results are positive, we can still notice in figures 5 and 6 that, as the number of training iterations grows, a gap may appear between the real neural network and the theoretical prediction (solid line and dashed line). Moreover, it seems that training has not yet converged in the multi-class classification experiment (figure 6, bottom-middle). One might wonder whether the gap becomes large in the multi-class setting as training converges. Narrowing this gap between theory and practice may be worth further study.

Can you predict any potential future developments related to this research?

Deep learning theory has recently become a remarkably popular area in machine learning, progressing from the study of two-layer neural networks (e.g., Yuandong Tian, Li et al., Ge et al.) to wide neural networks (see references above). Recently, there has even been theoretical progress on recurrent neural networks (Allen-Zhu et al.) and deep Q networks (Jin et al.). Although there is still a gap between our current best theoretical analysis and our state-of-the-art deep learning models, I am confident that we will eventually have a complete theoretical foundation for deep learning in practice.

The paper Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent is on arXiv.

Yi Wu is a 5th year PhD candidate at UC Berkeley under the supervision of Professor Stuart Russell. He is also currently a visiting researcher at OpenAI Inc. In Spring 2020, Yi will join the Tsinghua University Institute of Interdisciplinary Information Sciences as a tenure-track assistant professor. Yi’s research interests include deep reinforcement learning, natural language processing and probabilistic programming. His representative works include Value Iteration Network (Best paper at NIPS 2016), MADDPG and the House3D project.
