New research from Carnegie Mellon University, Peking University and the Massachusetts Institute of Technology shows that global minima of deep neural networks can be achieved via gradient descent under certain conditions. The paper Gradient Descent Finds Global Minima of Deep Neural Networks was published November 12 on arXiv.
Finding the global minima of neural networks is a challenge that has long plagued academic researchers. It is generally believed that stochastic gradient descent in a neural network converges only to a local minimum. Recent research, however, indicates that gradient descent can find a globally optimal solution under certain conditions. The paper theoretically analyzes the convergence of the training loss for fully connected architectures and residual networks (ResNets).
The authors start from a fully connected feedforward architecture and show that if the width m of the hidden layers satisfies m = Ω(poly(n)·2^O(H)), where n is the number of training samples and H is the number of hidden layers, then randomly initialized gradient descent converges to zero training loss at a linear rate. For ResNet architectures, meanwhile, the dependence on the number of layers improves exponentially, demonstrating the advantage of using residual connections.
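The flavor of this result can be illustrated with a toy experiment (a hypothetical sketch, not the paper's construction): full-batch gradient descent on a two-layer ReLU network whose hidden width m = 1000 far exceeds the sample count n = 5, with only the first layer trained and the output layer frozen at random signs, a common setup in this line of analysis. All sizes, the seed, and the learning rate below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 5, 3, 1000                              # n samples, input dim d, width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))                   # trained first layer
a = rng.choice([-1.0, 1.0], size=m)               # frozen random output layer

def loss(W):
    pred = np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)
    return 0.5 * np.sum((pred - y) ** 2)

lr, losses = 0.5, []
for _ in range(1000):
    pre = X @ W.T                                 # pre-activations, shape (n, m)
    pred = np.maximum(pre, 0.0) @ a / np.sqrt(m)
    r = pred - y                                  # residuals, shape (n,)
    mask = (pre > 0).astype(float)                # ReLU gate
    grad = ((r[:, None] * mask) * (a / np.sqrt(m))).T @ X
    W -= lr * grad
    losses.append(loss(W))
```

With this much over-parameterization the training loss decays geometrically toward zero, which is the "linear rate" the paper refers to (linear in the log of the loss, not in the number of steps).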
In addition to the fully connected feedforward architecture, the authors show that in the ResNet architecture, a hidden-layer width of m = Ω(poly(n, H)) suffices for randomly initialized gradient descent to reach zero training loss. This indirectly explains the advantage of the ResNet architecture: the required width grows only polynomially in depth, rather than exponentially. For convolutional ResNets, m need only be Ω(poly(n, p, H)), where p is the number of patches.
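One heuristic intuition for why depth dependence improves with skip connections (an illustrative sketch, not the paper's proof) is that residual blocks keep the forward signal close to the identity map. With the same random weights, a plain H-layer ReLU stack can shrink a signal exponentially in H, while a residual stack x + s·relu(Wx) with a small scale s keeps the norm stable; the depth H, width m, and scale s below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

H, m, s = 20, 100, 0.1
x0 = rng.standard_normal(m)
Ws = [rng.standard_normal((m, m)) / np.sqrt(m) for _ in range(H)]

x_plain, x_res = x0.copy(), x0.copy()
for W in Ws:
    x_plain = np.maximum(W @ x_plain, 0.0)            # squared norm roughly halves per layer
    x_res = x_res + s * np.maximum(W @ x_res, 0.0)    # small perturbation of the identity

plain_ratio = np.linalg.norm(x_plain) / np.linalg.norm(x0)
res_ratio = np.linalg.norm(x_res) / np.linalg.norm(x0)
```

Here plain_ratio collapses toward zero as H grows, while res_ratio stays within a constant factor of 1, mirroring the exponential-versus-polynomial gap in the paper's width requirements.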
The paper explores the relationship between the number of parameters and global minima across three network architectures, and also demonstrates the advantages of ResNets. Similar work in this area comes from Microsoft, Stanford University and the University of Texas at Austin, whose researchers have in the last two months submitted papers on the parameter sizes of RNNs and DNNs, proposing that fewer parameters are required to achieve global minima.
The paper Gradient Descent Finds Global Minima of Deep Neural Networks is on arXiv.
Author: Alex Chen | Editor: Michael Sarazen