An optimization algorithm (or optimizer) is the main approach used today for training a machine learning model to minimize its error rate. Two metrics determine the efficacy of an optimizer: speed of convergence (how quickly gradient descent reaches an optimum) and generalization (the model’s performance on new data). Popular algorithms such as Adaptive Moment Estimation (Adam) or Stochastic Gradient Descent (SGD) can capably cover one metric or the other, but researchers can’t have it both ways.
A paper recently accepted for ICLR 2019 challenges this with a novel optimizer, AdaBound, that its authors say can train machine learning models “as fast as Adam and as good as SGD.” Basically, AdaBound is an Adam variant that employs dynamic bounds on learning rates to achieve a gradual and smooth transition to SGD.
A conference reviewer of the paper Adaptive Gradient Methods with Dynamic Bound of Learning Rate commented, “Their approach to bound is well structured in that it converges to SGD in the infinite limit and allows the algorithm to get the best of both worlds – faster convergence and better generalization.”
Adam vs SGD
To better understand the paper’s implications, it is necessary to first look at the pros and cons of popular optimization algorithms Adam and SGD.
Gradient descent is the most common method used to optimize deep learning networks. First described by Cauchy in 1847, the technique computes how a small change in each parameter of a model would affect the objective function, moves the parameters in the direction that lowers the error, and keeps iterating until the objective function converges to a minimum.
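As a minimal sketch of that loop, the Python snippet below runs gradient descent on a toy one-variable objective. The function name `gradient_descent`, the objective and the default learning rate and step count are all illustrative assumptions, not anything taken from the paper.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Minimize a function of one variable given its gradient `grad`."""
    w = w0
    for _ in range(steps):
        # Step against the gradient, i.e. in the direction that lowers the objective.
        w -= lr * grad(w)
    return w


# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
# Its minimum is at w = 3, so the result should end up very close to 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

In a real network `w` would be a large vector of weights and the gradient would come from backpropagation, but the update rule has the same shape.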
SGD is a variant of gradient descent. Instead of computing the gradient over the whole dataset, which is redundant and inefficient, SGD computes it on a small, randomly selected subset of the data examples at each step. With a sufficiently low learning rate, SGD can reach roughly the same performance as regular gradient descent.
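The difference from full-batch gradient descent can be sketched in a few lines of Python. The example fits a single slope to toy, noiseless data; the function name `sgd_fit_slope`, the data and every hyperparameter are assumptions chosen only for illustration.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.01, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by minimizing mean squared error with mini-batch SGD."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only,
            # not over the whole dataset.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [float(i) for i in range(1, 9)]  # toy inputs 1..8
ys = [2.0 * x for x in xs]            # noiseless targets, true slope is 2
w = sgd_fit_slope(xs, ys)
print(w)
```

Each update here touches only four of the eight examples, which is the source of both SGD's lower cost per step and the noise in its trajectory.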
In recent years, however, a number of new optimizers have been proposed to tackle complex training scenarios where gradient descent methods behave poorly. One of the most widely used and practical optimizers for training deep learning models is Adam. Tesla AI Director Andrej Karpathy estimated in his 2017 blog post A Peek at Trends in Machine Learning that Adam appears in about 23 percent of academic papers: “It’s likely higher than 23% because some papers don’t declare the optimization algorithm, and a good chunk of papers might not even be optimizing any neural network at all.”
Essentially Adam is an algorithm for gradient-based optimization of stochastic objective functions. It combines the advantages of two SGD extensions — Root Mean Square Propagation (RMSProp) and Adaptive Gradient Algorithm (AdaGrad) — and computes individual adaptive learning rates for different parameters. (To learn more about Adam, Synced recommends Adam — latest trends in deep learning optimization.)
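To make "individual adaptive learning rates" concrete, here is a rough Python sketch of the standard Adam update applied to a list of parameters. The function name `adam_minimize`, the toy objective and the step count are assumptions for illustration; the `beta1`, `beta2` and `eps` defaults are the commonly quoted ones.

```python
import math


def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam for a fixed number of steps; `grad(w)` returns a list of gradients."""
    w = list(w0)
    m = [0.0] * len(w)  # running average of gradients (first moment)
    v = [0.0] * len(w)  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each parameter gets its own effective step size lr / sqrt(v_hat).
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical objective f(w) = (w[0] - 1)^2 + 10 * (w[1] + 2)^2, minimum at (1, -2).
def toy_grad(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


w_adam = adam_minimize(toy_grad, [0.0, 0.0])
print(w_adam)
```

The two coordinates have very different curvature, yet dividing by the square root of the second moment lets both move at a similar pace early in training.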
Despite the widespread popularity of Adam, recent research papers have noted that it can fail to converge to an optimal solution under specific settings. The paper Improving Generalization Performance by Switching from Adam to SGD demonstrates that adaptive optimization techniques such as Adam generalize poorly compared to SGD. This has prompted some researchers to explore new techniques that may improve on Adam.
What is AdaBound?
Here are fast takeaways from the paper:
- The paper’s authors first argued that the weaker generalization performance of adaptive methods such as Adam and RMSProp might be caused by unstable and/or extreme learning rates. They also suggested that the small learning rates of adaptive methods can lead to undesirable non-convergence.
- The researchers suggested that AMSGrad, a recent optimization algorithm proposed to improve empirical performance by introducing non-increasing learning rates, neglects the possible effects of small learning rates.
- The paper introduces new variants of Adam and AMSGrad: AdaBound and AMSBound, respectively. These employ dynamic bounds on learning rates in adaptive optimization algorithms, where the lower and upper bounds are initialized as zero and infinity respectively, and both smoothly converge to a constant final step size.
- The authors conducted experiments on several standard benchmarks, including feedforward neural networks, convolutional neural networks (DenseNet and ResNet on CIFAR-10), and recurrent neural networks (1-layer, 2-layer and 3-layer LSTM on Penn Treebank).
- AdaBound and AMSBound achieved the best accuracy in most test sets when compared to other adaptive optimizers and SGD, while maintaining relatively fast training speeds and hyperparameter insensitivity. The experiment results also demonstrate that the AdaBound and AMSBound improvements are related to the complexity of the architecture.
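The dynamic-bounds idea from the takeaways above can be sketched in Python as an Adam-style update whose per-parameter step size is clipped into a band that narrows over time. This is a simplified illustration, not the paper's exact algorithm: the specific bound functions in `dynamic_bounds`, the `gamma` parameter, the absence of bias correction or any step-size decay, and all default values are assumptions. The bound functions were chosen only because they behave as described (the lower bound starts near zero, the upper bound starts very large, and both converge to `final_lr`).

```python
import math


def dynamic_bounds(t, final_lr=0.1, gamma=1e-3):
    """Illustrative lower/upper step-size bounds that both approach final_lr as t grows."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))  # close to 0 for small t
    upper = final_lr * (1.0 + 1.0 / (gamma * t))        # very large for small t
    return lower, upper


def adabound_minimize(grad, w0, lr=0.1, final_lr=0.1, beta1=0.9, beta2=0.999,
                      gamma=1e-3, eps=1e-8, steps=2000):
    """Adam-style updates with the per-parameter step size clipped to dynamic bounds."""
    w = list(w0)
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = grad(w)
        lower, upper = dynamic_bounds(t, final_lr, gamma)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Adaptive step size as in Adam, then clipped into [lower, upper].
            step_size = lr / (math.sqrt(v[i]) + eps)
            step_size = min(max(step_size, lower), upper)
            w[i] -= step_size * m[i]
    return w


# Hypothetical objective f(w) = (w[0] - 1)^2 + 10 * (w[1] + 2)^2, minimum at (1, -2).
def toy_grad(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


w_ab = adabound_minimize(toy_grad, [0.0, 0.0])
print(dynamic_bounds(1), dynamic_bounds(10 ** 6), w_ab)
```

Early on the band is so wide that the clip rarely triggers and the update behaves like an adaptive method. As the band tightens around `final_lr`, every parameter is pushed toward the same step size, which is what makes the late phase resemble SGD with momentum.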
One paper reviewer suggested that “the paper could be improved by including more and larger data sets. For example, the authors ran on CIFAR-10. They could have done CIFAR-100, for example, to get more believable results.”
The paper’s lead author Liangchen Luo (骆梁宸) and second author Yuanhao Xiong (熊远昊) are undergraduate students at China’s elite Peking and Zhejiang Universities respectively. Luo also has three publications accepted by top AI conferences EMNLP 2018 and AAAI 2019.
Journalist: Tony Peng | Editor: Michael Sarazen