Optimization is a crucial tool for minimizing error, cost, or loss when fitting a machine learning model. One of the key challenges for optimizers is finding an appropriate learning rate, which significantly affects both the convergence speed and the accuracy of the final results.
Despite the good performance of some hand-tuned optimizers, these approaches usually require considerable expert experience and arduous tuning effort. Therefore, "parameter-free" adaptive methods, popularized by the D-Adaptation method, have been gaining popularity in recent years for learning-rate-free optimization.
To further improve the D-Adaptation method, in a new paper Prodigy: An Expeditiously Adaptive Parameter-Free Learner, a research team from Samsung AI Center and Meta AI presents two novel modifications, Prodigy and resetting, that improve D-Adaptation's worst-case non-asymptotic convergence rate, achieving faster convergence and better optimization outputs.
In the Prodigy approach, the team improves upon D-Adaptation by modifying its error term with Adagrad-like step sizes. In this way, the researchers obtain provably larger step sizes while preserving the main error term, which yields a faster convergence rate for the modified algorithm. They also place an extra weight next to the gradients to prevent the algorithm from slowing down when the denominator in the step size grows too large over time.
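The two ingredients above can be illustrated with a toy 1-D sketch (this is a simplified illustration, not the paper's exact algorithm — all names and constants here are ours): gradients are weighted by the running distance estimate `d` inside an Adagrad-like denominator, and `d` itself is only ever increased, via a `max`, using a lower-bound estimate of the distance to the solution.

```python
import math

def prodigy_like_gd(grad, x0, d0=1e-6, steps=500):
    """Toy 1-D sketch of a Prodigy-style adaptive step size.

    Simplified for illustration: d-weighted gradients feed an
    Adagrad-like denominator, while d is grown monotonically from a
    lower-bound estimate of the distance to the solution.
    """
    x, d = x0, d0
    acc = 0.0   # Adagrad-like accumulator of (d * g)^2
    num = 0.0   # numerator of the distance-to-solution estimate
    wg = 0.0    # d^2-weighted running gradient sum
    for _ in range(steps):
        g = grad(x)
        # Update the lower-bound estimate of the distance to the solution;
        # d only ever increases.
        num += d * d * g * (x0 - x)
        wg += d * d * g
        d = max(d, num / (abs(wg) + 1e-12))
        # Adagrad-like step with d-weighted gradients.
        acc += (d * g) ** 2
        x -= d * g / (math.sqrt(acc) + 1e-12)
    return x

# Minimize f(x) = (x - 3)^2 starting from x0 = 0.
x_final = prodigy_like_gd(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Note that even though `d0` starts at a deliberately tiny value, the estimate grows on its own until the steps reach the scale of the distance to the solution — this is the sense in which the method is "learning-rate-free".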
Next, the team observed the unsettling fact that the convergence rate of the Gradient Descent variant of Prodigy is worse than that of the Dual Averaging variant. To remedy this, in the resetting approach, the team restarts the Dual Averaging process whenever the current distance estimate increases by more than a factor of two. This resetting process has three effects: 1) the step-size sequence is also reset, which results in larger steps; 2) the convergence of the method is proven with respect to an unweighted average of the iterates; and 3) the estimate often increases more rapidly than the standard D-Adaptation estimate. As a result, the method is significantly simpler to analyze in the non-asymptotic case.
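A toy 1-D sketch of this resetting idea (again illustrative, under our own simplifying assumptions rather than the paper's exact algorithm) might look as follows: run a dual-averaging loop anchored at a reference point, maintain a D-Adaptation-style distance estimate `d`, and restart all accumulators from the current iterate whenever `d` has more than doubled since the last restart.

```python
import math

def dual_averaging_with_resetting(grad, x0, d0=1e-6, steps=2000):
    """Toy 1-D sketch of D-Adaptation-style Dual Averaging with resets.

    Illustrative simplification: whenever the distance estimate d more
    than doubles, the averaging (and hence the step-size sequence) is
    restarted from the current point.
    """
    x, d = x0, d0
    anchor = x0        # point the dual averaging is anchored at
    d_at_reset = d     # value of d at the last restart
    g_sum, sq_sum, num = 0.0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        g_sum += g
        sq_sum += g * g
        # Lower-bound estimate of the distance from the anchor to the solution.
        num += g * (anchor - x)
        d = max(d, num / (abs(g_sum) + 1e-12))
        if d > 2.0 * d_at_reset:
            # Reset: restart the averaging and step-size sequence
            # from the current iterate.
            anchor, d_at_reset = x, d
            g_sum, sq_sum, num = 0.0, 0.0, 0.0
            continue
        # Dual-averaging step taken from the anchor, not the iterate.
        x = anchor - d * g_sum / (math.sqrt(sq_sum) + 1e-12)
    return x

# Minimize f(x) = (x - 3)^2 starting from x0 = 0.
x_final = dual_averaging_with_resetting(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Because each restart discards the old accumulators, the first steps after a reset are not dragged down by the history of small early step sizes, which is what makes the larger steps possible.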
In their empirical study, the team applied the proposed algorithms to both convex logistic regression and deep learning problems. Prodigy demonstrates faster adaptation than other known methods across various experiments; D-Adaptation with resetting achieves the same theoretical rate as Prodigy while having a much simpler theory than Prodigy or even D-Adaptation. Moreover, both proposed approaches consistently surpass the D-Adaptation algorithm and even match the test accuracy of hand-tuned Adam.
The paper Prodigy: An Expeditiously Adaptive Parameter-Free Learner is available on arXiv.
Author: Hecate He | Editor: Chain Zhang