A Google-led research team has introduced a new method for optimizing neural network parameters that outpaces common first-order methods on large-scale problems. The researchers say their design and implementation are the first to demonstrate the power and scalability of second-order optimization methods in practice.
Commonly used optimizers such as Adam, AdaGrad and SGD with momentum are all first-order gradient methods, and they deliver the best performance among SOTA optimizers in practice. Second-order optimization methods have a stronger theoretical foundation and much faster convergence, but their application has been limited by prohibitively high computation, memory, and communication costs. The newly revised Shampoo second-order optimizer makes this theoretically promising idea practical.
The proposed optimizer builds on Shampoo, a 2018 structure-aware preconditioning algorithm for stochastic optimization over tensor spaces. The team, from Google Brain/Research, Tel Aviv University and Princeton University, made a number of major adjustments.
The researchers replaced the expensive spectral decompositions (SVD) that Shampoo used to compute its preconditioners with an economical iterative algorithm, Schur-Newton. They also cut computing costs by offloading preconditioner computation to otherwise idle CPUs, which natively support double-precision arithmetic, while the accelerators run the training loop. Preconditioners were recomputed every N steps, with the work distributed across all CPU cores available in the training system.
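To make the SVD-replacement idea concrete, here is a minimal numpy sketch of a coupled Newton iteration for the inverse p-th root of a symmetric positive-definite matrix, the kind of computation Shampoo needs for its preconditioners. This is an illustrative sketch, not the paper's implementation; the function name, the trace-based scaling, and the iteration count are all assumptions.

```python
import numpy as np

def inv_pth_root(A, p, iters=50, ridge=1e-6):
    """Approximate A^(-1/p) for a symmetric PSD matrix A via a coupled
    Newton iteration, avoiding a full spectral decomposition.

    Illustrative sketch only; names and defaults are assumptions, not
    the paper's implementation.
    """
    n = A.shape[0]
    A = A + ridge * np.eye(n)   # damping keeps A safely positive definite
    c = np.trace(A)             # cheap upper bound on the top eigenvalue
    M = A / c                   # scaled so all eigenvalues lie in (0, 1]
    X = np.eye(n)
    I = np.eye(n)
    for _ in range(iters):
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M  # M converges to the identity
    # X now approximates (A/c)^(-1/p); undo the scaling.
    return X * c ** (-1.0 / p)
```

For Shampoo's matrix case, p = 4: the left and right preconditioners are the inverse fourth roots of the accumulated gradient statistics. Each iteration uses only matrix multiplications, which is what makes it cheap to run on CPUs in lower-stakes background computation.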
The team also made two simple alterations to accommodate larger models. First, they decoupled the step size from the update direction by running inexpensive diagonal AdaGrad concurrently with Shampoo and using its step size with Shampoo's direction. Second, they bypassed preconditioning along excessively large dimensions while retaining it for a layer's remaining dimensions.
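The two alterations above can be sketched for a single 2-D weight matrix as follows. All names here are illustrative assumptions; for simplicity the inverse fourth root is computed by eigendecomposition, and the exact root exponents when a dimension is skipped may differ from the paper's choices.

```python
import numpy as np

def shampoo_adagrad_step(param, grad, L, R, adagrad_accum,
                         lr=0.1, eps=1e-8, max_precond_dim=8192):
    """One sketch update combining the two modifications: AdaGrad supplies
    the per-layer step size, Shampoo supplies the direction, and overly
    large dimensions are left unpreconditioned. Names are illustrative."""
    def inv_4th_root(A):
        w, V = np.linalg.eigh(A + eps * np.eye(A.shape[0]))
        return (V * np.maximum(w, eps) ** -0.25) @ V.T

    m, n = param.shape
    # Skip preconditioning along any dimension that is too large,
    # keeping it for the layer's remaining dimensions.
    if m <= max_precond_dim:
        L += grad @ grad.T
        left = inv_4th_root(L)
    else:
        left = np.eye(m)
    if n <= max_precond_dim:
        R += grad.T @ grad
        right = inv_4th_root(R)
    else:
        right = np.eye(n)
    shampoo_dir = left @ grad @ right

    # Diagonal AdaGrad runs alongside and sets the per-layer step size;
    # Shampoo contributes only the direction.
    adagrad_accum += grad ** 2
    adagrad_step = grad / (np.sqrt(adagrad_accum) + eps)
    scale = np.linalg.norm(adagrad_step) / (np.linalg.norm(shampoo_dir) + eps)
    return param - lr * scale * shampoo_dir
```

Because the preconditioners are positive definite, the resulting update still opposes the gradient overall, while the AdaGrad norm keeps the step magnitude in a well-tested range.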
On a standard English-to-French machine translation dataset, the optimized algorithm matched the performance of the first-order methods AdaGrad and Adam in roughly half as many steps. It converges 1.95x faster in step count while running only about 16 percent slower per step, enabling the proposed method to attain a given log-perplexity in 40 percent less wall-clock time.
After Google AI Lead Jeff Dean praised the paper on Twitter, first author Rohan Anil followed up: “One of the limiting aspects for us was the fragmented set of baselines across infrastructure and apis which made our progress slow. Adoption of this optimizer is heavily dependent on how much we can make it turn-key for practitioners… Our hope is to inspire others to implement and try out this optimizer (and improve it) on a broader set of domains (for eg. GANs, RL etc) and infrastructure and frameworks!”
The paper Second Order Optimization Made Practical is on arXiv.
Author: Reina Qi Wan | Editor: Michael Sarazen