# A Look at ‘Escaping From Saddle Points on Riemannian Manifolds’


In mathematics, a saddle point is a stationary point where a function reaches a local maximum along one direction and a local minimum along another. The graph of a function near a saddle point resembles a horse's saddle, hence the name. In deep learning, however, optimization algorithms can settle at saddle points more often than researchers would like. A new paper from the University of Washington and the University of California, Berkeley looks at saddle points on Riemannian manifolds. The paper, Escaping From Saddle Points on Riemannian Manifolds, was presented at NeurIPS 2019, and in this article Synced takes a deep dive into this important research.
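To make this concrete, here is a minimal sketch (my own example, not from the paper) using the classic saddle f(x, y) = x^2 - y^2: the gradient vanishes at the origin, yet the Hessian has eigenvalues of both signs, which is exactly the signature of a saddle point.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin:
# the gradient vanishes there, but the Hessian is indefinite.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

origin = np.zeros(2)
print(np.allclose(grad(origin), 0))   # True: the origin is a stationary point
eigvals = np.linalg.eigvalsh(hessian)
print(eigvals)                        # [-2.  2.]: one negative, one positive -> saddle
```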

The study of optimization typically means maximizing or minimizing some function over some set. The set often represents a range of choices available subject to constraints, and the function allows comparison of the different choices within the set to determine which is "best."

Learning, on the other hand, is the procedure by which a model iteratively adjusts itself to minimize some error function, or maximize some reward function, over a set of inputs. Take simple linear regression as an example. If the error function is the mean squared error between the output of the model and the true output of the data, the learning procedure is to find the coefficients a and b of the linear function y = a^Tx + b such that the error between y (the output of the model) and y* (the true output) is minimized. In practice, learning (i.e. optimization) is usually done iteratively using gradient descent algorithms. At every iteration, the coefficients a and b are the choices (within the set of possible a and b values), and the algorithm updates them toward values that reduce the error function. Thus, learning is ultimately an optimization problem. The new University of Washington and University of California, Berkeley paper takes a deeper look into the optimization details that underpin the mathematics of machine learning.
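As a toy illustration of this loop (my own example, not from the paper), the following fits a and b by batch gradient descent on the mean squared error; the data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_a, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_a + true_b + 0.01 * rng.normal(size=100)  # noisy targets

a, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    resid = X @ a + b - y              # model output minus targets
    grad_a = 2 * X.T @ resid / len(y)  # gradient of the MSE w.r.t. a
    grad_b = 2 * resid.mean()          # gradient of the MSE w.r.t. b
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # close to true_a = [2, -1] and true_b = 0.5
```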

Gradient-based optimization methods are the most popular choice for finding local optima in classical minimization, but getting stuck at a saddle point is a real risk for them. The optimization problem aims to find a stationary point at which the objective function attains a local minimum. Saddle points, however, are stationary points that are not local optima. Thus, it is important to learn how to recognize saddle points and escape from them. This paper presents a novel method for doing just that when the objective function is subject to manifold constraints.

### Setup

The paper considers minimizing a non-convex smooth function f on a smooth manifold M:

min_{x ∈ M} f(x),

where M is a d-dimensional smooth manifold and f is twice differentiable, with a ρ-Lipschitz Hessian. Finding global minima for such problems is hard, so the goal of this paper is to find an approximate second-order stationary point, i.e. a local minimum, using first-order optimization methods. Upon reaching a stationary point, the authors introduce a method to identify whether it is a saddle point or a local minimum. In the case of a saddle point, they discuss a method to escape it and converge toward a local minimum.

With the increasing number of applications that can be modeled as large-scale optimization problems, simpler first-order methods have gained importance. Thus, this paper explores how first-order methods fare when applied to non-convex problems. Specifically, the authors examine the optimization of non-convex problems constrained to Riemannian manifolds.

### Background preliminaries

Before we go deeper, it's important to understand some of the underlying mathematical concepts. Ideally, the paper requires the reader to have a basic understanding of Gaussian geometry, i.e. the geometry of curves and surfaces in three-dimensional Euclidean space, as well as some differential geometry. I will try to explain the significance of the terms used in the paper, and for interested readers I list a few references on these building blocks toward the end.

Every smooth d-dimensional manifold M is locally diffeomorphic to R^d: every point of M has a small neighborhood that can be smoothly identified with an open subset of R^d. Visually, this means that near any point, M can be described by d coordinates, just as Euclidean space can.

Next we need the notion of the tangent space TxM of a differentiable manifold M at a point x ∈ M. The tangent space is a vector space of the same dimension as M. One concept readers should familiarize themselves with at this point: in standard R^n, a tangent vector v at a point x ∈ R^n can be interpreted as a first-order linear differential operator acting on real-valued functions defined locally around x. This interpretation generalizes to the manifold setting.
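As a quick numeric illustration (my own, not from the paper), a tangent vector v acting on a function f is just the directional derivative of f along v, which we can check against a finite difference:

```python
import numpy as np

# A tangent vector v at x acts on a function f as the directional
# derivative: v(f) = d/dt f(x + t v) evaluated at t = 0.
def f(p):
    return p[0]**2 + 3 * p[1]

x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])

t = 1e-6
numeric = (f(x + t * v) - f(x - t * v)) / (2 * t)  # central difference
exact = np.array([2 * x[0], 3.0]) @ v              # <grad f(x), v>
print(numeric, exact)  # both approximately -2.0
```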

Next we discuss Riemannian manifolds. A Riemannian manifold is a manifold equipped with a Riemannian metric, which provides a scalar product on each tangent space and can be used to measure angles and lengths of curves on the manifold. This defines a distance function and turns the manifold into a metric space in a natural way.

Assuming the reader is familiar with the notion of curvature, we can introduce the Riemann curvature tensor and the sectional curvatures of a Riemannian manifold. The Riemann curvature tensor can be understood as a notion of curvature for a manifold, and agrees with the intuition we have about curvature.


The definition of sectional curvature can be found in standard sources. But the idea of sectional curvature is to assign curvatures to planes. The sectional curvature of a plane P in the tangent space is proportional to the Gaussian curvature of the surface swept out by geodesics with starting directions in P. Intuitively you’re taking a two-dimensional slice through a given plane, and measuring the classical two-dimensional curvature of this slice.

### Notation background

This paper considers a smooth, d-dimensional Riemannian manifold (M, g), equipped with a Riemannian metric g. The tangent space at a point x ∈ M is denoted TxM. The notion of a neighborhood is defined by Bx(r) = { v ∈ TxM, ||v|| ≤ r }, a ball of radius r in TxM centered at 0. At any point x ∈ M, the metric g induces a natural inner product on the tangent space, denoted ⟨ · , · ⟩ : TxM × TxM → R. The Riemannian curvature tensor is denoted R(x)[u, v], where x ∈ M and u, v ∈ TxM. The sectional curvature K(x)[u, v], for x ∈ M and u, v ∈ TxM, is defined as

K(x)[u, v] = ⟨R(x)[u, v]u, v⟩ / (⟨u, u⟩⟨v, v⟩ − ⟨u, v⟩^2).

The distance induced by the Riemannian metric between two points x, y ∈ M is denoted d(x, y). A geodesic γ : R → M is a constant-speed curve whose length equals d(x, y); that is, a geodesic is a shortest path on the manifold linking x and y.

The exponential map Exp_x(v) maps v ∈ TxM to a point y ∈ M such that there exists a geodesic γ with γ(0) = x, γ(1) = y and (d/dt)γ(0) = v. This is a hard concept to convey, but think of the exponential map as pushing the point x in the direction of the tangent vector v: travel for 1 unit of time along the geodesic γ with initial velocity v, and you arrive at the point y.

Another way to think about the exponential map is through the idea of projection. Consider a point p1 with coordinates (x_1, y_1, z_1), where z_1 = 0, i.e., p1 lies in the x-y plane. Add to p1 a vector p2 = (x_2, y_2, z_2) with z_2 not equal to zero, giving p3 = p1 + p2. Now project p3 back onto the x-y plane to get p4, which by the projection theorem is the point of the plane closest to p3. A geodesic is the shortest possible path between two points on a sphere or other curved surface, or in this case the manifold, and the exponential map plays a role analogous to the projection operator in this example.
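On the unit sphere the exponential map has a closed form, Exp_x(v) = cos(||v||) x + sin(||v||) v/||v|| for v tangent to x, which makes a small sketch possible (my own illustration, not from the paper):

```python
import numpy as np

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere: travel 1 unit of time along
    the geodesic starting at x with initial velocity v (v orthogonal to x)."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

x = np.array([0.0, 0.0, 1.0])        # north pole
v = np.array([np.pi / 2, 0.0, 0.0])  # tangent vector at x, length pi/2
y = exp_map_sphere(x, v)
print(y)  # a quarter great circle later: [1, 0, 0] up to rounding
```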

## Main Algorithm

The main algorithm for optimizing on a Riemannian manifold is as follows:

Algorithm 1 may look incredibly intimidating, but the summary given by the authors is quite useful for understanding the mechanisms of the algorithm. The algorithm works as follows:

Algorithm 1 relies on the manifold’s exponential map, and is useful for cases where this map is easy to compute (true for many common manifolds). The authors show this with a visualization.

The function f is defined as f(x) = (x_1)^2 − (x_2)^2 + 4(x_3)^2 and is plotted in Figure 1. The algorithm initializes at x_0, a saddle point. Noise is added to x_0 in its tangent space, and the exponential map takes the perturbed point back to a point x_1 on the manifold. The algorithm then runs Riemannian gradient descent and terminates at x*, a local minimum.
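Restricting the same f to the unit sphere gives a runnable sketch of this perturb-then-descend procedure (my own illustration, not the authors' code; the step size, noise scale, and iteration count are arbitrary choices). On the sphere, the Riemannian gradient is the Euclidean gradient projected onto the tangent space.

```python
import numpy as np

def f(x):
    return x[0]**2 - x[1]**2 + 4 * x[2]**2

def egrad(x):
    return np.array([2 * x[0], -2 * x[1], 8 * x[2]])

def rgrad(x):
    # Riemannian gradient on the unit sphere: project the Euclidean
    # gradient onto the tangent space at x.
    g = egrad(x)
    return g - (g @ x) * x

def exp_map(x, v):
    n = np.linalg.norm(v)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * v / n

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])  # a saddle of f restricted to the sphere

# Perturb in the tangent space, then map back onto the manifold.
noise = 1e-3 * rng.normal(size=3)
noise -= (noise @ x) * x       # keep the noise tangent to the sphere at x
x = exp_map(x, noise)

# Riemannian gradient descent.
for _ in range(1000):
    x = exp_map(x, -0.05 * rgrad(x))

print(f(x))  # approximately -1, the minimum, attained near (0, +/-1, 0)
```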

## Concepts behind the Main Theorem

The idea behind the algorithm is quite simple, but the underlying proof is much more involved. I will try to provide a brief intuition of the result and provide some insight. Refer to the paper as well as literature for the detailed proofs.

### Assumptions

The paper makes a few assumptions: the first two are on f and the last one is on M.

Briefly, Lipschitzness is a quantitative version of continuity: if x and y are a given distance d(x, y) apart, then a Lipschitz function puts a quantitative upper bound on how far apart f(x) and f(y) can be. If C is the Lipschitz constant of f, this distance is at most C·d(x, y). Assumptions 1 and 2 are standard Lipschitz conditions on the gradient and Hessian, i.e., the constants β and ρ describe the worst-case stretching behavior of the gradient and Hessian, respectively. These assumptions bound how rapidly the gradient and Hessian can change from point to point.
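As a sanity check (my own example, not from the paper), one can verify a gradient-Lipschitz bound numerically for the toy function f(x, y) = x^2 − y^2, whose Hessian has spectral norm 2, so the gradient is β-Lipschitz with β = 2:

```python
import numpy as np

# Numerically check that ||grad f(x) - grad f(y)|| <= beta * ||x - y||
# for f(x, y) = x^2 - y^2, whose Hessian is diag(2, -2).
def grad(p):
    return np.array([2 * p[0], -2 * p[1]])

beta = 2.0
rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(grad(x) - grad(y))
    ok = ok and (lhs <= beta * np.linalg.norm(x - y) + 1e-9)
print(ok)  # True: no sampled pair violates the bound
```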

Similarly, Assumption 3 places an upper bound on the sectional curvature of M. Intuitively, this assumption ensures the exponential maps do not blow up. For example, consider a point x ∈ M with tangent space TxM, and suppose the curvature around x is extremely large. The exponential map Exp_x(v) of a point v ∈ TxM might then land at a point y ∈ M that is extremely far from x. This would render the algorithm useless, as the algorithm only wishes to perturb slightly from a (saddle) point so that it can escape and move toward another stationary point.

### Main Theorem

The main theorem is stated below. Essentially, the theorem guarantees the rate of decrease in objective value function (converging to a stationary point). The proof strategy is to show with high probability that the function value decreases in a certain number of iterations when at an approximate saddle point.

The above expression may seem very complicated, but some insights can be extracted.

1. Looking at the involvement of β within the big-O notation and the step-size rule, we can see that the larger the Lipschitz constant β of the gradient, the longer the algorithm will take to converge.
2. The involvement of ρ is tied directly to the parameter ϵ: the perturbed Riemannian gradient descent algorithm escapes saddle points at a rate of order 1/ϵ^2.
3. The dimension d of the manifold is another parameter affecting the rate; d enters the convergence rate only logarithmically.

The authors further prove that after a perturbation is made and T steps of the algorithm are executed, if the iterate is far from the saddle point, then the function value is guaranteed to have decreased.

### Conclusion

In the paper, the authors examined the constrained optimization problem of minimizing f(x) over a manifold M, under smoothness assumptions on both f and M. They proved that as long as the function and the manifold are appropriately smooth, a perturbed Riemannian gradient descent algorithm will escape saddle points.

Escaping From Saddle Points on Riemannian Manifolds is on arXiv.

Author: Joshua Chou | Editor: Joni Zhong & Michael Sarazen