# Synced Tradition and Machine Learning Series | Part 2: Optimization Basics

This is the second in a special Synced series of introductory articles on traditionally theoretical fields of studies and their impact on modern-day machine learning.

This is the second in a special Synced series of introductory articles on traditionally theoretical fields of studies and their impact on modern-day machine learning. Part 1: Entropy is here.

## 1. Introduction

The purpose of this article is to provide a comprehensive account of mathematical optimization and some applications in machine learning at the introductory level. We will go over the basics of optimization and discuss some of the core concepts of optimization algorithms. We will also introduce a number of important descent algorithms, such as steepest descent, Newton’s method and variations on these two. In addition, we will extend these ideas to neural network training to provide further insight on these classic topics and their influence in machine learning.

## 2. Background

The main ideas underlying optimality conditions in optimization typically can be understood intuitively despite their proofs being sometimes tedious. I will first define the basic concepts in optimization and discuss these concepts informally. Later, I will try to hopefully provide some details regarding the proofs.

### 2.1 Local and Global Minima

A vector x* is an unconstrained local minimum of a function f:R^n ↦ R if it is no worse than its neighbours. Mathematically, x* is a unconstrained local minimum if there exists an ε > 0 such that

A vector x* is an unconstrained global minimum of f if

The unconstrained local or global minimum x* is strict if the inequalities above are strict for all x x*.

These definitions can be extended to the case where the objective function f is subject to a constraint set X ⊂ R^n. The points in X are known as feasible points. Typically, when dealing with functions subject to a feasible set, we say that x* is a local minimum of f over X if x*∈ X and there exists an ε>0 such that

Similarly, the local and global maxima can be defined with the inequalities reversed.

### 2.2 Necessary Conditions for Optimality

If the objective function f is differentiable, gradients can be used to compare the “cost” of a vector with the cost of its close neighbours. That is, we can consider small perturbations, ∆x, from a given vector x*, which approximately yield cost variations up to the first and second orders

and

respectively. We expect that if x* is an unconstrained local minimum, the first order cost variation due to a small perturbation ∆x is non-negative, that is,

This can be proven using the basic principles of calculus. I will provide the 1-dimensional proof here.

Consider the objective function f(x) with a local minimum at x*
Let t be a scalar, 0≤t≤T. The point h(t) = x*+tv, where h(t)∈X.
For 0≤t≤T, define g(t) = f(h(t)).
g(t=0) = f(h(0)) = f(x*) = the local minimum.
By calculus, g(t)-g(0) = g(0)′(t-0) + o(t)
= g(0)′t + o(t).
(o(t) is the small-o notation)

If g′(0)<0, for small t>0, we will have
RHS : g(0)′t < 0, then the LHS will become
LHS : g(t)-g(0)<0.
g(t)-g(0) must be greater or equal to zero!

Another more useful theorem for local minimum is as follows. Suppose f(x) is differentiable over some set X⊂ R^n and f attains a local minimum at x*∈ X (x* is in the interior of X), then ∇f(x*)=0. The proof for this is simple.

Since x* is in the interior of X, all directions are feasible.
If ∇f(x*)≠0 -> for δf(x*)/δ(x_i)≠0,
then for some i in {1, …, n},
if δf(x*)/δ(x_i) > 0, consider vector v = -e_i -> (δf(x*)/δ(x_i))⋅v<0,
if δf(x*)/δ(x_i) < 0, consider vector v = +e_i -> (δf(x*)/δ(x_i))⋅v<0,
where e corresponds a vector with component i = 1 and all other components 0.

The result of the two situations contradict the original statement that x*
is the local minimum. Therefore, ∇f(x*)=0 at x*!
It is also expected that the second order cost variations due to a small variation ∆x must also be non-negative, that is,

Since the first term is zero for a local minimum, we have

which implies that ∇^2 (f(x*)) is positive semidefinite. We will look at the semidefinite property later on in this article.

### 2.3 Sufficient Conditions for Optimality

If the objective function is not convex, the first and second order necessary conditions do not guarantee local optimality of x*. To give an example, the function f(x) = x^3 satisfies both the first and second order necessary conditions for optimality at x = 0. However, x = 0 is an inflection point and is neither a local maximum nor minimum.

However, we can strengthen the second order condition to obtain sufficient conditions for optimality. In particular, consider a vector x* that satisfies the first order optimality condition, i.e., ∇f(x*)=0, and also satisfies the strengthened second order necessary conditional, i.e., ∇^2 (f(x*))>0, that is,

This implies that at x*, the second order variation of f due to a small nonzero variation, ∆x, is positive. In other words, f tends to increase strictly with small excursions from x*, suggesting that the first order condition and strengthened second order conditions are sufficient for local optimality of x*.

Existence of Optimal Solutions and Why Are They Important?
In many cases, it is useful to determine whether a feasible solution exists. In other words, determine if there is at least one global minimum of a function f over a set X. The existence of at least one global minimum is guaranteed if f is continuous and X is a compact subset of R^n. This is known as the Weierstrass theorem.

Optimality conditions are fundamental to the analysis of an optimization problem. In practice, however, optimality conditions play an important role in a variety of contexts. For example, there are cases where we can use optimality conditions to solve optimization problems. This is done by

1. Find all points which satisfy the first order necessary condition ∇f(x*)=0.
2. If f is not convex, check the second order necessary condition ∇^2 (f(x*)) is positive semi-definite. Filter out the candidates that do not satisfy this condition.
3. For the remaining candidates, check if ∇^2 (f(x*)) is positive definite. The candidates that satisfy this condition are guaranteed to be a local minima.

In summary, optimality conditions are useful because

• they provide a means of guaranteeing that a candidate solution is indeed optimal (sufficient conditions), and
• they indicate when a point is not optimal (necessary conditions)
• since they provide indication of improvement (or lack of optimality), they often guide the design of optimization algorithms.

## 3. Gradient Methods – Direction

In this section, I will introduce the computational methods for unconstrained optimization.

### 3.1 Descent Direction

The main principles of unconstrained optimization methods often have simple geometrical explanations.

Consider the problem of unconstrained minimization of a continuous, differentiable function f:R^n ↦ R. In order to minimize f, we rely on iterative descent. The basic concept of iterative descent is as follows:

1. Start at some point x^[t=0] (as our initial guess), where the notation x^[t=T] represents the value of x at time t=T.
2. Generate vectors x^, x^, x^, …, such that f is decreased at each iteration,
i.e., f(x^[k+1]) ≤ f(x^[k]), k = 0, 1, 2, ….

In doing so, we successively improve our current solution estimate and hope to decrease f all the way to its minimum. Figure 1. provides a simple 2-D illustration of the idea of iterative descent. Figure 1. Let x* be the global optimal of f. Iterative descent starts at point x^, and at the next iteration, moves to x^, which has a lower value f(x^) compared to f(x^). This process is repeated until x^[t] (in this figure t=4) reaches x*.

A popular approach for choosing the descent “direction” is the use of gradient methods. Given a vector x∈R^n with ∇f(x)≠0, consider the half line of vectors

From the first order expansion around x we can obtain the following:

which we can write as

since o(α ||∇f(x)||) is a function of α. The term α ||∇f(x)||^2 dominates o(α) for small α. Therefore, for positive but sufficiently small α, f(x_α) is smaller than f(x). Figure 2. illustrates this geometrically. Figure 2. if ∇f(x)≠0, there will be an interval (0, δ) of stepsizes such thatf(x-α ∇f(x)) < f(x), for all α ∈ (0, δ).

Next, consider the half line of vectors,

where vector d ∈ R^n denotes a direction. If d makes an angle with the gradient ∇f(x) that is greater that 90 degrees, we will have

Recall that we have

Thus, for α near zero, the term α∇f(x)’d dominates o(α) and as a result, for positive and sufficiently small α, f(x+αd) is smaller than f(x). This is illustrated in Figure 3. Figure 3. The gradient vector ∇f(x) is pointing to the direction of largest rate of increase, while the red line forms a 90 degree angle with ∇f(x). Any vector above (dash-lines) the red line forms an angle <90 degrees with ∇f(x) while any vector below the red line forms an angle >90 degrees with ∇f(x).

From Figure 3., we can see that as long as the direction vector d forms an angle greater than 90 degrees to ∇f(x), we will have f(x+αd) < f(x) for a positive and sufficiently small (note this requires ∇f(x)≠0).

There are a large variety of possibilities for choosing the direction d^[k] (direction at time k) and the stepsize α^[k] in a gradient method. The above discussion provides us an insight into how to select the direction. Next, we will introduce a few traditional methods for selection of the descent direction.

### 3.2 Selecting the Descent Direction

In this section we will go over a few well-known descent directions, such as steepest descent, Newton’s method and the Gauss-Newton method.

In the most traditional sense, many gradient methods are specified in the form

where

• x^[t+1] : value of x at time step t+1
• x^[t] : value of x at time step t
• D^[t] : positive definite symmetric matrix
• ∇f(x^[t]) : value of the gradient at x^[t]

Following our previous discussion, we choose the next value of x at (t+1) according to

In this case, d^[t] = -D^[t]∇f(x^[t]) , and we know that the descent condition is

∇f(x^[t])’d^[k] < 0

and therefore,

∇f(x^[t])’d^[k] = ∇f(x^[t])'(-D^[t]∇f(x^[t]))
= -(∇f(x^[t])’D^[t]∇f(x^[t]))
< 0

will hold thanks to the positive definiteness of D^[k].

### 3.3 Steepest Descent Method

We will now discuss a well-known method known as the steepest descent algorithm.

In the steepest descent algorithm, D^[k] = I (the identity matrix) for all k. This is a simple choice but has a consequence of slow convergence.

In steepest descent, the direction d that is chosen at each time instance t is the normalized negative gradient, i.e.,

This is because, of all possible normalized directions, d ∈ R^n, – ∇f(x^[t])/ || ∇f(x^[t]) || is the one that will minimize the slope of ∇f(x^[t])‘d of the objective f(x^[t] + αd) along the direction of d at α = 0.

As mentioned earlier, steepest descent sometimes suffers from slow convergence. This can be geometrically visualized in Figure 4. If the cost function level curves as elongated ellipsoids and the starting point x^ is chosen poorly, the steepest descent algorithm will exhibit a zig-zag behaviour. This will result in slow convergence.

### 3.4 Newton’s Method

Newton’s method selects D^[t] to be

provided that ∇^2f(x^[t]) is positive definite. The main usage of Newton’s method is to minimize, at each iteration, the quadratic approximation of f around the current point x^[t] given by the following:

By setting the derivative of f^[t] to zero,

we can obtain the next iterate x^[t+1] as the minimum of f^[t](x). Plugging in x^[t+1] in the place of x, we can determine x^[t+1] as follows:

This is known as the pure Newton iteration. Note that for a pure Newton step, the stepsize α=1. In general, Newton’s method can find the global minimum of a positive definite quadratic function in a single iteration. For other cost functions, Newton’s method typically converges very fast asymptotically and does not exhibit the zig-zag behaviour as seen in steepest descent. Figure 6. illustrates how Newton’s method approximates quadratically the cost function at each iteration. Figure 6. Convergence rate of Newton’s method with stepsize α=1. Given x^[t], the method obtains x^[t+1] as the minimum of an approximation of f based on a second order expansion around x^[k].

In addition, Figure 7. illustrates Newton’s method’s ability to converge fast compared to steepest descent. Consider the function f(x) = (10x^2 + y^2 )/2 + 5 log(1 + e^(−x−y) ) and the starting point (x^, y^). We can see the difference in convergence for the steepest descent method (black) versus the Newton’s method (blue).

We can see that for steps taken with roughly the same size, Newton’s method requires significantly fewer iterations than the steepest descent algorithm.

### 3.5 Diagonally Scaled Steepest Descent

Now we introduce a variation of the steepest descent method. Instead of using D^[k] as the identity matrix, D^[k] is

where d^[t]_i are positive scalars, ensuring that D^t is positive definite. This idea of scaling is important because of the way that many scientific and engineering problems are initially formulated. Typically, these problems are due to variables having widely differing magnitudes. For example, variables have different measuring units that can lead to the optimization variables having orders of magnitude differences. Scaling can reduce these differences, and as a general rule of thumb, we would like all the variables in an optimization problem to have roughly similar magnitudes. This leads to better decisions on which search direction to choose as well as in deciding when convergence is achieved.

### 3.5 Gauss-Newton Method

We will introduce the Gauss-Newton method in this section. This is a method applied to the problem of minimizing the sum of squares of real-valued functions g_i, g_2, …, g_m. These types of problems often arise in statistical data analysis and in the context of neural network training.

Consider the following problem

where g = (g_1, g_2, …, g_m). By choosing D^[t] to be

given that the matrix ∇g(x^[t])∇g(x^[t])’ is invertible, the latter matrix is always positive semidefinite. It is positive definite and hence invertible if and only if the matrix ∇g(x^[t]) has rank n.

This is true because of the following proposition:

For any m x n matrix A, the matrix A′A is symmetric and non-negative definite.
The matrix A′A is positive definite if and only if A has rank n.
In particular, if m=n, A′A is positive definite if and only if A is non-singular.

The proof of the proposition is provided.

1. symmetry of A′A is obvious.
2. For any vector x ∈ R^n, we have x′A′Ax = ||Ax||^2 ≥ 0,
which establishes the non-negative definiteness.
3. Positive definiteness is obtained if and only if the inequality is strict for
every x≠0, which is the case if and only if Ax≠0 for every x≠0.
This is equivalent to A having rank n, and therefore non-singular.

Since

the Gauss-Newton method for the next x value can be expressed as the following:

As mentioned, the Gauss-Newton method is used to optimize objective functions which have the form of sums of squares. Thus, given the optimization problem of the following,

the algorithm works through the following steps

0. start at some initial guess x^[t=0], t = 1, 2, …,
1. linearize f around x^[t],
g(x) ≈ g(x^[t]) + g′(x^[t])(x-x^[t])
2. substitute the affine approximation of g in terms of a least square problem
minimize || g(x^[t]) + g′(x^[t])(x-x^[t]) ||^2
3. find the solution to this least squares problem and that will be x^[t+1]
4. repeat steps 1 to 3 until termination condition

## 4. Gradient Methods – Stepsize

In the previous section, we were concerned about which direction the algorithm would descend in. In this section, we will discuss the length we move in the direction of descent. In neural networks, the stepsize is often called the learning rate. There are a number of rules for choosing the step size α^[t] in a gradient method. We will introduce a few of the common ones in this section.

### 4.1 Constant Stepsize

This is the simplest stepsize rule. Here, the stepsize s is fixed throughout the algorithm. The constant stepsize rule is simple, but suffers from bad rates of convergence if not chosen correctly. With a too small stepsize the rate of convergence may be very slow while with too large a stepsize you may encounter divergence. Figure 8. illustrates the two cases.

Typically, one would use this rule only if the problem is one where a constant stepsize value is know or can be determined easily.

### 4.2 Minimization Rule

This is another simple and easy to understand method. Here, we choose α^[t] to be the value which minimizes the function f(x) along the direction d^[t]. That is, α^[t] satisfies

### 4.3 Limited Minimization Rule

This is a simple variation of the Minimization rule. For a fixed scaler s>0, α^[t] is selected such that it satisfies

That is, α^[t] is selected from a range [0, s] such that it yields the greatest cost reduction within the interval [0, s]. In general, the minimizing stepsize cannot be computed exactly, and in practice, a line search algorithm is used to stop the search for the “best” α^[t] once a certain condition is met. Compared to some other stepsize rules, minimization rules require more computation since they require more function and/or gradient evaluations per iteration. However, their use tends to reduce the overall number of required iterations for convergence in practice. This is because of the greater cost reduction at each iteration.

### 4.4 Diminishing Stepsize Rule

In the diminishing stepsize rule, the stepsize starts from α^ and slowly decreases until 0. Intuitively, this stepsize makes sense as it starts off with a rate of travel in the descent direction, then decreases the stepsize to fine-tune the result as it approaches the optimal value. However, this stepsize rule does not guarantee descent at each iteration. One difficulty with a diminishing stepsize is that it may become so small that substantial progress cannot be maintained, even when far from a stationary point.

## 5. Optimization Examples in Machine Learning

We have covered some of the main concepts of unconstrained optimization, mainly optimality conditions, algorithms and their respective hyperparameters and considerations. We will now look at some examples of these concepts applied in machine learning.

A classic use case of gradient methods is for the problem of maximizing likelihood estimation for Gaussian mixture models. Gradient methods are used to provide a generic and simple optimization approach that iteratively updates parameters to travel up the gradient (since we are maximizing likelihood). Machine learning problems often involve huge datasets, and therefore, stochastic variants of the gradient method are typically used. These methods randomly choose a sample and update the parameters to travel up the gradient for the selected sample. Note that the (stochastic) gradient method does not only necessarily give the global optimal solution but also a local optimal solution.

Gradient methods have also been used for model construction, also known as curve fitting. Such instances may occur when we want to estimate n parameters of a mathematical model so that it fits well in a physical system based on a set of input-output data. That is, we hypothesize an approximate relation of the form

where h is a known function representing the model and

• x ∈ R^n is a vector of unknown parameters,
• z ∈ R^n is the model’s output,
• y ∈ R^n is the model’s input.

Given a set of m input-output data pairs (y_1, z_1), (y_2, z_2), …, (y_m, z_m) from measurements of the physical system that we are trying to model, we want to find the vector of parameters x that best matches the data in the sense that it minimizes the sum of squared errors

Problems that can be formulated into least squares problems can be solved using the different gradient methods we discussed in the previous section. This is also the case with neural networks.

Let’s look at a multilayer perceptron model as an example. The k-th stage of the model consists of n_k activation units, each being a single input-single output mapping of a given form Φ : R ↦ R. The output of the j-th activation unit of the (k+1)-th stage is denoted by x_[k+1],j and the input is a linear function of the output vector x_k = (x_{k,1}, x_{k,2}, …, x_{k,n_k} of the k-th stage. Thus, we have

where w_{k, s_j} are the parameters/coefficients/weights to be determined. Suppose that the multiplayer perceptron has N stages, and let w denote the vector of weights of all stages, i.e.,

Then, for a given vector w of weights, an input vector x_0 to the first stage produces a unique output vector x_N from the Nth stage. We can view the multilayer perceptron as mapping h that is parameterized by w and transforming the input vector x_0 into an output vector of the form x_N = h(w, x_0). Recall that if we have m samples of input-output data pairs (y_1, z_1), (y_2, z_2), …, (y_m, z_m) from a system we are trying to model, we can try to match the mapping of the system into a sum of squared errors problem.

In neural network terminology, finding the optimal weights w is referred to as training the network, and can be formulated into a least squares problem. Typically, sigmoids and tanh functions are chosen to be the activation function Φ. Thus, for a given data set of input-output pairs (y_1, z_1), (y_2, z_2), …, (y_m, z_m) and an activation function Φ, the least square function can be formulated as

This again is a least squares problem and can be solved through gradient methods. Note however, that neural network training problems can be quite challenging as their cost functions are typically nonconvex and involve multiple local minima.

## 6. Optimization Practices in Machine Learning, Traditional and Recent

In this section, we will discuss some of the recent advances in optimization for gradient descent algorithms applied in machine learning. In the previous section, the introduction to optimization was done with a traditional optimization/mathematical background in mind. Entering a machine learning setting, we will change the notations slightly to better match machine learning notation conventions, and use the notations below throughout the rest of this section.

• θ_[t]: refers to the weights and parameters of the neural network at time t.
• g_[t]: is the (usually stochastic) gradient. The gradient is the derivative of some loss function L(θ_[t]).
• L(θ_[t]): the objective (or loss) function that is to be optimized.
• η: is the learning rate. The learning rate can be parameterized as discussed in the previous sections.

We will start off with some of the fundamental methods and slowly work our way towards the more recent gradient methods.

Vanilla gradient descent is the most basic of gradient methods. It follows the below update rule at each iteration with some learning rate η,

where the loss function is calculated with some number of samples drawn randomly from the entire training dataset. For Stochastic Gradient Descent (SGD), one sample is drawn per iteration. Because using an entire batch (dataset) is costly in terms of computation, mini-batches are generally used in practice, with their sizes typically ranging from 64 to 2048.

### Momentum

Momentum is an exponential moving average of past gradients parameterized by β, which leads to the update rule below

where

Momentum has a very simple intuitive explanation. Consider using SGD to traverse the surface of the loss function, where the surface appears as a mountain and we begin at the summit. The surface has many small crags but clearly heads down to a valley eventually. As SGD traverses the surface, the sign and magnitude of the derivative changes often, resulting in fluctuations in the direction and speed that SGD takes down the mountain. If we include a momentum term, then instead of just relying on the gradient at each new stopping point to determine the movement, the direction now also depends on the size and direction of the movement in the previous update. Furthermore, the amount and direction that SGD exhibits is a recursive behaviour, and therefore each movement depends on the entire history of previous movements.

AdaGrad is based on the intuition that weights which are infrequently updated should be updated with a larger learning rate than weights which are frequently updated. In other words, weights that rarely receive a large gradient should try to make the most of it when they do get one. This is realized by keeping a running additive sum of squared gradients per weight and dividing the learning rate by this running sum. Thus, for each element i in θ, we have the following update rule

where G_[t] is a diagonal preconditioning matrix as a summation of squares of past gradients, and ε>0 a small constant (typically to prevent the square root from being zero). These learning rate adaptive algorithms can be understood as utilizing current and past gradient information to design preconditioning matrices that better approximate the local curvature of the loss function.

### RMSprop (Root Mean Squared Backpropagation)

AdaGrad works with a running sum of squared gradients as the denominator to the learning rate. Since the running sum can only increase, an optimizer that has been running for a sufficiently long time will cause subsequent updates to be too small. RMSprop substitutes the ever-accumulating matrix G_[t] with a running average of squared gradients computed per iteration with a discount factor β_2. A good choice of β_2 is 0.9. The running average of squared gradients is computed as

where the ∘ operator represents per-coordinate multiplication. We can see that the resulting E matrix is a convex combination of the running summation of squares of past gradients and the current gradient square. Thus, the per-iteration update can be written as

The update rule is written as:

Adam was first published in 2014, and remains the most popular adaptive learning rate algorithm today.

We can observe that the result is a weighted average of the momentum and plain SGD, weighting the current gradient with an immediate discount factor ν_1, divided by the weighted average of the mean squared gradients.

### QHM (Quasi-Hyperbolic Momentum)

QHM is another relatively recent adaptive momentum algorithm which decouples the momentum term from the current gradient when updating the weights. In other words, it is a weighted average of the momentum and plain SGD, weighting the current gradient with convex combination coefficient ν. The update rules for the gradient and weights are as follows:

As we can see, many of these algorithms focus on selecting momentum hyperparameters. The common choices of 0, 0.9, or 0.99 as the momentum parameter β can yield dramatically different and unpredictable performance. This is because if β is chosen to be too small, it can significantly increase training time and reduce performance. On the other hand, if β is too large, the algorithm may exhibit oscillating behaviour and slow down convergence.

## 7. Concluding Remarks

In this review we have endeavoured to equip the reader with basic information on optimization and introduce some intuitions regarding the behaviour of different gradient descent methods and their considerations. We discussed optimality conditions and illustrated how they are important to driving the design of optimization algorithms. We also looked at the different variants of gradient methods and went over some popular design choices for descent direction and stepsize.

We then provided some machine learning examples in which these gradient methods are used. Finally, we brought the topic of gradient methods into a machine learning setting and discussed some of the most prominent gradient methods. We went over their update rules and provided some discussion into the breaking down of these rules.

We hope this article has provided the reader with an understanding of the basic concepts in optimization, the mechanisms behind some commonly used optimization methods, intuition on the convergence and rate of convergence properties of these optimization methods, and an appreciation of these algorithms’ roles in the context of machine learning.

Author: Joshua Chou | Editor: H4O & Michael Sarazen

Synced Report | A Survey of China’s Artificial Intelligence Solutions in Response to the COVID-19 Pandemic — 87 Case Studies from 700+ AI Vendors

This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering additional 1428 artificial intelligence solutions from 12 pandemic scenarios.

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates. 1. 