AMSGrad Optimization Algorithm

The AMSGrad algorithm is an extension to the Adaptive Moment Estimation (Adam) optimization algorithm. More broadly, it is an extension to the Gradient Descent optimization algorithm.

The algorithm was described in the 2018 paper by Sashank J. Reddi, et al. titled “On the Convergence of Adam and Beyond.”

Generally, Adam automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

A limitation of Adam is that although it can decrease the step size as it approaches the optima, which is good, it can also increase the step size in some cases, which is bad.

AMSGrad addresses this specifically.

… ADAM aggressively increases the learning rate, however, […] this can be detrimental to the overall performance of the algorithm. […] In contrast, AMSGRAD neither increases nor decreases the learning rate and furthermore, decreases vt which can potentially lead to non-decreasing learning rate even if gradient is large in the future iterations.

On the Convergence of Adam and Beyond, 2018.

AMSGrad is an extension to Adam that maintains a maximum of the second moment vector and uses this maximum, rather than the second moment vector itself, to normalize the running average of the gradient used to update the parameters. This prevents the per-parameter step size from increasing during the search, addressing the convergence problem with Adam described above.

The key difference of AMSGRAD with ADAM is that it maintains the maximum of all vt until the present time step and uses this maximum value for normalizing the running average of the gradient instead of vt in ADAM.

On the Convergence of Adam and Beyond, 2018.

Let’s step through each element of the algorithm.

First, we must maintain a first moment vector, a second moment vector, and a max second moment vector for each parameter being optimized as part of the search, referred to as m, v (the Greek letter nu, although we will use v), and vhat respectively.

They are initialized to 0.0 at the start of the search.

  • m = 0
  • v = 0
  • vhat = 0
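As a minimal sketch, assuming NumPy and a search over n_params parameters (the variable name is illustrative only), the three vectors might be initialized as follows:

  import numpy as np

  # assumed number of parameters in the search
  n_params = 2

  # first moment, second moment, and max second moment vectors
  m = np.zeros(n_params)
  v = np.zeros(n_params)
  vhat = np.zeros(n_params)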

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t).

It is perhaps easier to understand the algorithm if we focus on updating a single parameter; this generalizes to updating all parameters via vector operations.

First, the gradients (partial derivatives) are calculated for the current time step.

  • g(t) = f’(x(t-1))
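For example, assuming a simple one-dimensional objective f(x) = x^2 (an assumption for illustration, not from the paper), the gradient at the current point could be computed as follows:

  # assumed objective function and its derivative for illustration
  def objective(x):
      return x ** 2.0

  def derivative(x):
      return 2.0 * x

  # current parameter value x(t-1) and its gradient g(t)
  x = 0.8
  g = derivative(x)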

Next, the first moment vector is updated using the gradient and a hyperparameter beta1.

  • m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)
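In code, for a single parameter, this update might look like the following sketch (the starting values are illustrative assumptions):

  # typical value for beta1, here held constant
  beta1 = 0.9

  # m(t-1), initialized to zero, and an example gradient g(t)
  m = 0.0
  g = 1.6

  # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
  m = beta1 * m + (1.0 - beta1) * g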

The beta1 hyperparameter can be held constant or can be decayed exponentially over the course of the search, such as:

  • beta1(t) = beta1^(t)

Or, alternatively:

  • beta1(t) = beta1 / t
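Both schedules can be sketched as simple functions of the iteration number t (the function names are illustrative only):

  beta1 = 0.9

  # exponential decay: beta1(t) = beta1^t
  def beta1_exponential(t):
      return beta1 ** t

  # alternative decay: beta1(t) = beta1 / t
  def beta1_harmonic(t):
      return beta1 / t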

The second moment vector is updated using the square of the gradient and a hyperparameter beta2.

  • v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2
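The corresponding sketch for a single parameter, again with illustrative starting values:

  # typical value for beta2
  beta2 = 0.999

  # v(t-1), initialized to zero, and an example gradient g(t)
  v = 0.0
  g = 1.6

  # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
  v = beta2 * v + (1.0 - beta2) * g ** 2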

Next, the maximum for the second moment vector is updated.

  • vhat(t) = max(vhat(t-1), v(t))

Where max() calculates the maximum of the provided values.
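In code this is a single element-wise maximum; for one parameter with illustrative values:

  # vhat(t-1), initialized to zero, and an example second moment v(t)
  vhat = 0.0
  v = 0.00256

  # vhat(t) = max(vhat(t-1), v(t))
  vhat = max(vhat, v)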

The parameter value can then be updated using the calculated terms and the step size hyperparameter alpha.

  • x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t))

Where sqrt() is the square root function.

The step size may also be held constant or decayed exponentially.
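A sketch of the update for a single parameter is shown below; note that implementations commonly add a small constant inside the denominator to avoid division by zero, which is an assumption not shown in the equation above:

  from math import sqrt

  # step size and a small constant to avoid division by zero (assumed)
  alpha = 0.002
  eps = 1e-8

  # illustrative values for x(t-1), m(t), and vhat(t)
  x = 0.8
  m = 0.16
  vhat = 0.00256

  # x(t) = x(t-1) - alpha * m(t) / sqrt(vhat(t))
  x = x - alpha * m / (sqrt(vhat) + eps)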

To review, there are three hyperparameters for the algorithm; they are:

  • alpha: Initial step size (learning rate), a typical value is 0.002.
  • beta1: Decay factor for the first moment, a typical value is 0.9.
  • beta2: Decay factor for the second moment, a typical value is 0.999.
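Tying these pieces together, a minimal sketch of the full AMSGrad loop for a one-dimensional objective is given below; the objective function, derivative, starting point, and number of iterations are all assumptions for illustration:

  from math import sqrt

  # assumed objective function and its derivative
  def objective(x):
      return x ** 2.0

  def derivative(x):
      return 2.0 * x

  # hyperparameters
  alpha = 0.002
  beta1 = 0.9
  beta2 = 0.999
  eps = 1e-8  # small constant to avoid division by zero (assumed)

  # initialize the starting point and the moment vectors
  x = 0.8
  m = 0.0
  v = 0.0
  vhat = 0.0

  # run the search for a fixed number of iterations
  n_iter = 100
  for t in range(1, n_iter + 1):
      # g(t) = f'(x(t-1))
      g = derivative(x)
      # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
      m = beta1 * m + (1.0 - beta1) * g
      # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
      v = beta2 * v + (1.0 - beta2) * g ** 2
      # vhat(t) = max(vhat(t-1), v(t))
      vhat = max(vhat, v)
      # x(t) = x(t-1) - alpha * m(t) / sqrt(vhat(t))
      x = x - alpha * m / (sqrt(vhat) + eps)

  print('x = %.6f, f(x) = %.6f' % (x, objective(x)))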

And that’s it.