The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.
The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”
Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optimum, or to improve the capability of the optimization algorithm, e.g. arrive at a better final result.
This is achieved by calculating a step size for each input parameter that is being optimized. Importantly, each step size is automatically adapted throughout the search process based on the gradients (partial derivatives) encountered for each variable.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation
— Adam: A Method for Stochastic Optimization
This involves maintaining a first and second moment of the gradient, e.g. an exponentially decaying mean gradient (first moment) and variance (second moment) for each input variable.
The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.
— Adam: A Method for Stochastic Optimization
Let’s step through each element of the algorithm.
First, we must maintain a first moment vector and a second moment vector (an exponentially weighted average of the squared gradient) for each parameter being optimized as part of the search, referred to as m and v respectively. They are initialized to 0.0 at the start of the search.
- m = 0
- v = 0
The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t).
It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.
First, the gradient (partial derivative) is calculated for the current time step.
- g(t) = f’(x(t-1))
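For illustration, consider a hypothetical one-dimensional objective f(x) = x^2 with derivative f'(x) = 2x; this test function is an assumption for the sketch, not part of the algorithm itself:

```python
# Hypothetical test objective f(x) = x^2 and its first derivative f'(x) = 2x.
def objective(x):
    return x ** 2.0

def derivative(x):
    return 2.0 * x

# Gradient g(t) evaluated at the previous point x(t-1).
x_prev = 0.8
g = derivative(x_prev)
```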
Next, the first moment is updated using the gradient and a hyperparameter beta1.
- m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)
Then the second moment is updated using the squared gradient and a hyperparameter beta2.
- v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2
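The two moment updates above can be sketched for a single parameter as follows, assuming a hypothetical gradient value of g = 1.6 for the current step:

```python
beta1 = 0.9      # typical first-moment decay factor
beta2 = 0.999    # typical second-moment decay factor
m, v = 0.0, 0.0  # both moments are initialized to zero
g = 1.6          # hypothetical gradient for this step

# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m = beta1 * m + (1.0 - beta1) * g
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v = beta2 * v + (1.0 - beta2) * g ** 2
```

Because m and v start at zero, both estimates are pulled toward zero in early iterations, which is exactly the bias discussed next.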
The first and second moment estimates are biased towards zero because they are initialized with zero values.
… these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates …
— Adam: A Method for Stochastic Optimization
Next, the first and second moments are bias-corrected, starting with the first moment:
- mhat(t) = m(t) / (1 – beta1(t))
And then the second moment:
- vhat(t) = v(t) / (1 – beta2(t))
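A sketch of the bias correction, continuing with assumed values m = 0.16 and v = 0.00256 at the first iteration t = 1 (beta1(t) and beta2(t) here are beta1 and beta2 raised to the power t):

```python
beta1, beta2 = 0.9, 0.999
m, v = 0.16, 0.00256  # hypothetical uncorrected moments
t = 1                 # first iteration

# mhat(t) = m(t) / (1 - beta1^t)
mhat = m / (1.0 - beta1 ** t)
# vhat(t) = v(t) / (1 - beta2^t)
vhat = v / (1.0 - beta2 ** t)
# At t = 1 the correction exactly undoes the zero initialization:
# mhat recovers the raw gradient g, and vhat recovers g^2.
```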
Note, beta1(t) and beta2(t) refer to the beta1 and beta2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:
- beta1(t) = beta1^t
- beta2(t) = beta2^t
Finally, we can calculate the value for the parameter for this iteration.
- x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)
Where alpha is the step size hyperparameter, eps is a small value (epsilon) such as 1e-8 that ensures we do not encounter a divide by zero error, and sqrt() is the square root function.
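The update can be sketched with the assumed values from this walkthrough (bias-corrected moments mhat = 1.6 and vhat = 2.56, previous point x = 0.8):

```python
from math import sqrt

alpha = 0.001  # typical step size
eps = 1e-8     # small constant to avoid division by zero
x = 0.8                  # previous parameter value x(t-1)
mhat, vhat = 1.6, 2.56   # hypothetical bias-corrected moments

# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x = x - alpha * mhat / (sqrt(vhat) + eps)
# Because mhat / sqrt(vhat) is close to +/-1 when recent gradients
# agree in sign, the step taken is close to alpha regardless of
# the gradient's raw magnitude.
```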
Note, a more efficient reordering of the update rule listed in the paper can be used:
- alpha(t) = alpha * sqrt(1 – beta2(t)) / (1 – beta1(t))
- x(t) = x(t-1) – alpha(t) * m(t) / (sqrt(v(t)) + eps)
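A sketch of this reordered form, using the same hypothetical values (uncorrected m = 0.16 and v = 0.00256 at t = 1); it produces essentially the same update as the bias-corrected form, differing only in where eps is applied:

```python
from math import sqrt

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m, v = 0.16, 0.00256  # hypothetical uncorrected moments
t = 1
x = 0.8               # previous parameter value x(t-1)

# alpha(t) = alpha * sqrt(1 - beta2^t) / (1 - beta1^t)
alpha_t = alpha * sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
# x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(v(t)) + eps)
x = x - alpha_t * m / (sqrt(v) + eps)
```

Folding the bias corrections into the step size avoids dividing both m and v by their correction terms on every iteration.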
To review, there are three hyperparameters for the algorithm; they are:
- alpha: Initial step size (learning rate), a typical value is 0.001.
- beta1: Decay factor for the first moment estimate, a typical value is 0.9.
- beta2: Decay factor for the second moment estimate, a typical value is 0.999.
And that’s it.
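Putting the steps together, here is a minimal sketch of the full Adam loop for a single parameter; the objective f(x) = x^2 and the starting point are assumptions for illustration:

```python
from math import sqrt

def adam(derivative, x0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, n_iter=100):
    """Minimize a one-dimensional function given its derivative."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_iter + 1):
        g = derivative(x)                       # g(t) = f'(x(t-1))
        m = beta1 * m + (1.0 - beta1) * g       # first moment
        v = beta2 * v + (1.0 - beta2) * g ** 2  # second moment
        mhat = m / (1.0 - beta1 ** t)           # bias corrections
        vhat = v / (1.0 - beta2 ** t)
        x = x - alpha * mhat / (sqrt(vhat) + eps)
    return x

# Hypothetical usage: minimize f(x) = x^2 starting from x = 0.8.
result = adam(lambda x: 2.0 * x, x0=0.8, alpha=0.02, n_iter=1000)
```

With enough iterations the search settles near the minimum at x = 0. Note that each step moves x by at most roughly alpha, so the very first iteration moves x by about alpha regardless of the gradient's scale.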