AdaMax algorithm is an extension to the Adaptive Movement Estimation (Adam) Optimization algorithm. More broadly, is an extension to the Gradient Descent Optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam can be understood as updating weights inversely proportional to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinite norm (max) of past gradients.

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and past gradients

— Adam: A Method for Stochastic Optimization, 2014.

Generally, AdaMax automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

Let’s step through each element of the algorithm.

First, we must maintain a moment vector and exponentially weighted infinity norm for each parameter being optimized as part of the search, referred to as *m* and *u* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- u = 0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivatives) are calculated for the current time step.

- g(t) = f’(x(t-1))

Next, the moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

The exponentially weighted infinity norm is updated using the *beta2* hyperparameter.

- u(t) = max(beta2 * u(t-1), abs(g(t)))

Where *max()* selects the maximum of the parameters and *abs()* calculates the absolute value.

We can then update the parameter value. This can be broken down into three pieces; the first calculates the step size parameter, the second the gradient, and the third uses the step size and gradient to calculate the new parameter value.

Let’s start with calculating the step size for the parameter using an initial step size hyperparameter called *alpha* and a version of *beta1* that is decaying over time with a specific value for this time step *beta1(t)*:

- step_size(t) = alpha / (1 – beta1(t))

The gradient used for updating the parameter is calculated as follows:

- delta(t) = m(t) / u(t)

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) – step_size(t) * delta(t)

Or the complete update equation can be stated as:

- x(t) = x(t-1) – (alpha / (1 – beta1(t))) * m(t) / u(t)

To review, there are three hyperparameters for the algorithm; they are:

**alpha**: Initial step size (learning rate), a typical value is 0.002.**beta1**: Decay factor for first momentum, a typical value is 0.9.**beta2**: Decay factor for infinity norm, a typical value is 0.999.

The decay schedule for beta1(t) suggested in the paper is calculated using the initial beta1 value raised to the power t, although other decay schedules could be used such as holding the value constant or decaying more aggressively.

- beta1(t) = beta1^t

And that’s it.