# Gradient Descent Optimization With Adam

We can apply gradient descent with Adam to the test problem.

First, we need a function that calculates the derivative for this function.

f(x) = x^2

f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

```python
# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])
```
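As a quick sanity check (not part of the original listing), we can compare derivative() against a finite-difference approximation of the partial derivatives; the helper name approx_gradient and the test point (0.3, -0.7) are our own choices for illustration.

```python
from numpy import asarray

# derivative of the objective f(x, y) = x^2 + y^2
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# central finite-difference approximation of the partial derivatives
def approx_gradient(x, y, h=1e-6):
	fx = (((x + h)**2 + y**2) - ((x - h)**2 + y**2)) / (2.0 * h)
	fy = ((x**2 + (y + h)**2) - (x**2 + (y - h)**2)) / (2.0 * h)
	return asarray([fx, fy])

# the two should agree to within floating-point error
g = derivative(0.3, -0.7)
a = approx_gradient(0.3, -0.7)
print(max(abs(g - a)))  # should be close to zero
```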

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
```

Next, we need to initialize the first and second moments to zero.

```python
# initialize first and second moments
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
```

We then run a fixed number of iterations of the algorithm defined by the "n_iter" hyperparameter.

```python
# run iterations of gradient descent
for t in range(n_iter):
	...
```

The first step is to calculate the gradient (the vector of partial derivatives) for the current set of parameters using the derivative() function.

```python
# calculate gradient g(t)
g = derivative(x[0], x[1])
```

Next, we need to perform the Adam update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

```python
...
	# build a solution one variable at a time
	for i in range(x.shape[0]):
		...
```

First, we need to calculate the first moment.

```python
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
```

Then the second moment.

```python
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
```

Then the bias correction for the first and second moments.

```python
# mhat(t) = m(t) / (1 - beta1^t)
mhat = m[i] / (1.0 - beta1**(t+1))
# vhat(t) = v(t) / (1 - beta2^t)
vhat = v[i] / (1.0 - beta2**(t+1))
```
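To see why the bias correction matters, consider the very first iteration (t=0 in the code): because the moments start at zero, the raw first moment is scaled down by (1 - beta1), and the correction divides by exactly that factor to undo the bias. A small sketch with an illustrative gradient value of our own choosing:

```python
beta1 = 0.8
g0 = 0.5  # hypothetical first gradient value
m = 0.0
# first-moment update on iteration t=0
m = beta1 * m + (1.0 - beta1) * g0  # 0.1, biased toward zero
# bias correction divides by (1 - beta1**(t+1)) = (1 - 0.8) = 0.2
mhat = m / (1.0 - beta1**1)
print(mhat)  # -> 0.5, recovering the gradient value
```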

Then finally the updated variable value.

```python
# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
```

This is then repeated for each parameter that is being optimized.

At the end of the iteration we can evaluate the new parameter values and report the performance of the search.

```python
...
	# evaluate candidate point
	score = objective(x[0], x[1])
	# report progress
	print('>%d f(%s) = %.5f' % (t, x, score))
```

We can tie all of this together into a function named adam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

This complete function is listed below.

```python
# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1^t)
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2^t)
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]
```

Note: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
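For reference, the per-variable loop can be collapsed into NumPy array operations. This is a sketch of one equivalent vectorized update step under the same hyperparameters, not part of the original listing; the function name adam_step and the example starting point are our own:

```python
from numpy import asarray, sqrt

# one vectorized adam update step; x, g, m, v are numpy arrays
def adam_step(x, g, m, v, t, alpha=0.02, beta1=0.8, beta2=0.999, eps=1e-8):
	# update biased first and second moments for all variables at once
	m = beta1 * m + (1.0 - beta1) * g
	v = beta2 * v + (1.0 - beta2) * g**2
	# bias-correct both moments
	mhat = m / (1.0 - beta1**(t + 1))
	vhat = v / (1.0 - beta2**(t + 1))
	# apply the update elementwise
	x = x - alpha * mhat / (sqrt(vhat) + eps)
	return x, m, v

# example: one step from an illustrative starting point
x = asarray([0.5, -0.5])
m = asarray([0.0, 0.0])
v = asarray([0.0, 0.0])
g = asarray([1.0, -1.0])  # gradient of x^2 + y^2 at (0.5, -0.5)
x, m, v = adam_step(x, g, m, v, t=0)
print(x)
```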

We can then define our hyperparameters and call the adam() function to optimize our test objective function.

In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.

```python
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Tying all of this together, the complete example of gradient descent optimization with Adam is listed below.

```python
# gradient descent optimization with adam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1^t)
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2^t)
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Running the example applies the Adam optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.