Gradient Descent Optimization With AMSGrad

We can apply gradient descent with AMSGrad to the test problem.
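For reference, the two-dimensional test objective used in this tutorial is the sum of the squared inputs; it is repeated here (matching the complete listing later in this section) so the snippets below can be run on their own.

# objective function: f(x, y) = x^2 + y^2, with a global minimum of 0.0 at x = y = 0
def objective(x, y):
    return x**2.0 + y**2.0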

First, we need a function that calculates the derivative of the objective function.

The derivative of x^2 is x * 2 in each dimension.

f(x) = x^2
f'(x) = x * 2
The derivative() function implements this below.

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
Next, we can implement gradient descent optimization with AMSGrad.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.
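For example, a two-variable search space bounded to [-1.0, 1.0] in each dimension (the same bounds used later in this tutorial) would be defined as follows; this is a small illustrative check rather than part of the worked example:

from numpy import asarray

# one row per dimension: [minimum, maximum]
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
print(bounds[:, 0])  # per-dimension minimums: [-1. -1.]
print(bounds[:, 1])  # per-dimension maximums: [1. 1.]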

# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Next, we need to initialize the first moment vector (m), the second moment vector (v), and the running maximum of the second moment vector (vhat).

# initialize moment vectors
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
vhat = [0.0 for _ in range(bounds.shape[0])]
We then run a fixed number of iterations of the algorithm defined by the “n_iter” hyperparameter.

# run iterations of gradient descent
for t in range(n_iter):
    ...

The first step is to calculate the derivative for the current set of parameters.

# calculate gradient g(t)
g = derivative(x[0], x[1])
Next, we need to perform the AMSGrad update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency; a vectorized sketch is shown after the complete amsgrad() function below.

# build a solution one variable at a time
for i in range(x.shape[0]):
    ...

First, we need to calculate the first moment vector.

# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
Next, we need to calculate the second moment vector.

# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
Then we take the maximum of the second moment vector from the previous iteration and its current value.

# vhat(t) = max(vhat(t-1), v(t))
vhat[i] = max(vhat[i], v[i])
Finally, we can calculate the new value for the variable.

# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
x[i] = x[i] - alpha * m[i] / sqrt(vhat[i])
We may want to add a small value to the denominator to avoid a divide by zero error; for example:

# x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
We can tie all of this together into a function named amsgrad() that takes the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]
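As noted above, NumPy vector operations can replace the per-variable loop for efficiency. The sketch below shows one way the per-iteration update could be vectorized; the helper name amsgrad_step and its default arguments are illustrative assumptions rather than part of the example above, and it expects x, g, m, v, and vhat to be NumPy arrays instead of Python lists.

import numpy as np

# sketch: one vectorized AMSGrad update step (illustrative; not used in the listing above)
def amsgrad_step(x, g, m, v, vhat, t, alpha=0.007, beta1=0.9, beta2=0.99, eps=1e-8):
    b1t = beta1**(t + 1)                       # decayed beta1(t), matching the loop above
    m = b1t * m + (1.0 - b1t) * g              # first moment vector
    v = beta2 * v + (1.0 - beta2) * g**2       # second moment vector
    vhat = np.maximum(vhat, v)                 # element-wise running max of the second moment
    x = x - alpha * m / (np.sqrt(vhat) + eps)  # parameter update
    return x, m, v, vhat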
We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 100 iterations with a learning rate of 0.007, a beta1 of 0.9, and a beta2 of 0.99, found after a little trial and error.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
At the end of the run, we will report the best solution found.

# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
Tying all of this together, the complete example of AMSGrad gradient descent applied to our test problem is listed below.

# gradient descent optimization with amsgrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
Running the example applies the optimization algorithm with AMSGrad to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>90 f([-5.74880707e-11 2.16227707e-03]) = 0.00000
>91 f([-4.53359947e-11 2.03974264e-03]) = 0.00000
>92 f([-3.57526928e-11 1.92415218e-03]) = 0.00000
>93 f([-2.81951584e-11 1.81511216e-03]) = 0.00000
>94 f([-2.22351711e-11 1.71225138e-03]) = 0.00000
>95 f([-1.75350316e-11 1.61521966e-03]) = 0.00000
>96 f([-1.38284262e-11 1.52368665e-03]) = 0.00000
>97 f([-1.09053366e-11 1.43734076e-03]) = 0.00000
>98 f([-8.60013947e-12 1.35588802e-03]) = 0.00000
>99 f([-6.78222208e-12 1.27905115e-03]) = 0.00000
Done!
f([-6.78222208e-12 1.27905115e-03]) = 0.000002