Gradient Descent Optimization With AdaMax

We can apply gradient descent with AdaMax to the test problem.

First, we need a function that calculates the derivative of the objective function.

The derivative of x^2 is x * 2 in each dimension.

f(x) = x^2
f'(x) = x * 2
The derivative() function implements this below.

```python
# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])
```
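As a quick sanity check, we can compare the analytic derivative against a central finite-difference approximation. The `finite_difference()` helper below is not part of the tutorial; it is a small illustrative function:

```python
from numpy import asarray

# derivative of objective function (as above)
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# hypothetical helper: central finite-difference estimate of the gradient
def finite_difference(x, y, eps=1e-6):
	fx = (((x + eps)**2.0 + y**2.0) - ((x - eps)**2.0 + y**2.0)) / (2.0 * eps)
	fy = ((x**2.0 + (y + eps)**2.0) - (x**2.0 + (y - eps)**2.0)) / (2.0 * eps)
	return asarray([fx, fy])

# both should agree closely at any point, e.g. (0.3, -0.7)
analytic = derivative(0.3, -0.7)
numeric = finite_difference(0.3, -0.7)
print(analytic, numeric)
```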
Next, we can implement gradient descent optimization with AdaMax.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```
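To see how this formula scales a unit-interval sample into the bounds, here is a standalone illustration using fixed values in place of the random sample (the values are illustrative, not from the seeded run): 0.0 maps to the lower bound and 0.5 maps to the midpoint.

```python
from numpy import asarray

# bounds: one row per dimension, columns are [min, max]
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# a fixed "random" sample in [0, 1) for illustration
r = asarray([0.0, 0.5])

# scale the unit-interval sample into the search bounds
x = bounds[:, 0] + r * (bounds[:, 1] - bounds[:, 0])
print(x)
```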
Next, we need to initialize the moment vector and exponentially weighted infinity norm.

```python
# initialize moment vector and weighted infinity norm
m = [0.0 for _ in range(bounds.shape[0])]
u = [0.0 for _ in range(bounds.shape[0])]
```
We then run a fixed number of iterations of the algorithm, defined by the "n_iter" hyperparameter.

```python
# run iterations of gradient descent
for t in range(n_iter):
	...
```

The first step is to calculate the derivative for the current set of parameters.

```python
# calculate gradient g(t)
g = derivative(x[0], x[1])
```
Next, we need to perform the AdaMax update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.
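As a sketch of what that could look like, the per-variable updates described below can be expressed with NumPy array operations, using element-wise `maximum()` and `absolute()` in place of a Python loop. This vectorized variant is my own rewrite for illustration, not the code from the tutorial:

```python
from numpy import asarray, maximum, absolute

# one vectorized AdaMax update step (illustrative rewrite of the per-variable loop)
def adamax_step(x, g, m, u, t, alpha=0.02, beta1=0.8, beta2=0.99):
	# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
	m = beta1 * m + (1.0 - beta1) * g
	# u(t) = max(beta2 * u(t-1), abs(g(t))), element-wise
	u = maximum(beta2 * u, absolute(g))
	# step_size(t) = alpha / (1 - beta1^(t+1))
	step_size = alpha / (1.0 - beta1**(t + 1))
	# x(t) = x(t-1) - step_size(t) * m(t) / u(t)
	x = x - step_size * (m / u)
	return x, m, u

# illustrative single step on the test problem from the point (0.8, 1.0)
x = asarray([0.8, 1.0])
m = asarray([0.0, 0.0])
u = asarray([0.0, 0.0])
g = asarray([x[0] * 2.0, x[1] * 2.0])
x, m, u = adamax_step(x, g, m, u, t=0)
print(x)
```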

```python
# build a solution one variable at a time
for i in range(x.shape[0]):
	...
```

First, we need to calculate the moment vector.

```python
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
```
Next, we need to calculate the exponentially weighted infinity norm.

```python
# u(t) = max(beta2 * u(t-1), abs(g(t)))
u[i] = max(beta2 * u[i], abs(g[i]))
```
Then we calculate the step size used in the update.

```python
# step_size(t) = alpha / (1 - beta1^(t+1))
step_size = alpha / (1.0 - beta1**(t+1))
```
And then the change to apply to the variable.

```python
# delta(t) = m(t) / u(t)
delta = m[i] / u[i]
```
Finally, we can calculate the new value for the variable.

```python
# x(t) = x(t-1) - step_size(t) * delta(t)
x[i] = x[i] - step_size * delta
```
This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

```python
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
```
We can tie all of this together into a function named adamax() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters and returns the best solution found at the end of the search and its evaluation.

```python
# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1^(t+1))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]
```
We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 60 iterations with an initial learning rate (alpha) of 0.02, a beta1 of 0.8, and a beta2 of 0.99, found after a little trial and error.

```python
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for the exponentially weighted infinity norm
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
```
At the end of the run, we will report the best solution found.

```python
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
```
Tying all of this together, the complete example of AdaMax gradient descent applied to our test problem is listed below.

```python
# gradient descent optimization with adamax for a two-dimensional test function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1^(t+1))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for the exponentially weighted infinity norm
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))
```
Running the example applies the optimization algorithm with AdaMax to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

```
>33 f([-0.00122185 0.00427944]) = 0.00002
>34 f([-0.00045147 0.00289913]) = 0.00001
>35 f([0.00022176 0.00165754]) = 0.00000
>36 f([0.00073314 0.00058534]) = 0.00000
>37 f([ 0.00105092 -0.00030082]) = 0.00000
>38 f([ 0.00117382 -0.00099624]) = 0.00000
>39 f([ 0.00112512 -0.00150609]) = 0.00000
>40 f([ 0.00094497 -0.00184321]) = 0.00000
>41 f([ 0.00068206 -0.002026 ]) = 0.00000
>42 f([ 0.00038579 -0.00207647]) = 0.00000
>43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000
>44 f([-0.00014145 -0.00187632]) = 0.00000
>45 f([-0.00031698 -0.00167338]) = 0.00000
>46 f([-0.00041753 -0.00143134]) = 0.00000
>47 f([-0.00044531 -0.00116942]) = 0.00000
>48 f([-0.00041125 -0.00090399]) = 0.00000
>49 f([-0.00033193 -0.00064834]) = 0.00000
Done!
f([-0.00033193 -0.00064834]) = 0.000001
```