Gradient Descent

In my last post I talked about how we can iteratively approximate the Ordinary Least Squares solution instead of computing it in closed form. This is where Gradient Descent comes in. (It’s decent enough :p)

Gradient Descent is a numerical method for approximating the OLS solution: rather than solving for the optimum directly, it searches for the parameters that minimize the loss function.

Gradient Descent isn’t limited to OLS, though. It can be applied to almost any optimization problem, be it ML or DL. It starts at a random point (or at values chosen by some specific initializer) and iteratively moves in the direction in which the loss decreases. For convex losses like OLS this leads to the global minimum; in general it converges to a local one.

After every iteration it computes the loss and updates the model’s weights. The size of each update is controlled by a parameter called the Learning Rate, alpha. The higher the value of alpha, the bigger the update steps, and convergence may be quicker; set it too high, though, and the updates can overshoot the minimum and diverge.
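
To make this concrete, here’s a minimal sketch of gradient descent fitting a simple linear regression (the OLS setting) with NumPy. The synthetic data, the parameter names w and b, and the value of alpha are all just illustrative choices:

```python
import numpy as np

# Minimal gradient descent for simple linear regression (MSE / OLS loss).
# Loss: L(w, b) = mean((y - (w*X + b))**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 1, size=100)  # synthetic data, true w=3, b=2

w, b = 0.0, 0.0   # start at an arbitrary point
alpha = 0.01      # learning rate

for step in range(1000):
    y_hat = w * X + b
    error = y_hat - y
    # Gradients of the MSE loss with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Move opposite to the gradient, since we are minimizing
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)  # should end up close to 3 and 2
```

With these values alpha is small enough to converge; try alpha = 0.1 on the same data and you can watch the updates overshoot and blow up.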

At its core, it’s a method for optimizing continuous, differentiable functions.

It is an iterative method: one begins at an arbitrary point in the domain, computes the gradient at that point, and then moves the test point along that direction (or the opposite direction, depending on whether you are maximizing or minimizing the function). Repeat until sufficiently close to a local optimum.
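
Stripped of the regression context, the same loop looks like this. The quadratic test function, starting point, and tolerance are just placeholders, and flipping the sign of the step turns descent into ascent:

```python
import numpy as np

# Generic gradient descent/ascent, stopping once the gradient is small enough.
# The test function here is (x-1)^2 + (y+3)^2; swap in your own gradient.

def f_grad(p):
    x, y = p
    return np.array([2 * (x - 1), 2 * (y + 3)])

p = np.array([5.0, 5.0])   # arbitrary starting point
alpha = 0.1
maximize = False           # set True to climb instead of descend

for _ in range(10_000):
    g = f_grad(p)
    if np.linalg.norm(g) < 1e-6:       # "sufficiently close" to a local optimum
        break
    direction = g if maximize else -g  # ascend or descend
    p = p + alpha * direction

print(p)  # approximately [1, -3]
```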

Different methods exist for deciding how large a step to take. A straightforward approach is a line search: find a local optimum along the step direction and jump straight to it. Another is to use a predetermined schedule of step sizes, for example one that decays exponentially (sketched below).
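
Here’s a sketch of the decaying-step-size idea on a 1-D toy function; the starting step size and decay rate are arbitrary illustrative values:

```python
# Exponentially decaying step-size schedule on f(x) = x**2.
alpha_0 = 0.2   # initial step size
decay = 0.99    # shrink factor per iteration

x = 10.0
for t in range(200):
    grad = 2 * x                     # gradient of x**2
    alpha_t = alpha_0 * decay ** t   # step size shrinks each iteration
    x -= alpha_t * grad

print(x)  # close to the minimum at 0
```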

Gradient descent is famously used to train machine learning models such as neural networks.