## What is Gradient Descent?

The million-dollar question!

Let’s say you are playing a game where the players are at the top of a mountain and are asked to reach its lowest point. Additionally, they are blindfolded. So, what approach do you think would get you to the bottom?

Take a moment to think about this before you read on.

The best way is to feel the ground and find where the land descends. From that position, take a step in the descending direction and repeat this process until you reach the lowest point.

Finding the lowest point in a hilly landscape. (Source: Fisseha Berhane)

Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.

To find the local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point (i.e., we move in the direction opposite to the gradient). If we instead take steps proportional to the positive of the gradient (moving along the gradient), we approach a local maximum of the function, and that procedure is called **Gradient Ascent.**
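In symbols, this update is usually written as follows (a standard formulation of the rule described above, with $\theta$ the parameters, $\alpha$ the learning rate, and $f$ the function being minimized):

$$\theta_{t+1} = \theta_t - \alpha \, \nabla f(\theta_t)$$

Gradient ascent simply flips the sign: $\theta_{t+1} = \theta_t + \alpha \, \nabla f(\theta_t)$.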

Gradient descent was originally proposed by **Augustin-Louis Cauchy** in 1847. It is also known as the method of steepest descent.


The goal of the gradient descent algorithm is to minimize a given function (say, a cost function). To achieve this, it performs two steps iteratively:

1. **Compute the gradient** (slope), the first-order derivative of the function at the current point.
2. **Take a step (move) in the direction opposite to the gradient**: move from the current point by alpha times the gradient at that point.


Alpha is called **Learning rate** – a tuning parameter in the optimization process. It decides the length of the steps.
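Putting the two steps together, the core loop can be sketched in a few lines of Python. This is a toy example: the quadratic cost `f(theta) = (theta - 3)**2` and its gradient `2 * (theta - 3)` are made up for illustration.

```python
# Hypothetical cost f(theta) = (theta - 3)**2, minimized at theta = 3.
def gradient(theta):
    return 2 * (theta - 3)

alpha = 0.1   # learning rate: scales the size of each step
theta = 0.0   # initial guess

for _ in range(100):
    # Step in the direction opposite to the gradient
    theta = theta - alpha * gradient(theta)

print(theta)  # approaches 3, the minimum of the cost
```

Each iteration repeats the same two operations: evaluate the gradient at the current point, then move against it by `alpha` times its value.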

## Plotting the Gradient Descent Algorithm

When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes.
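For the single-parameter case, such a plot can be sketched with matplotlib (assumed available); the quadratic cost with its minimum at theta = 2 is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical single-parameter cost J(theta) = (theta - 2)**2
thetas = np.linspace(-2, 6, 100)
costs = (thetas - 2) ** 2   # convex bowl with its minimum at theta = 2

plt.plot(thetas, costs)
plt.xlabel("theta")
plt.ylabel("cost J(theta)")
plt.title("Cost vs. a single parameter")
plt.show()
```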

*Cost along the z-axis and parameters (thetas) along the x- and y-axes (Source: ResearchGate)*

It can also be visualized using **contours**. A contour plot shows a 3-D surface in two dimensions, with the parameters along both axes and the response drawn as contour lines. The response has the same value everywhere along a given ring and increases as we move away from the center.
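A contour plot of this kind can be sketched as follows; the bowl-shaped two-parameter cost `J(t0, t1) = t0**2 + t1**2` is a made-up example, and matplotlib is assumed available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical two-parameter cost J(theta_0, theta_1) = theta_0**2 + theta_1**2
t0, t1 = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
cost = t0 ** 2 + t1 ** 2

# Each ring is a set of points with the same cost; cost grows away from the center
plt.contour(t0, t1, cost, levels=15)
plt.xlabel("theta_0")
plt.ylabel("theta_1")
plt.show()
```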

Gradient descent using a contour plot. (Source: Coursera)

### Alpha – The Learning Rate

We have the direction we want to move in; now we must decide the size of the step to take.

**It must be chosen carefully to end up at the local minimum.**

- If the learning rate is too high, we might **overshoot** the minimum and keep bouncing around without ever reaching it.
- If the learning rate is too small, training may take too long to converge.


- a) The learning rate is optimal: the model converges to the minimum.
- b) The learning rate is too small: it takes more time, but the model still converges to the minimum.
- c) The learning rate is higher than the optimal value: it overshoots but still converges (1/C < η < 2/C).
- d) The learning rate is very large: it overshoots and diverges, moving away from the minimum.

Source: ResearchGate

**Note:** As we move towards the local minimum, the gradient decreases, and so does the size of each step. The learning rate (alpha) can therefore be kept constant over the optimization and need not be varied iteratively.
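The cases above can be reproduced numerically with a toy sketch: gradient descent on the hypothetical cost `f(theta) = theta**2` (gradient `2 * theta`), starting from theta = 1, with three different learning rates.

```python
# Toy cost f(theta) = theta**2, gradient 2*theta, minimum at theta = 0.
def run(alpha, steps=50):
    theta = 1.0
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(run(0.1))    # reasonable alpha: converges towards 0
print(run(0.001))  # too small: still far from 0 after 50 steps
print(run(1.5))    # too large: each step overshoots and the iterates diverge
```

With this cost, each step multiplies theta by (1 − 2·alpha), so the iterates shrink towards zero when that factor has magnitude below 1 and blow up otherwise.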

### Local Minima

The cost function may have many minimum points. Which minimum the descent settles in depends on the initial point (i.e., the initial parameters theta) and the learning rate. Therefore, the optimization may converge to different points for different starting points and learning rates.
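This dependence on the starting point can be seen with a toy sketch: the same descent loop run on a made-up double-well cost `f(x) = x**4 - 3*x**2`, whose gradient `4*x**3 - 6*x` has two local minima near x = ±1.2247.

```python
# Hypothetical double-well cost f(x) = x**4 - 3*x**2 with minima near ±1.2247.
def descend(x, alpha=0.01, steps=500):
    for _ in range(steps):
        x -= alpha * (4 * x**3 - 6 * x)
    return x

print(descend(2.0))   # settles in the right-hand minimum (near +1.2247)
print(descend(-2.0))  # same algorithm, different start: left-hand minimum (near -1.2247)
```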

*Convergence of the cost function from different starting points (Source: Gfycat)*

## Code Implementation of Gradient Descent in Python

*Gradient Descent Algorithm*
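A minimal sketch of the full algorithm on a simple model: batch gradient descent fitting simple linear regression with a mean squared error cost. The data, initial parameters, learning rate, and iteration count are all made up for illustration.

```python
import numpy as np

# Made-up data generated from the line y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * X + 1.0

theta0, theta1 = 0.0, 0.0   # intercept and slope, starting from zero
alpha = 0.05                # learning rate
m = len(X)

for _ in range(2000):
    pred = theta0 + theta1 * X
    error = pred - y
    # Gradients of the MSE cost with respect to each parameter
    grad0 = (2 / m) * error.sum()
    grad1 = (2 / m) * (error * X).sum()
    # Step in the direction opposite to the gradient
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should approach the true intercept 1 and slope 2
```

Each pass computes the gradient over the whole dataset before taking a step; that is what makes this *batch* gradient descent, as opposed to stochastic variants that step after each sample.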

## End Notes

Once we tune the learning parameter (alpha) and find the optimal learning rate, we iterate until we converge to the local minimum.