Stochastic & Batch Gradient Descent

Vanilla Gradient Descent updates the weights only once per epoch: it averages the gradient over all the training instances and applies a single update at the end of the epoch. That one averaged step can wash out useful per-example signal and makes progress slowly, so Stochastic GD instead updates the weights with the loss of every single instance in every epoch. That's a lot of updates! All those tiny, single-sample steps make the optimization curve noisy and can make training time consuming as well.
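
To make that concrete, here is a toy NumPy sketch of both update schedules on a small made-up linear-regression problem (the data, learning rates and epoch counts are invented purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # toy features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Vanilla (full-batch) GD: one update per epoch, gradient averaged over ALL instances
w = np.zeros(3)
for epoch in range(50):
    grad = X.T @ (X @ w - y) / len(X)
    w -= 0.1 * grad

# Stochastic GD: one update per training instance, every epoch
w = np.zeros(3)
for epoch in range(50):
    for i in rng.permutation(len(X)):
        xi, yi = X[i], y[i]
        grad = xi * (xi @ w - yi)                     # gradient from a single instance
        w -= 0.01 * grad                              # smaller step, since single-instance updates are noisy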

Mini-batch Gradient Descent is the middle ground between Vanilla and Stochastic. It splits the dataset into small batches and updates the weights at the end of every batch. This not only makes the optimization smoother than pure SGD and faster than a full-batch pass, but also helps when the dataset is huge and you cannot load all of it into memory at once.

Mini-batch GD is what most Deep Learning applications use, since they typically work on huge datasets and even bigger, more complex neural networks. Dividing 100000 data points into batches of, say, 32 makes sure that you don't overload your RAM and crash the training job!
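
Here is a rough sketch of that mini-batch loop on a made-up dataset (batch size 32, learning rate and epoch count are illustrative; a real job would stream the batches from disk instead of holding everything in one array):

import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 100_000, 32
X = rng.normal(size=(n, 3))                           # stand-in for a large dataset
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

w = np.zeros(3)
for epoch in range(5):
    idx = rng.permutation(n)                          # shuffle once per epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]         # only 32 rows are touched per update
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ w - yb) / len(batch)      # gradient averaged over the mini-batch
        w -= 0.05 * grad                              # one weight update per batch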

#datascience #ml #machinelearning #deeplearning

Batch Gradient Descent :- It uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (e.g., 1 million records). To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
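
As a quick illustration of that tuning idea, here is a tiny hand-rolled grid search over the learning rate for full-batch GD on a toy linear-regression problem, with the iteration count capped (all the numbers are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=5000)

def batch_gd_mse(lr, max_iters=200):
    # Full-batch GD on linear regression; the iteration cap cuts off learning rates that converge too slowly
    w = np.zeros(3)
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return np.mean((X @ w - y) ** 2)

grid = [0.001, 0.01, 0.1, 0.3]
best_lr = min(grid, key=batch_gd_mse)                 # learning rate with the lowest error after the capped run
print("best learning rate:", best_lr)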

Stochastic Gradient Descent :- It just picks a random instance from the training set at every step and computes the gradient based on that single instance. This makes each step much faster, since there is very little data to manipulate at every iteration.

It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration. On the other hand, due to its random nature, this algorithm is much less regular than Batch Gradient Descent: instead of decreasing smoothly, the cost bounces up and down, going down only on average.
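
A bare-bones sketch of that sampling loop on a toy problem (the data, step size and number of steps are invented); tracking the per-step loss shows the bouncy, irregular behaviour described above:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=2000)

w = np.zeros(3)
losses = []
for step in range(2000):
    i = rng.integers(len(X))                          # pick one random instance per step
    xi, yi = X[i], y[i]
    err = xi @ w - yi
    w -= 0.01 * xi * err                              # update from that single instance's gradient
    losses.append(0.5 * err ** 2)                     # this per-step loss jumps around instead of falling smoothly

print("first few losses:", [round(l, 3) for l in losses[:5]])
print("last few losses:", [round(l, 3) for l in losses[-5:]])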