Vanilla (full-batch) Gradient Descent updates the weights only once per epoch: it averages the gradient of the loss over all training instances and applies a single update at the end of the pass. That's not ideal, since such infrequent updates converge slowly and average away the signal from individual examples. Stochastic GD goes to the other extreme and updates the weights after every single training instance in every epoch. That's a lot of updates! It makes the optimization curve noisy and training time-consuming as well.
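Here's a minimal NumPy sketch of those two extremes for a toy linear model with squared loss (all names and numbers here, X, y, w, lr, the epoch count, are purely illustrative):

```python
import numpy as np

# Toy setup (illustrative only): linear model y ~ X @ w with squared loss
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lr = np.zeros(5), 0.01

# Vanilla (full-batch) GD: one update per epoch, gradient averaged over ALL samples
for epoch in range(10):
    grad = X.T @ (X @ w - y) / len(X)
    w -= lr * grad

# Stochastic GD: one update per training instance -- noisy, lots of updates per epoch
for epoch in range(10):
    for i in rng.permutation(len(X)):
        w -= lr * X[i] * (X[i] @ w - y[i])
```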
Mini-Batch Gradient Descent is the middle ground between the two. It splits the dataset into batches and updates the weights at the end of every batch. This not only makes optimization smoother and faster but also helps when the dataset is huge and you cannot load all of it into memory at once.
Mini-batch GD is what most Deep Learning applications use, since they typically work on huge datasets and even bigger, more complex neural networks. Splitting 100,000 data points into batches of, say, 32 makes sure you don't overload your RAM and crash the training job!
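And a rough sketch of the mini-batch version under the same toy assumptions (batch size 32, everything else illustrative): with 100,000 samples and batches of 32, that's 3,125 weight updates per epoch, and only 32 rows enter the gradient computation at a time.

```python
import numpy as np

# Same toy setup (illustrative): linear model with squared loss
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100_000, 5)), rng.normal(size=100_000)
w, lr, batch_size = np.zeros(5), 0.01, 32

# Mini-batch GD: shuffle, slice into batches of 32, one weight update per batch
# 100,000 samples / 32 per batch = 3,125 updates per epoch
for epoch in range(10):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
```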
#datascience #ml #machinelearning #deeplearning