What Is the Difference Between Batch Gradient Descent and Stochastic Gradient Descent?
a) In SGD, before looping over the training examples, you need to shuffle them randomly.
b) Because SGD uses only one example at a time, its path to the minimum is noisier (more random) than that of batch gradient descent. That is fine: we don't care about the path, as long as it reaches the minimum and does so in less training time.
c) Mini-batch gradient descent uses n data points (instead of 1 sample in SGD) at each iteration.
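To make (a) and (b) concrete, here is a minimal sketch of SGD on a toy one-variable linear regression. The function name `sgd_fit`, the toy data, the squared-error loss and the learning rate are all my own assumptions for illustration, not something from the question.

```python
import random

def sgd_fit(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b with plain SGD: one example per parameter update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # (a) shuffle before looping over the examples
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 for this single example only,
            # so each step is a noisy estimate of the full gradient (b).
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = sgd_fit(xs, ys)
print(w, b)
```

On this noise-free data the individual steps jump around, but the parameters should still settle near the generating values w = 2 and b = 1.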
In batch gradient descent you update the model parameters only after accumulating the gradients from all samples in the dataset.
In mini-batch gradient descent you take a small subset of the whole dataset, compute the gradient from the samples in that subset only, and then update the parameters.
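The only thing that changes between the variants is how many samples feed each update. A rough sketch, assuming the same kind of toy linear regression with squared-error loss (the function name `minibatch_gd` and all the numbers are hypothetical): setting `batch_size` to the dataset size gives batch gradient descent, 1 gives SGD, and anything in between gives mini-batch.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b, updating once per batch of `batch_size` samples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += error * xs[i]
                grad_b += error
            # Average the gradient over this batch, then do one update.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
            updates += 1
    return w, b, updates

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

full = minibatch_gd(xs, ys, batch_size=len(xs))  # batch GD: 1 update per epoch
mini = minibatch_gd(xs, ys, batch_size=2)        # mini-batch: 3 updates per epoch
print(full, mini)
```

For the same number of passes over the data, the mini-batch run here makes three times as many parameter updates as the full-batch run.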
Generally mini-batch is faster because many samples have more or less similar gradient vectors, so computing the gradient over every sample before each update adds little.
A mini-batch that is big enough to approximately represent the whole dataset is enough to get a good estimate of the gradient.
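You can check the "good estimate" claim yourself. The sketch below compares the gradient averaged over a whole synthetic dataset with the gradient from one random mini-batch. The data, the batch size of 100 and the helper name `mean_gradient` are assumptions of mine, chosen only for illustration.

```python
import random

def mean_gradient(w, xs, ys, indices):
    """Average gradient of 0.5 * (w*x - y)**2 with respect to w over `indices`."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

rng = random.Random(0)
# Synthetic data: 1000 points around the line y = 3x with a little noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

w = 0.0
full_grad = mean_gradient(w, xs, ys, range(len(xs)))  # uses all 1000 samples
batch = rng.sample(range(len(xs)), 100)               # one random mini-batch
mini_grad = mean_gradient(w, xs, ys, batch)           # uses only 100 samples
print(f"full: {full_grad:.3f}  mini-batch: {mini_grad:.3f}")
```

The two numbers should point in the same direction and be of similar size, even though the mini-batch estimate costs a tenth of the computation.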
Mini-batch is also less prone to overfitting than batch gradient descent, since its updates are random, as others have already pointed out.