Stochastic Gradient Descent

The word ‘**stochastic**’ refers to a process governed by random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole dataset for each iteration. Suppose your dataset contains one million samples: with the typical Gradient Descent optimization technique, you must process all one million samples to complete a single iteration, and this repeats for every iteration until the minimum is reached. This makes the method computationally very expensive.
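To make the cost concrete, here is a minimal sketch of ordinary (full-batch) gradient descent on a toy least-squares problem. The data, learning rate, and iteration count are illustrative choices, not from the text; the point is that every sample contributes to every single update.

```python
import numpy as np

# Toy problem (hypothetical): fit y = 2x + 1 with a weight and a bias.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1_000)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=1_000)

w, b = 0.0, 0.0
lr = 0.5

for _ in range(200):
    # Full-batch gradient: ALL 1,000 samples are used for one update.
    err = w * X + b - y
    grad_w = 2.0 * np.mean(err * X)
    grad_b = 2.0 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b
```

With a million samples instead of a thousand, each of those 200 updates would touch the entire dataset, which is exactly the expense SGD avoids.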

This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample, i.e., a batch size of one, to perform each iteration. The dataset is randomly shuffled, and a sample is selected for each update.

Basically, instead of summing over all the training examples and then computing the gradient, as in ordinary gradient descent, stochastic gradient descent computes the gradient and updates the weights for each individual training example. You can think of this as taking a small step down the slope at each training example, which is far more efficient per update than ordinary gradient descent (which is not practical for large datasets). Here is pseudo-code to illustrate this:

```
while (iter < maxIter)
{
    shuffle training examples
    for i = 1 ... # of training examples
    {
        compute gradient on example i
        update weights
    }
    iter++
}
```
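The pseudo-code above can be sketched as runnable Python on the same kind of toy least-squares problem; the data and the learning rate are illustrative assumptions. Note the shuffle at the start of each pass and the weight update after every single example.

```python
import numpy as np

# Toy problem (hypothetical): fit y = 2x + 1, updating on ONE sample at a time.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=500)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, size=500)

w, b = 0.0, 0.0
lr = 0.05
max_iter = 20  # "maxIter" in the pseudo-code: passes over the data

for _ in range(max_iter):
    for i in rng.permutation(len(X)):   # random shuffle each pass
        # Gradient of the squared error on example i only.
        err = w * X[i] + b - y[i]
        w -= lr * 2.0 * err * X[i]
        b -= lr * 2.0 * err
```

Each update here costs O(1) in the dataset size, so the weights start moving after the very first sample instead of after a full pass.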

Note that stochastic gradient descent is, as the name suggests, stochastic: the point where it ends up depends on the random shuffling of the samples and on the weight initialization. Because each single-sample gradient is a noisy estimate of the true gradient, SGD usually needs more iterations than batch gradient descent to settle near a minimum, but each iteration is far cheaper, so it reaches a good solution much faster in practice.