What is batch normalization, and why does it work?

Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training as the parameters of the previous layers change (often called internal covariate shift).

The goal of batch normalization is to normalise each layer's inputs so that they have a mean of zero and a standard deviation of one.

This is done per mini-batch at each layer, i.e. the mean and variance are computed over that mini-batch alone and then used to normalise it.

This is similar to the way network inputs are standardised.
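For concreteness, here is a minimal sketch of the per-mini-batch computation, assuming the layer's inputs are arranged as a (batch_size, num_features) NumPy array; the names batch_norm, gamma, beta, and eps are illustrative, with gamma and beta standing in for the learnable scale and shift that batch normalization applies after normalising.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch per feature, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta               # scaled and shifted output

# Example usage: a mini-batch of 32 examples with 4 features each.
x = np.random.randn(32, 4) * 5.0 + 3.0        # inputs with arbitrary mean and scale
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))      # approximately zeros and ones
```

In practice, frameworks also keep running averages of the mini-batch means and variances during training so they can be used at inference time, when no mini-batch statistics are available.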