Addressing Premature Convergence in ML

Premature convergence may be a relevant concern on any reasonably challenging optimization task.

For example, a majority of research into the field of evolutionary computation and genetic algorithms involves identifying and overcoming the premature convergence of the algorithm on an optimization task.

If selection focuses on the most-fit individuals, the selection pressure may cause premature convergence due to reduced diversity of the new populations.

— Page 139, Computational Intelligence: An Introduction, 2nd edition, 2007.

Population-based optimization algorithms, like evolutionary algorithms and swarm intelligence, often describe their dynamics in terms of the interplay between selective pressures and convergence. For example, strong selective pressures result in faster convergence and likely premature convergence. Weaker selective pressures may result in a slower convergence (greater computational cost) although perhaps locate a better or even global optima.

An operator with a high selective pressure decreases diversity in the population more rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population.

— Page 135, Computational Intelligence: An Introduction, 2nd edition, 2007.

This idea of selective pressure is helpful more generally in understanding the learning dynamics of optimization algorithms. For example, an optimization that is configured to be too greedy (e.g. via hyperparameters such as the step size or learning rate) may fail due to premature convergence, whereas the same algorithm that is configured to be less greedy may overcome premature convergence and discover a better or globally optimal solution.

Premature convergence may be encountered when using stochastic gradient descent to train a neural network model, signified by a learning curve that drops exponentially quickly then stops improving.

The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set.

— Page 153, Deep Learning, 2016.

The fact that fitting neural networks are subject to premature convergence motivates the use of methods such as learning curves to monitor and diagnose issues with the convergence of a model on a training dataset, and the use of regularization, such as early stopping, that halts the optimization algorithm prior to finding a stable point comes at the expense of worse performance on a holdout dataset.

As such, much research into deep learning neural networks is ultimately directed at overcoming premature convergence.

Empirically, it is often found that ‘tanh’ activation functions give rise to faster convergence of training algorithms than logistic functions.

— Page 127, Neural Networks for Pattern Recognition, 1995.

This includes techniques such as work on weight initialization, which is critical because the initial weights of a neural network define the starting point of the optimization process, and poor initialization can lead to premature convergence.

The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.

— Page 301, Deep Learning, 2016.

This also includes the vast number of variations and extensions of the stochastic gradient descent optimization algorithm, such as the addition of momentum so that the algorithm does not overshoot the optima (stable point), and Adam that adds an automatically adapted step size hyperparameter (learning rate) for each parameter that is being optimized, dramatically speeding up convergence.