Learning can be described or framed as an optimization problem, and most machine learning algorithms solve an optimization problem at their core.
The no free lunch theorem for optimization and search also applies to machine learning, specifically to supervised learning, which underlies classification and regression predictive modeling tasks.
This means that, averaged over all possible prediction problems, all machine learning algorithms are equally effective; for example, random forest performs no better than random predictions under this averaging.
So all learning algorithms are the same in that: (1) by several definitions of “average”, all algorithms have the same average off-training-set misclassification risk, (2) therefore no learning algorithm can have lower risk than another one for all f …
— The Supervised Learning No-Free-Lunch Theorems, 2002.
The theorem also has implications for the way in which algorithms are evaluated and chosen, such as whether to choose a learning algorithm via a k-fold cross-validation test harness.
… an algorithm that uses cross validation to choose amongst a prefixed set of learning algorithms does no better on average than one that does not.
— The Supervised Learning No-Free-Lunch Theorems, 2002.
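To make the scenario in the quote concrete, below is a minimal sketch of a k-fold cross-validation test harness for evaluating one candidate algorithm, assuming scikit-learn; the synthetic dataset and the choice of random forest are illustrative stand-ins for a real problem and a real candidate model.

```python
# A minimal k-fold cross-validation test harness (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic problem standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Evaluate one candidate algorithm with 10-fold cross-validation.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=cv)
print('Mean accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```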
It also has implications for common heuristics for what constitutes a “good” machine learning model, such as avoiding overfitting or choosing the simplest possible model that performs well.
Another set of examples is provided by all the heuristics that people have come up with for supervised learning: avoid overfitting, prefer simpler to more complex models, etc. [no free lunch] says that all such heuristics fail as often as they succeed.
— The Supervised Learning No-Free-Lunch Theorems, 2002.
Given that there is no single best machine learning algorithm across all possible prediction problems, there is clear motivation to continue developing new learning algorithms and to better understand the algorithms that have already been developed.
As a consequence of the no free lunch theorem, we need to develop many different types of models, to cover the wide variety of data that occurs in the real world. And for each model, there may be many different algorithms we can use to train the model, which make different speed-accuracy-complexity tradeoffs.
— Pages 24-25, Machine Learning: A Probabilistic Perspective, 2012.
It also supports the argument for testing a suite of different machine learning algorithms on a given predictive modeling problem.
The “No Free Lunch” Theorem argues that, without having substantive information about the modeling problem, there is no single model that will always do better than any other model. Because of this, a strong case can be made to try a wide variety of techniques, then determine which model to focus on.
— Pages 25-26, Applied Predictive Modeling, 2013.
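For example, a minimal spot-check of a suite of algorithms might look like the sketch below, again assuming scikit-learn; the particular models in the list and the synthetic dataset are illustrative choices, not a prescribed set.

```python
# Spot-check a suite of algorithms on one problem (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'tree': DecisionTreeClassifier(random_state=1),
    'forest': RandomForestClassifier(random_state=1),
}

# Compare mean cross-validated accuracy; which model wins is
# problem-specific, which is exactly why we try several.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print('%s: %.3f' % (name, scores.mean()))
```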
Nevertheless, as with optimization, the implications of the theorem hold only when the choice of learning algorithm incorporates zero knowledge of the problem being solved.
In practice, this is not the case, and even beginner machine learning practitioners are encouraged to review the available data in order to learn something about the problem that can be incorporated into the choice and configuration of the learning algorithm.
We may even want to take this one step further and say that learning is not possible without some prior knowledge and that data alone is not enough.
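As one small illustration of incorporating prior knowledge, the sketch below assumes scikit-learn and a hypothetical dataset where reviewing the data reveals features on very different scales; standardizing them encodes that knowledge for a scale-sensitive algorithm such as k-nearest neighbors.

```python
# Encoding one piece of prior knowledge, feature scale, in a pipeline
# (illustrative sketch; dataset and model choices are hypothetical).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X[:, 0] *= 100.0  # illustrative: one feature on a much larger scale

# Prior knowledge (features need rescaling) becomes a pipeline step.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipeline, X, y, cv=10)
print('Mean accuracy: %.3f' % scores.mean())
```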