Predictive modeling problems involve making a prediction from an example of input data.
A regression problem requires predicting a numeric quantity, whereas a classification problem requires predicting a class label.
The problem of predictive modeling is sufficiently challenging that we cannot write code to make predictions. Instead, we must use a learning algorithm applied to historical data to learn a “program” called a predictive model that we can use to make predictions on new data.
In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (f) given examples of input data (X) and associated output data (y).
- y = f(X)
Given new examples of input (Xhat), we must map each example onto the expected output value (yhat) using our learned function (fhat).
- yhat = fhat(Xhat)
The learned mapping will be imperfect. No model is perfect, and some prediction error is expected given the difficulty of the problem, noise in the observed data, and the choice of learning algorithm.
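This learn-then-predict workflow can be sketched in a few lines of Python. The data-generating line y = 2x + 1, the noise level, and the use of a degree-1 polynomial fit as a stand-in for the learning algorithm are all assumptions made purely for illustration:

```python
import numpy as np

# Historical data: inputs X and outputs y, assumed here to follow
# y = 2x + 1 plus observation noise (an illustrative choice)
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=50)
y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=50)

# "Learn" fhat from (X, y): a straight-line fit stands in for the learning algorithm
slope, intercept = np.polyfit(X, y, deg=1)

def fhat(Xhat):
    # The learned mapping: approximate, not exact, because of the noise
    return slope * Xhat + intercept

# Map new, unseen inputs onto predicted outputs: yhat = fhat(Xhat)
Xhat = np.array([2.0, 5.0])
yhat = fhat(Xhat)
```

The learned `slope` and `intercept` will be close to, but not exactly, the true values of 2 and 1, which is the imperfection described above.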
Mathematically, learning algorithms solve the problem of approximating the mapping function by solving a function optimization problem.
Specifically, given examples of inputs and outputs, find the set of inputs to the mapping function (here, the model's parameters, not the data) that results in the minimum loss, minimum cost, or minimum prediction error.
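To make "searching the function's inputs for the minimum loss" concrete, here is a deliberately generic gradient-descent search over the two coefficients of a toy linear mapping. The dataset, learning rate, and iteration count are arbitrary illustrative choices; as discussed below, better-suited solvers exist for specific model types:

```python
import numpy as np

# Toy dataset generated (noiselessly) from the mapping y = 3x - 2
X = np.linspace(0, 1, 20)
y = 3.0 * X - 2.0

# The "inputs to the mapping function" being optimized: the two coefficients
slope, intercept = 0.0, 0.0
lr = 0.5  # step size, chosen small enough for stable convergence

for _ in range(2000):
    yhat = slope * X + intercept
    err = yhat - y
    # Gradients of the mean squared error with respect to each coefficient
    slope -= lr * 2.0 * np.mean(err * X)
    intercept -= lr * 2.0 * np.mean(err)
```

After the loop, `slope` and `intercept` have converged to the minimum-loss point (3 and -2), recovering the mapping that generated the data.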
The more biased or constrained the choice of mapping function, the easier the optimization is to solve.
Let’s look at some examples to make this clear.
A linear regression (for regression problems) is a highly constrained model and can be solved analytically using linear algebra. The inputs to the mapping function are the coefficients of the model.
We can use an optimization algorithm, like a quasi-Newton local search algorithm, but it will almost always be less efficient than the analytical solution.
- Linear Regression: Function inputs are model coefficients; the optimization problem can be solved analytically.
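A minimal sketch of the analytical route, assuming a noiseless toy dataset with made-up coefficient values: the least-squares coefficients fall directly out of the normal equations via linear algebra, with no iterative search at all:

```python
import numpy as np

# Toy data: two features plus an intercept; the "true" coefficients
# (1.5, -0.5) and intercept 2.0 are assumptions for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_coef = np.array([1.5, -0.5])
y = X @ true_coef + 2.0

# Design matrix with a bias column for the intercept
A = np.column_stack([X, np.ones(len(X))])

# Closed-form least-squares solution from the normal equations:
# coefficients = (A^T A)^{-1} A^T y
coef = np.linalg.solve(A.T @ A, A.T @ y)
```

In practice `np.linalg.lstsq` is the more numerically robust routine for this, but the normal equations make the "solved analytically" point explicit.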
A logistic regression (for classification problems) is slightly less constrained and must be solved as an optimization problem, although the structure of the problem is known to be convex given the constraints imposed by the model.
This means a local search algorithm like a quasi-Newton method can be used. We could use a simpler first-order method like stochastic gradient descent, but it will almost always be less efficient than a quasi-Newton method on a problem of this kind.
- Logistic Regression: Function inputs are model coefficients; the optimization problem requires an iterative local search algorithm.
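A sketch of the iterative route, assuming a toy two-feature dataset: the logistic negative log-likelihood (with a small L2 penalty added so the optimum stays finite on separable data) is handed to SciPy's BFGS quasi-Newton routine, which iterates to the minimum:

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary classification data; the labeling rule x0 - x1 > 0
# is an assumption for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

def loss(w):
    # Negative log-likelihood of the logistic model P(y=1) = sigmoid(X @ w),
    # written with logaddexp for numerical stability, plus a small L2 penalty
    z = X @ w
    nll = np.mean(np.logaddexp(0.0, z) - y * z)
    return nll + 0.1 * np.dot(w, w)

# BFGS: an iterative quasi-Newton local search over the coefficients
result = minimize(loss, x0=np.zeros(2), method="BFGS")
w = result.x
```

The learned weights point along the true decision boundary, so predictions `X @ w > 0` recover the labels.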
A neural network model is a very flexible learning algorithm that imposes few constraints. The inputs to the mapping function are the network weights. There is no analytical solution, and a deterministic local search algorithm can easily become trapped, given the search space is multimodal and highly nonlinear; instead, a stochastic search algorithm must be used.
Stochastic gradient descent is commonly used, and the updates are made in a way that is aware of the structure of the model (backpropagation and the chain rule). We could use a global search algorithm that is oblivious to the structure of the model, like a genetic algorithm, but it will almost always be less efficient.
- Neural Network: Function inputs are model weights; the optimization problem requires an iterative stochastic search algorithm.
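The following sketch shows stochastic gradient descent with hand-written backpropagation on a tiny one-hidden-layer network. The regression task (fit y = x² on [-1, 1]), the layer sizes, learning rate, and epoch count are all arbitrary illustrative choices:

```python
import numpy as np

# Toy regression task: fit y = x^2 on [-1, 1] (an illustrative choice)
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(256, 1))
y = X ** 2

# The "inputs to the mapping function": the network weights and biases
W1 = rng.normal(scale=0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
lr = 0.1

def forward(Xb):
    h = np.tanh(Xb @ W1 + b1)      # hidden activations
    return h, h @ W2 + b2          # prediction

mse_before = np.mean((forward(X)[1] - y) ** 2)

for epoch in range(500):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), 32):        # stochastic: small mini-batches
        batch = idx[start:start + 32]
        Xb, yb = X[batch], y[batch]
        h, pred = forward(Xb)
        # Backpropagation: the chain rule applied layer by layer
        grad_pred = 2.0 * (pred - yb) / len(batch)
        grad_W2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_h = grad_pred @ W2.T
        grad_z = grad_h * (1 - h ** 2)        # derivative of tanh
        grad_W1 = Xb.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        # Gradient descent update on every weight
        W2 -= lr * grad_W2; b2 -= lr * grad_b2
        W1 -= lr * grad_W1; b1 -= lr * grad_b1

mse_after = np.mean((forward(X)[1] - y) ** 2)
```

The structure-aware updates (gradients computed through the network by the chain rule) are what distinguish this from a structure-oblivious search such as a genetic algorithm over the same weights.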
We can see that each algorithm makes different assumptions about the form of the mapping function, which influences the type of optimization problem to be solved.
We can also see that the default optimization algorithm used for each machine learning algorithm is not arbitrary; it represents the most efficient algorithm for solving the specific optimization problem framed by the algorithm, e.g. stochastic gradient descent for neural nets instead of a genetic algorithm. Deviating from these defaults requires a good reason.
Not all machine learning algorithms solve an optimization problem. A notable example is the k-nearest neighbors algorithm that stores the training dataset and does a lookup for the k best matches to each new example in order to make a prediction.
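A minimal sketch of that store-and-lookup behavior, assuming a tiny made-up two-class dataset: there are no parameters to optimize, only distances to compute at prediction time:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # No training step: the "model" is the stored dataset itself
    # Euclidean distance from the new example to every stored example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k best matches
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # majority vote

# Illustrative dataset: two clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

pred = knn_predict(X_train, y_train, np.array([0.95, 1.0]))
```

The new point sits next to the label-1 cluster, so the majority vote among its three nearest neighbors returns 1.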