Combining Classification Predictions

Classification refers to predictive modeling problems that involve predicting a class label given an input.

The prediction made by a model may be a crisp class label directly or may be a probability that an example belongs to each class, referred to as the probability of class membership.

The performance of a classification model is often measured using accuracy or a related count or ratio of correct predictions. Predicted probabilities can be evaluated by first converting them to crisp class labels using a cut-off threshold, or evaluated directly using specialized metrics such as cross-entropy.
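For example, here is a minimal sketch in plain Python of mapping predicted probabilities to crisp class labels; the probabilities are invented for illustration, using an argmax for the multi-class case and a cut-off threshold for the binary case:

```python
# multi-class case: select the class with the largest predicted probability
probs = [0.2, 0.7, 0.1]  # invented probabilities for classes 0, 1, and 2
label = probs.index(max(probs))
print(label)  # -> 1

# binary case: apply a cut-off threshold to the positive-class probability
positive_prob = 0.3
threshold = 0.5
label = 1 if positive_prob >= threshold else 0
print(label)  # -> 0
```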

Combining Predicted Class Labels

A predicted class label is often mapped to something meaningful to the problem domain.

For example, a model may predict a color such as “red” or “green”. Internally, though, the model predicts a numerical representation of the class label, such as 0 for “red”, 1 for “green”, and 2 for “blue” in our color classification example.

Methods for combining class labels are perhaps easier to consider if we work with the integer encoded class labels directly.

Perhaps the simplest, most common, and often most effective approach is to combine the predictions by voting.

Voting is the most popular and fundamental combination method for nominal outputs.

— Page 71, Ensemble Methods, 2012.

Voting generally involves each model casting a vote for the class label it predicted. The votes are tallied, and an outcome is then chosen based on the tallies in some way.

There are many types of voting, so let’s look at the four most common:

  • Plurality Voting.
  • Majority Voting.
  • Unanimous Voting.
  • Weighted Voting.

Simple voting, called plurality voting, selects the class label with the most votes.

If two or more classes have the same number of votes, the tie must be broken, ideally in a consistent manner, such as sorting the tied class labels and selecting the first, instead of selecting one randomly. This is important so that the same models with the same data always make the same prediction.

Because ties are possible, it is common to use an odd number of ensemble members in an attempt to avoid ties automatically, as ties are more likely with an even number of members.

From a statistical perspective, the plurality vote is the mode, or most common value, of the collection of predictions.

For example, consider the predictions made by three models for a three-class color prediction problem:

  • Model 1 predicts “green” or 1.
  • Model 2 predicts “green” or 1.
  • Model 3 predicts “red” or 0.

The votes are, therefore:

  • Red Votes: 1
  • Green Votes: 2
  • Blue Votes: 0

The prediction would be “green” given it has the most votes.
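As a concrete sketch, plurality voting with the consistent tie-breaking described above can be implemented in a few lines of Python; the predictions mirror the worked example (0 for “red”, 1 for “green”, 2 for “blue”):

```python
from collections import Counter

def plurality_vote(predictions):
    # tally the votes for each class label
    counts = Counter(predictions)
    max_votes = max(counts.values())
    # collect every label tied for the most votes
    winners = [label for label, votes in counts.items() if votes == max_votes]
    # break ties consistently: sort the tied labels and select the first,
    # so the same inputs always yield the same prediction
    return sorted(winners)[0]

# predictions from the worked example: 0=red, 1=green, 2=blue
predictions = [1, 1, 0]  # model 1, model 2, model 3
print(plurality_vote(predictions))  # -> 1, i.e. "green"
```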

Majority voting selects the class label that has more than half the votes. If no class has more than half the votes, then a “no prediction” is made. Interestingly, majority voting can be proven to be an optimal method for combining classifiers, if they are independent.

If the classifier outputs are independent, then it can be shown that majority voting is the optimal combination rule.

— Page 1, Ensemble Machine Learning, 2012.
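A minimal sketch of majority voting might look as follows, returning None to represent “no prediction” when no class receives more than half the votes:

```python
from collections import Counter

def majority_vote(predictions):
    # find the label with the most votes
    label, votes = Counter(predictions).most_common(1)[0]
    # require more than half of all votes, otherwise make no prediction
    return label if votes > len(predictions) / 2 else None

print(majority_vote([1, 1, 0]))  # -> 1 (two of three votes is a majority)
print(majority_vote([1, 2, 0]))  # -> None (no class has a majority)
```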

Unanimous voting is related to majority voting, except that instead of requiring more than half the votes, it requires all models to predict the same value; otherwise, no prediction is made.
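A corresponding sketch of unanimous voting, again using None for “no prediction”:

```python
def unanimous_vote(predictions):
    # predict only when every model agrees, otherwise make no prediction
    return predictions[0] if len(set(predictions)) == 1 else None

print(unanimous_vote([1, 1, 1]))  # -> 1
print(unanimous_vote([1, 1, 0]))  # -> None
```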

Weighted voting weights the prediction made by each model in some way. One example is to weight predictions based on the average performance of each model, such as its classification accuracy.

The weight of each classifier can be set proportional to its accuracy performance on a validation set.

— Page 67, Pattern Classification Using Ensemble Methods, 2010.

Assigning weights to classifiers can become a project in and of itself and could involve using an optimization algorithm and a holdout dataset, a linear model, or even another machine learning model entirely.

So, how do we assign the weights? If we knew, a priori, which classifiers would work better, we would only use those classifiers. In the absence of such information, a plausible and commonly used strategy is to use the performance of a classifier on a separate validation (or even training) dataset, as an estimate of that classifier’s generalization performance.

— Page 8, Ensemble Machine Learning, 2012.

The idea of weighted voting is that some classifiers are more likely to be accurate than others and we should reward them by giving them a larger share of the votes.
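As an illustration, here is a minimal sketch of weighted voting in which each model's vote is scaled by an invented accuracy score standing in for its performance on a validation set:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    # each model contributes its weight to the class label it predicted
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # select the label with the largest total weight, sorting first so
    # that any ties are broken consistently
    return max(sorted(totals), key=lambda label: totals[label])

predictions = [1, 1, 0]     # votes from models 1, 2, and 3
weights = [0.6, 0.55, 0.9]  # hypothetical validation-set accuracies
print(weighted_vote(predictions, weights))  # -> 1 (1.15 vs. 0.9 for class 0)
```

Note that with these hypothetical weights, the two weaker models still outvote the single strong model, since their combined weight (1.15) exceeds its weight (0.9).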