Baseline Models and Voting

Before we dive into developing growing and pruning ensembles, let’s first establish a dataset and baseline.

We will use a synthetic binary classification problem as the basis for this investigation, defined by the make_classification() function with 5,000 examples and 20 numerical input features.

The example below defines the dataset and summarizes its size.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset in a repeatable manner and reports the number of rows and input features, matching our expectations.

(5000, 20) (5000,)
Next, we can choose some candidate models that will provide the basis for our ensemble.

We will use five standard machine learning models: logistic regression, naive Bayes, a decision tree, a support vector machine, and k-nearest neighbors.

First, we can define a function that will create each model with default hyperparameters. Each model will be defined as a tuple with a name and the model object, then added to a list. This is a helpful structure both for enumerating the models with their names for standalone evaluation and for later use in an ensemble.

The get_models() function below implements this and returns the list of models to consider.

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('tree', DecisionTreeClassifier()))
    models.append(('nb', GaussianNB()))
    models.append(('svm', SVC(probability=True)))
    return models
We can then define a function that takes a single model and the dataset and evaluates the performance of that model on the dataset. We will evaluate each model using repeated stratified k-fold cross-validation with 10 folds and three repeats, which gives a more reliable estimate of performance than a single train/test split.

The evaluate_model() function below implements this and returns a list of scores across all folds and repeats.

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    # define the model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
We can then create the list of models and enumerate them, reporting the performance of each on the synthetic dataset in turn.

Tying this together, the complete example is listed below.

# evaluate standard models on the synthetic dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from matplotlib import pyplot

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
    return X, y

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('tree', DecisionTreeClassifier()))
    models.append(('nb', GaussianNB()))
    models.append(('svm', SVC(probability=True)))
    return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    # define the model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models:
    # evaluate model
    scores = evaluate_model(model, X, y)
    # store results
    results.append(scores)
    names.append(name)
    # summarize result
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example evaluates each standalone machine learning algorithm on the synthetic binary classification dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that both the KNN and SVM models perform the best on this dataset, achieving a mean classification accuracy of about 95.3 percent.

These results provide a baseline in performance that we require any ensemble to exceed in order to be considered useful on this dataset.

>lr 0.856 (0.014)
>knn 0.953 (0.008)
>tree 0.867 (0.014)
>nb 0.847 (0.021)
>svm 0.953 (0.010)
A figure is created showing box and whisker plots of the distribution of accuracy scores for each algorithm.

We can see that the KNN and SVM algorithms perform much better than the other algorithms, although all of the algorithms are skillful and likely make different kinds of errors. This diversity may make them good candidates to consider in an ensemble.

Box and Whisker Plots of Classification Accuracy for Standalone Machine Learning Models

Next, we need to establish a baseline ensemble that uses all models. This will provide a point of comparison with growing and pruning methods that seek better performance with a smaller subset of models.

In this case, we will use a voting ensemble with soft voting. This means that each model will predict class membership probabilities, and the ensemble will sum these probabilities across members and choose the class with the largest total as the final prediction for each input sample.

This can be achieved using the VotingClassifier class where the members are set via the “estimators” argument, which expects a list of models where each model is a tuple with a name and configured model object, just as we defined in the previous section.

We can then set the type of voting to perform via the “voting” argument, which in this case is set to “soft.”

# create the ensemble
ensemble = VotingClassifier(estimators=models, voting='soft')
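To make the soft voting mechanics concrete, the minimal sketch below (an illustration under the assumption of uniform member weights, not the internals of the VotingClassifier class) fits two of the candidate models, sums their predicted class probabilities, and takes the class with the largest total. With uniform weights, summing and averaging the probabilities select the same class, so this matches what soft voting computes.

# illustrative sketch of soft voting: sum member probabilities and take the argmax
from numpy import argmax
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# fit two example members
members = [KNeighborsClassifier().fit(X, y), SVC(probability=True).fit(X, y)]
# sum predicted class probabilities across members for the first five rows
summed = sum(model.predict_proba(X[:5]) for model in members)
# the class with the largest summed probability is the soft vote
print('soft vote predictions:', argmax(summed, axis=1))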
Tying this together, the example below evaluates a voting ensemble of all five models on the synthetic binary classification dataset.

# example of a voting ensemble with soft voting of ensemble members
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
    return X, y

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('tree', DecisionTreeClassifier()))
    models.append(('nb', GaussianNB()))
    models.append(('svm', SVC(probability=True)))
    return models

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# create the ensemble
ensemble = VotingClassifier(estimators=models, voting='soft')
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the ensemble
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Running the example evaluates the soft voting ensemble of all models using repeated stratified k-fold cross-validation and reports the mean accuracy across all folds and repeats.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the voting ensemble achieved a mean accuracy of about 92.8 percent. This is lower than the best standalone models, SVM and KNN, which each achieved an accuracy of about 95.3 percent.

This result highlights that a simple voting ensemble of all models results in a model with higher complexity and worse performance in this case. Perhaps we can find a subset of members that performs better than any single model and has lower complexity than simply using all models.

Mean Accuracy: 0.928 (0.012)
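As a hint of what the growing and pruning methods will do more systematically, the sketch below evaluates a soft voting ensemble built from just the two strongest standalone members, KNN and SVM, using the same evaluation procedure. The choice of this particular subset is an assumption for illustration only, not part of the baseline procedure.

# evaluate a soft voting ensemble of a hand-picked subset (knn and svm only)
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# hand-picked subset of the strongest standalone members
subset = [('knn', KNeighborsClassifier()), ('svm', SVC(probability=True))]
ensemble = VotingClassifier(estimators=subset, voting='soft')
# same evaluation procedure as before
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Subset Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))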