In this section, we will explore how to develop a greedy ensemble growing algorithm from scratch.

The structure of greedily growing an ensemble is much like greedy pruning of members, only in reverse. We start with an empty ensemble and add the single model that performs best on its own. Further models are then added one at a time, and only if they lift performance over the ensemble as it stood before the addition.

Much of the code is the same as the pruning case so we can focus on the differences.

First, we must define a function to perform one round of growing the ensemble. This involves enumerating all candidate models that could be added and evaluating the effect of adding each in turn to the ensemble. The single model that results in the biggest improvement is then returned by the function, along with its score.

The grow_round() function below implements this behavior.

# perform a single round of growing the ensemble
def grow_round(models_in, models_candidate, X, y):
    # establish a baseline
    baseline = evaluate_ensemble(models_in, X, y)
    best_score, addition = baseline, None
    # enumerate adding each candidate and see if we can improve performance
    for m in models_candidate:
        # copy the list of chosen models
        dup = models_in.copy()
        # add the candidate
        dup.append(m)
        # evaluate the new ensemble
        result = evaluate_ensemble(dup, X, y)
        # check for a new best
        if result > best_score:
            # store the new best
            best_score, addition = result, m
    return best_score, addition
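To see the selection step in isolation, the sketch below runs grow_round() against a stand-in evaluate_ensemble() whose scores are hard-coded placeholders (not real cross-validation results); the model objects are None because only the names matter for this demonstration.

```python
# demonstrate one round of greedy selection with a stub scorer
# NOTE: the scores below are made-up placeholders, not CV results
def evaluate_ensemble(models, X, y):
    scores = {frozenset(): 0.0, frozenset(['svm']): 0.953,
              frozenset(['knn']): 0.930, frozenset(['tree']): 0.880}
    return scores.get(frozenset(n for n, _ in models), 0.900)

# same logic as the grow_round() function above
def grow_round(models_in, models_candidate, X, y):
    baseline = evaluate_ensemble(models_in, X, y)
    best_score, addition = baseline, None
    for m in models_candidate:
        dup = models_in.copy()
        dup.append(m)
        result = evaluate_ensemble(dup, X, y)
        if result > best_score:
            best_score, addition = result, m
    return best_score, addition

# first round: starting from an empty ensemble, the strongest single model wins
candidates = [('lr', None), ('knn', None), ('tree', None), ('svm', None)]
score, addition = grow_round([], candidates, None, None)
print('%.3f %s' % (score, addition[0]))  # -> 0.953 svm
```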


Next, we need a function to drive the growing procedure.

This involves a loop that runs rounds of growing until no addition results in an improvement in performance. After each successful addition, the chosen model is moved from the candidate list into the ensemble, and the current members are reported along with the performance.

The grow_ensemble() function implements this and returns the list of models greedily determined to result in the best performance along with the expected mean accuracy.

# grow an ensemble from scratch
def grow_ensemble(models, X, y):
    best_score, best_list = 0.0, list()
    # grow the ensemble until no further improvement
    while True:
        # add one model to the ensemble
        score, addition = grow_round(best_list, models, X, y)
        # check for no improvement
        if addition is None:
            print('>no further improvement')
            break
        # keep track of the best score
        best_score = score
        # remove the new model from the list of candidates
        models.remove(addition)
        # add the new model to the list of models in the ensemble
        best_list.append(addition)
        # report results along the way
        names = ','.join([n for n,_ in best_list])
        print('>%.3f (%s)' % (score, names))
    return best_score, best_list
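Before running the full cross-validated example, the driver loop can be sanity-checked with a stub scorer: the hard-coded placeholder scores below (loosely echoing the results reported later) stand in for evaluate_ensemble(), and the loop should keep adding models until no candidate helps.

```python
# dry run of the growing loop with a stub scorer
# NOTE: the scores below are made-up placeholders, not CV results
SCORES = {frozenset(): 0.0, frozenset(['svm']): 0.953,
          frozenset(['knn']): 0.930, frozenset(['tree']): 0.880,
          frozenset(['svm', 'knn']): 0.956}

def evaluate_ensemble(models, X, y):
    return SCORES.get(frozenset(n for n, _ in models), 0.900)

# condensed version of the grow_round() function above
def grow_round(models_in, models_candidate, X, y):
    best_score, addition = evaluate_ensemble(models_in, X, y), None
    for m in models_candidate:
        result = evaluate_ensemble(models_in + [m], X, y)
        if result > best_score:
            best_score, addition = result, m
    return best_score, addition

# condensed version of the grow_ensemble() driver above
def grow_ensemble(models, X, y):
    best_score, best_list = 0.0, list()
    while True:
        score, addition = grow_round(best_list, models, X, y)
        if addition is None:
            break
        best_score = score
        models.remove(addition)
        best_list.append(addition)
    return best_score, best_list

score, chosen = grow_ensemble([('knn', None), ('tree', None), ('svm', None)], None, None)
print('%.3f %s' % (score, [n for n, _ in chosen]))  # -> 0.956 ['svm', 'knn']
```

Note that the loop stops as soon as a round returns no addition, so the tree model (which never improves the stub score) is left out.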


Tying this together, the complete example of greedy ensemble growing on the synthetic binary classification dataset is listed below.

# example of ensemble growing for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, n_redundant=10, random_state=1)
    return X, y

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(('lr', LogisticRegression()))
    models.append(('knn', KNeighborsClassifier()))
    models.append(('tree', DecisionTreeClassifier()))
    models.append(('nb', GaussianNB()))
    models.append(('svm', SVC(probability=True)))
    return models

# evaluate a list of models
def evaluate_ensemble(models, X, y):
    # check for no models
    if len(models) == 0:
        return 0.0
    # create the ensemble
    ensemble = VotingClassifier(estimators=models, voting='soft')
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the ensemble
    scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return the mean score
    return mean(scores)

# perform a single round of growing the ensemble
def grow_round(models_in, models_candidate, X, y):
    # establish a baseline
    baseline = evaluate_ensemble(models_in, X, y)
    best_score, addition = baseline, None
    # enumerate adding each candidate and see if we can improve performance
    for m in models_candidate:
        # copy the list of chosen models
        dup = models_in.copy()
        # add the candidate
        dup.append(m)
        # evaluate the new ensemble
        result = evaluate_ensemble(dup, X, y)
        # check for a new best
        if result > best_score:
            # store the new best
            best_score, addition = result, m
    return best_score, addition

# grow an ensemble from scratch
def grow_ensemble(models, X, y):
    best_score, best_list = 0.0, list()
    # grow the ensemble until no further improvement
    while True:
        # add one model to the ensemble
        score, addition = grow_round(best_list, models, X, y)
        # check for no improvement
        if addition is None:
            print('>no further improvement')
            break
        # keep track of the best score
        best_score = score
        # remove the new model from the list of candidates
        models.remove(addition)
        # add the new model to the list of models in the ensemble
        best_list.append(addition)
        # report results along the way
        names = ','.join([n for n,_ in best_list])
        print('>%.3f (%s)' % (score, names))
    return best_score, best_list

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# grow the ensemble
score, model_list = grow_ensemble(models, X, y)
names = ','.join([n for n,_ in model_list])
print('Models: %s' % names)
print('Final Mean Accuracy: %.3f' % score)



Running the example incrementally adds one model at a time to the ensemble and reports the mean classification accuracy of the resulting ensemble at each step.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that ensemble growing found the same solution as greedy ensemble pruning: an ensemble of SVM and KNN achieved a mean classification accuracy of about 95.6 percent, an improvement over any single standalone model and over combining all of the models.

>0.953 (svm)
>0.956 (svm,knn)
>no further improvement

Models: svm,knn

Final Mean Accuracy: 0.956

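Growing only selects the members; to actually use the grown ensemble on new data, the chosen (name, model) pairs can be passed to a VotingClassifier and fit on all available data. The sketch below hard-codes the svm/knn result reported above rather than re-running the growing procedure, and uses a smaller sample size to keep it quick.

```python
# fit the final grown ensemble and make predictions
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# same generator as the example, with fewer samples for speed
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)
# the members found by greedy growing above
model_list = [('svm', SVC(probability=True)), ('knn', KNeighborsClassifier())]
ensemble = VotingClassifier(estimators=model_list, voting='soft')
ensemble.fit(X, y)
# predict the class for the first few rows
yhat = ensemble.predict(X[:3])
print(yhat.shape)
```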