Hyperparameter Tuning for KNORA - 1

Explore k in k-Nearest Neighbors
The configuration of the k-nearest neighbors algorithm is critical to the KNORA model, as it defines the scope of the local neighborhood (the region of competence) within which the ensemble is selected for each new example.

The k value controls the size of the neighborhood, and it is important to set it to a value that is appropriate for your dataset, specifically the density of samples in the feature space. Too small a value means that relevant examples in the training set may be excluded from the neighborhood, whereas too large a value means that the signal may be washed out by too many examples.
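To make the role of k concrete, here is a small illustrative sketch (not part of the original example, and only an approximation of what deslib does internally): it uses scikit-learn's NearestNeighbors to gather the k training examples closest to a query point, which is the kind of local region KNORA-U uses to judge which members of the pool are competent.

# illustrative only: show how a neighborhood of k training examples forms
# around a query point for different values of k
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
query = X[:1]  # pretend this is a new example to be classified (it will also be its own nearest neighbor, which is fine for illustration)
for k in (3, 10, 30):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(query)
    # the labels of the k nearest training examples define the local region
    # within which base classifier competence would be assessed
    print('k=%d, neighborhood labels: %s' % (k, y[idx[0]]))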

The code example below explores the classification accuracy of the KNORA-U algorithm with k values from 2 to 21.

# explore k in knn for KNORA-U dynamic ensemble selection
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from deslib.des.knora_u import KNORAU
from matplotlib import pyplot

# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y

# get a list of models to evaluate
def get_models():
    models = dict()
    for n in range(2, 22):
        models[str(n)] = KNORAU(k=n)
    return models

# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured neighborhood size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that accuracy increases with the neighborhood size, perhaps to k=10, where it appears to level off.

2 0.933 (0.008)
3 0.933 (0.010)
4 0.935 (0.011)
5 0.935 (0.007)
6 0.937 (0.009)
7 0.935 (0.011)
8 0.937 (0.010)
9 0.936 (0.009)
10 0.938 (0.007)
11 0.935 (0.010)
12 0.936 (0.009)
13 0.934 (0.009)
14 0.937 (0.009)
15 0.938 (0.009)
16 0.935 (0.010)
17 0.938 (0.008)
18 0.936 (0.007)
19 0.934 (0.007)
20 0.935 (0.007)
21 0.936 (0.009)
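As a small follow-up (not part of the original example), the best value of k could also be picked programmatically from the results and names lists collected in the loop above, rather than by reading the printout:

# continuation of the example above: pick the k with the highest mean accuracy
from numpy import argmax
best = argmax([mean(scores) for scores in results])
print('best k=%s with mean accuracy %.3f' % (names[best], mean(results[best])))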
A box and whisker plot is created for the distribution of accuracy scores for each configured neighborhood size.

We can see the general trend of increasing model performance with the k value before reaching a plateau.
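As a possible alternative to the manual loop (an assumption on my part, not something shown in the original example), the same search over k could likely be expressed with scikit-learn's GridSearchCV, since the deslib estimators follow the scikit-learn estimator API, as the use of cross_val_score above suggests:

# sketch: grid search k for KNORA-U; assumes KNORAU exposes k via get_params/set_params
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from deslib.des.knora_u import KNORAU

X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid = GridSearchCV(KNORAU(), param_grid={'k': list(range(2, 22))}, scoring='accuracy', cv=cv, n_jobs=-1)
grid.fit(X, y)
print('best k=%s with mean accuracy %.3f' % (grid.best_params_['k'], grid.best_score_))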