Sensitivity Analysis of Dataset Size

There are many ways to perform a sensitivity analysis, but perhaps the simplest approach is to define a test harness to evaluate model performance and then evaluate the same model on the same problem with differently sized datasets.

This will allow the train and test portions of the dataset to increase with the size of the overall dataset.
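
Because the test harness below uses 10-fold cross-validation, each train and test split is a fixed fraction of the dataset, so both grow with the number of rows. The following minimal sketch (illustrative only, not part of the worked example) confirms that the fold sizes scale with the dataset size:

# sketch: k-fold train/test split sizes scale with the overall dataset size
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

for n in [100, 1000]:
    X, y = make_classification(n_samples=n, random_state=1)
    train_ix, test_ix = next(KFold(n_splits=10).split(X))
    # roughly 90/10 rows for n=100 and 900/100 rows for n=1000
    print(n, len(train_ix), len(test_ix))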

To make the code easier to read, we will split it up into functions.

First, we can define a function that will prepare (or load) the dataset of a given size. The number of rows in the dataset is specified by an argument to the function.

If you are using this code as a template, this function can be changed to load your dataset from file and select a random sample of a given size.

# load dataset
def load_dataset(n_samples):
    # define the dataset
    X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
    return X, y
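
If your data lives in a CSV file, a hypothetical variation of this function might read the file and draw a random sample of the requested size, along the lines of the sketch below (the file name 'data.csv' and the assumption that the target is the last column are placeholders, not part of the original example):

# sketch: load a CSV file and select a random sample of a given size (illustrative only)
from pandas import read_csv

def load_dataset(n_samples):
    # load the full dataset; 'data.csv' is a placeholder path
    data = read_csv('data.csv')
    # draw a reproducible random sample of the requested size
    sample = data.sample(n=int(n_samples), random_state=1)
    # assume the last column holds the target variable
    X, y = sample.values[:, :-1], sample.values[:, -1]
    return X, y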

Next, we need a function to evaluate a model on a loaded dataset.

We will define a function that takes a dataset and returns a summary of the performance of the model evaluated using the test harness on the dataset.

This function is listed below, taking the input and output elements of a dataset and returning the mean and standard deviation of classification accuracy for a decision tree model evaluated on that dataset.

# evaluate a model
def evaluate_model(X, y):
    # define model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return summary stats
    return [scores.mean(), scores.std()]

Next, we can define a range of different dataset sizes to evaluate.

The sizes should be chosen proportional to the amount of data you have available and the amount of running time you are willing to expend.

In this case, we will keep the sizes modest to limit running time, from 50 to one million rows on a rough log10 scale.

# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

Next, we can enumerate each dataset size, create the dataset, evaluate a model on the dataset, and store the results for later analysis.

# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
    # get a dataset
    X, y = load_dataset(n_samples)
    # evaluate a model on this dataset size
    mean, std = evaluate_model(X, y)
    # store
    means.append(mean)
    stds.append(std)

Next, we can summarize the relationship between the dataset size and model performance.

In this case, we will simply plot the result with error bars so we can spot any trends visually.

We will use the standard deviation as a measure of uncertainty in the estimated model performance. Multiplying the standard deviation by 2 covers approximately 95% of the expected performance, assuming the performance follows a normal distribution.

This can be shown on the plot as an error bar around the mean expected performance for a dataset size.

# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]

# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')

To make the plot more readable, we can change the scale of the x-axis to log, given that our dataset sizes are on a rough log10 scale.

# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')

# show the plot
pyplot.show()

And that’s it.

We would generally expect mean model performance to increase with dataset size. We would also expect the uncertainty in model performance to decrease with dataset size.

Tying this all together, the complete example of performing a sensitivity analysis of dataset size on model performance is listed below.

# sensitivity analysis of model performance to dataset size
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot

# load dataset
def load_dataset(n_samples):
    # define the dataset
    X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
    return X, y

# evaluate a model
def evaluate_model(X, y):
    # define model evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    model = DecisionTreeClassifier()
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return summary stats
    return [scores.mean(), scores.std()]

# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
    # get a dataset
    X, y = load_dataset(n_samples)
    # evaluate a model on this dataset size
    mean, std = evaluate_model(X, y)
    # store
    means.append(mean)
    stds.append(std)
    # summarize performance
    print('>%d: %.3f (%.3f)' % (n_samples, mean, std))
# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]
# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')
# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')
# show the plot
pyplot.show()

Running the example reports the status along the way of dataset size vs. estimated model performance.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see the expected trend of increasing mean model performance with dataset size and decreasing model variance measured using the standard deviation of classification accuracy.

We can see that there is perhaps a point of diminishing returns in estimating model performance at around 10,000 to 50,000 rows.

Specifically, we do see an improvement in performance with more rows, but we can probably capture this relationship, with little variance, using 10,000 or 50,000 rows of data.

We can also see a drop-off in estimated performance with 1,000,000 rows of data, suggesting that we are probably maxing out the capability of the model above 100,000 rows and are instead measuring statistical noise in the estimate.

This might mean that there is an upper bound on expected performance, and that more data beyond this point is unlikely to improve the specific model and configuration on the chosen test harness.

>50: 0.673 (0.141)
>100: 0.703 (0.135)
>500: 0.809 (0.055)
>1000: 0.826 (0.044)
>5000: 0.835 (0.016)
>10000: 0.866 (0.011)
>50000: 0.900 (0.005)
>100000: 0.912 (0.003)
>500000: 0.938 (0.001)
>1000000: 0.936 (0.001)
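
To make the diminishing-returns observation concrete, the per-step gain in mean accuracy can be computed directly from the results above. The short sketch below (not part of the original listing) reuses the reported mean scores:

# sketch: gain in mean accuracy between consecutive dataset sizes, using the scores reported above
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
means = [0.673, 0.703, 0.809, 0.826, 0.835, 0.866, 0.900, 0.912, 0.938, 0.936]
for i in range(1, len(sizes)):
    print('%d -> %d: %+.3f' % (sizes[i - 1], sizes[i], means[i] - means[i - 1]))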