How can you choose a classifier based on training set size?

priyanka-gaikwad-9f6e5281 · 11 August 2020 12:26

Choosing a classifier

ruble-joseph · 14 August 2020 08:18

If the training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to overfit.

If the training set is large, lower bias / higher variance models (e.g. Logistic Regression, which makes weaker assumptions than Naive Bayes) tend to perform better because they can capture more complex relationships.
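One way to check this rule of thumb on your own data is to compare held-out accuracy at several training-set sizes. Below is a minimal sketch, assuming scikit-learn is installed; the synthetic dataset, the chosen sizes and the `results` dictionary are illustrative assumptions, not something from this thread, and which model wins at which size will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic data, assumed purely for illustration.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, n_redundant=5, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Absolute training-set sizes; each must fit inside one CV training fold (1600 rows).
train_sizes = [20, 50, 200, 1000]

results = {}
for name, model in models.items():
    sizes, _, test_scores = learning_curve(
        model, X, y, train_sizes=train_sizes, cv=5, scoring="accuracy",
        shuffle=True, random_state=0,
    )
    # Average the held-out accuracy over the 5 folds for each training size.
    results[name] = dict(zip(sizes.tolist(), test_scores.mean(axis=1).tolist()))

for name, curve in results.items():
    print(name, {n: round(acc, 3) for n, acc in curve.items()})
```

Swapping in your own `X` and `y` gives a quick picture of where, if anywhere, the two curves cross for your problem.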

Unsupervised Learning

pallav-goswami · 10 February 2022 17:24

To elaborate on what Mr. Ruble has mentioned:
Choosing a classification algorithm, or any algorithm in supervised machine learning for that matter, comes down to the bias-variance tradeoff, which is a central issue in the field.

Bias is the error that comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Whereas,
Variance is the error that comes from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).
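To make the two definitions concrete, here is a small simulation sketch using only the Python standard library. It is a regression toy rather than a classifier, chosen for simplicity: many small training sets are drawn from a noisy sine curve, and two deliberately extreme models are refit each time. The target function, noise level, test point and function names are all assumptions made for illustration.

```python
import math
import random
import statistics

def true_f(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    # High-bias model: ignores x0 and always predicts the training mean.
    return statistics.fmean(ys)

def predict_1nn(xs, ys, x0):
    # High-variance model: copies the target of the single nearest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias2_and_variance(predict, x0=0.25, trials=500, seed=0):
    # Refit the model on many independent training sets and look at its
    # predictions at one fixed test point x0.
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(trials)]
    avg = statistics.fmean(preds)
    bias2 = (avg - true_f(x0)) ** 2                   # squared gap between average prediction and truth
    variance = statistics.pvariance(preds, mu=avg)    # spread of predictions across training sets
    return bias2, variance

mean_bias2, mean_var = bias2_and_variance(predict_mean)
nn_bias2, nn_var = bias2_and_variance(predict_1nn)
print(f"mean model: bias^2={mean_bias2:.3f} variance={mean_var:.3f}")
print(f"1-NN model: bias^2={nn_bias2:.3f} variance={nn_var:.3f}")
```

The constant model barely changes from one training set to the next but is far from the truth on average, while the nearest-neighbour model is close on average but swings with the noise in whichever point happens to be nearest.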

So, under similar conditions, the expected behavior of some common classification algorithms is:

Algorithm	Bias	Variance
Naive Bayes	High Bias	Low Variance
Logistic Regression	Low Bias (relative to Naive Bayes)	High Variance (relative to Naive Bayes)
Decision Tree	Low Bias	High Variance
Bagging	Low Bias	High Variance, but lower than a single Decision Tree
Random Forest	Low Bias	High Variance, but lower than both Decision Tree and Bagging
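The Decision Tree and Bagging rows of the table can be illustrated with a rough standard-library sketch. It assumes a 1-D toy problem with noisy labels. In one dimension a fully grown tree that splits at midpoints predicts the same label as a nearest-neighbour lookup, so the lookup is used here as a stand-in for the tree. Random Forest is not covered, since feature subsampling is meaningless with a single feature. All names and settings are hypothetical.

```python
import random
import statistics

def sample_training_set(rng, n=30, p_flip=0.2):
    # True label is 1 when x > 0.5; each label is then flipped with probability p_flip.
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < p_flip) for x in xs]
    return xs, ys

def tree_predict(xs, ys, x0):
    # Stand-in for a fully grown 1-D tree: label of the nearest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bagged_predict(xs, ys, x0, rng, n_trees=25):
    # Fit one "tree" per bootstrap resample and average the votes.
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        votes.append(tree_predict([xs[i] for i in idx], [ys[i] for i in idx], x0))
    return statistics.fmean(votes)

def prediction_variance(use_bagging, x0=0.8, trials=300, seed=0):
    # Variance of the predicted score at x0 across independent training sets.
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        if use_bagging:
            preds.append(bagged_predict(xs, ys, x0, rng))
        else:
            preds.append(tree_predict(xs, ys, x0))
    return statistics.pvariance(preds)

tree_var = prediction_variance(use_bagging=False)
bagged_var = prediction_variance(use_bagging=True)
print(f"single tree variance: {tree_var:.3f}")
print(f"bagged trees variance: {bagged_var:.3f}")
```

The single tree outputs a hard 0 or 1 that depends on one noisy label, whereas the bagged score averages over several nearby labels, which is why its variance across training sets comes out lower.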

So, in essence, if the choice is based on data set size: with little data, go with high bias / low variance models; with a large number of data points, we can experiment with lower bias / higher variance classification algorithms, since they are likely to give better results.