How to tackle bias and variance in an ML classification model

The area under the ROC curve (AUC-ROC) measures a model's ability to separate the response variable into its two categories across all decision thresholds. A higher AUC-ROC means better discrimination, but on its own it does not tell you about variance; what signals high variance is a large gap between the AUC on training data and the AUC on held-out data (over-fitting).
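
As a minimal sketch of this check (the synthetic dataset and the random forest are assumptions standing in for your own data and model), you can compare AUC-ROC on training and held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for your dataset)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# AUC is computed from predicted probabilities, not hard labels
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# A high train AUC with a much lower test AUC suggests over-fitting (high variance)
print(f"train AUC: {auc_train:.3f}, test AUC: {auc_test:.3f}")
```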

A classification model's decision threshold (which defaults to 0.5) can be tuned to control the bias of its predictions. For example, setting a low threshold leads to a model biased toward predicting more positives.
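
A minimal sketch of threshold tuning, assuming a scikit-learn classifier on synthetic data (logistic regression here is just an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]  # probability of the positive class

# Default behaviour: predict positive when the probability is at least 0.5
default_preds = (proba >= 0.5).astype(int)

# Lowering the threshold biases the model toward predicting more positives
low_threshold_preds = (proba >= 0.2).astype(int)

print("positives at threshold 0.5:", default_preds.sum())
print("positives at threshold 0.2:", low_threshold_preds.sum())
```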

Machine learning models estimate some function F which maps inputs X to targets Y (labels, real values, etc.). Bias represents the error/deviation in prediction that arises because our estimator (F_dash) differs from the actual F. A simpler F_dash deviates more from F and hence has a larger bias. Note that you cannot measure absolute bias; you can only tell whether you are increasing or reducing it.
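
To make this concrete, here is a small sketch (the quadratic data-generating F is an assumption) where F_dash is a straight line that is too simple to capture F, so the error it makes even on its own training data is bias:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)  # true F is quadratic plus noise

# F_dash is a straight line: too simple to capture the curvature of F
f_dash = LinearRegression().fit(X, y)

# Poor fit even on the data it was trained on -> systematic (bias) error
print("training R^2:", f_dash.score(X, y))
```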

The parameters of F_dash are estimated from some training data. This data is a sample of the population, so any estimation method will pick up some properties that belong to the sample rather than the population; these properties are called sampling noise. The error in the model due to this noise is called the variance of the F_dash estimator. If the estimator fits the properties of the sample too closely, it will produce errors on new data. This is called over-fitting.
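
A minimal sketch of this effect, assuming the same synthetic quadratic setup as above: a high-degree polynomial F_dash fit to a small sample chases the sampling noise, so it fits the sample almost perfectly but does much worse on fresh data from the same population:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=n)
    return X, y

X_train, y_train = sample(30)   # small sample -> sampling noise matters
X_new, y_new = sample(1000)     # fresh data from the same population

# A degree-15 polynomial has enough parameters to chase the noise
f_dash = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
f_dash.fit(X_train, y_train)

# Near-perfect fit on the sample, much worse on new data: over-fitting
print("train R^2:", f_dash.score(X_train, y_train))
print("new-data R^2:", f_dash.score(X_new, y_new))
```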

Bias and variance usually have an inverse relationship: reducing one tends to increase the other. For example, reducing bias means a more complex F_dash with more parameters, which means a higher chance of noise entering the parameter estimates, leading to an increase in variance.
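
The trade-off shows up when you sweep model complexity; a minimal sketch under the same synthetic assumptions, where training error keeps falling as the degree grows while error on new data falls and then rises again:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(0, 0.5, size=40)
X_new = rng.uniform(-3, 3, size=(1000, 1))
y_new = X_new[:, 0] ** 2 + rng.normal(0, 0.5, size=1000)

for degree in (1, 2, 5, 10, 15):
    f_dash = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    f_dash.fit(X_train, y_train)
    err_train = mean_squared_error(y_train, f_dash.predict(X_train))
    err_new = mean_squared_error(y_new, f_dash.predict(X_new))
    # Low degree: both errors high (bias dominates). High degree: train error
    # keeps dropping while new-data error rises again (variance dominates).
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, new MSE {err_new:.3f}")
```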