The population variance is the sum of squared differences between each data point and the population mean, divided by N, the population size.
However, we usually cannot measure each and every value in a population. Hence, we ‘estimate’ the population variance from a sample. The estimated population variance is the sum of squared differences between each data point and the sample mean, divided by N-1, where N is now the sample size.
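Written out, the two definitions look like this. The symbols are notation assumed here for illustration, not taken from the post: x_i for the data, \mu for the population mean, \bar{x} for the sample mean, \sigma^2 for the population variance and s^2 for its estimate. As in the text, N is the population size in the first formula and the sample size in the second.

```latex
\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad\text{and}\qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2
\]
```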
But why N-1 and not simply N?
The reason is that dividing by N underestimates the population variance on average. Consider a case where you know the population variance. If you run many experiments, each time taking a sample and calculating its variance with N in the denominator, the AVERAGE of those sample variances will come out below the population variance. Any single sample can still land above or below it.
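The repeated-experiment idea can be tried out with a small simulation. This is only a sketch under assumptions that are not in the post: the population is taken to be normal with mean 0 and standard deviation 2 (so the true variance is 4), each sample has 5 values, and the function name mean_variance_estimates is made up for illustration.

```python
import random


def mean_variance_estimates(sample_size=5, trials=20000, sigma=2.0, seed=0):
    """Average the divide-by-N and divide-by-(N-1) estimates over many samples.

    Assumption for illustration: the population is normal with mean 0 and
    standard deviation `sigma`, so the true variance is sigma ** 2.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_with_n = 0.0
    total_with_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        # Sum of squared differences between the data and the sample mean
        sum_sq = sum((x - sample_mean) ** 2 for x in sample)
        total_with_n += sum_sq / sample_size
        total_with_n_minus_1 += sum_sq / (sample_size - 1)
    return total_with_n / trials, total_with_n_minus_1 / trials


avg_with_n, avg_with_n_minus_1 = mean_variance_estimates()
print(f"true variance:           {2.0 ** 2:.3f}")
print(f"average dividing by N:   {avg_with_n:.3f}")
print(f"average dividing by N-1: {avg_with_n_minus_1:.3f}")
```

With these settings the divide-by-N average should settle near 3.2, which is 4 times (N-1)/N, while the divide-by-(N-1) average should settle near the true value of 4.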
In other words, for any given sample, the sum of squared differences between the data and the sample mean is NEVER greater than the sum of squared differences between the same data and the population mean. And how is that?
It turns out that the value around which a sample's sum of squared differences is smallest is, in fact, the sample mean! The sample mean almost never coincides exactly with the population mean, so the variance calculated around the sample mean comes out too small on average and underestimates the actual population variance.
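A short derivation makes the size of the shortfall explicit. The notation is assumed here for illustration (x_i for the data, \bar{x} for the sample mean, \mu for the population mean, \sigma^2 for the population variance), and the second step assumes the N sample values are drawn independently from the population.

```latex
\[
\sum_{i=1}^{N} (x_i - \mu)^2
  = \sum_{i=1}^{N} (x_i - \bar{x})^2 + N(\bar{x} - \mu)^2
\]
% The last term is never negative, so the sum around the sample mean
% can never exceed the sum around the population mean.
% Taking expectations, with E[(x_i - \mu)^2] = \sigma^2 and
% E[(\bar{x} - \mu)^2] = \sigma^2 / N for independent draws:
\[
E\!\left[\sum_{i=1}^{N} (x_i - \bar{x})^2\right]
  = N\sigma^2 - \sigma^2 = (N-1)\,\sigma^2
\]
```

So on average the sum of squares around the sample mean equals N-1 times the population variance, not N times.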
Thus, to correct for this effect, we replace N by N-1, which makes the estimate unbiased. This is known as Bessel's correction.
#statistics #datascience