How to get the right value of K in K-Means?

The choice of k is always critical. In fact, we have to pay attention to two different scenarios:

  • small k: because you only look at points very close to the observation, you will have low bias, but you may suffer from high variance, since a few outliers can sway the prediction
  • large k: if you enlarge your neighborhood you will be more robust to outliers (low variance), but you will end up with a higher bias because you will probably include points that are not so close.

The bias-variance tradeoff is always just around the corner!

OK, great, but how can we select it? Unfortunately there is no absolute answer; we have to try different values.

A common practice is to plot the reference error metric, measured on data the model was not fitted on, for different values of k. By doing so, you can see with your own eyes how your model behaves and what a good trade-off might be.
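
As a rough illustration of that practice, here is a minimal sketch in plain Python. It assumes that k is the number of neighbors in a nearest-neighbor classifier, as the neighborhood description above suggests, and that the reference metric is the misclassification rate on a held-out validation set. The synthetic two-blob data, the helper names (make_data, knn_predict, validation_error) and the candidate list ks are all made up for the example, and a text bar stands in for a real plot.

```python
import random
from collections import Counter

def make_data(n, rng):
    """Two overlapping 2-D Gaussian blobs labelled 0 and 1 (synthetic, illustration only)."""
    points = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 1.5
        points.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return points

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def validation_error(train, valid, k):
    """Fraction of validation points that are misclassified for a given k."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)
valid = make_data(100, rng)

ks = [1, 3, 5, 9, 15, 25, 41]   # odd values avoid ties in a two-class vote
errors = {k: validation_error(train, valid, k) for k in ks}
best_k = min(errors, key=errors.get)

# Crude text "plot": one '#' per percentage point of validation error.
for k in ks:
    bar = "#" * round(errors[k] * 100)
    print(f"k={k:>2}  error={errors[k]:.2f}  {bar}")
print("lowest validation error at k =", best_k)
```

On real data you would swap the synthetic blobs for your own train/validation split (or cross-validation) and draw the curve with a plotting library. The k with the lowest error is only a starting point, since neighboring values with nearly the same error may be just as reasonable.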