Outliers and k-means clustering

It’s a no surprise that K-means clustering involves “mean” and “distances” that makes it extremely sensitive to the outliers. One traditional approach is to first remove the outliers, by combination of domain and statistical knowledge. This however is very time consuming. Consider k-mediods instead of k-means in that case, if the number of outliers are not proportionately large. However, if you have over 20% that are outliers or bad data, consider cleaning the data thoroughly.

Because the mean, as a statistic, is generally sensitive to outliers.

The mean of 2,2,2,3,3,3,4,4,42,2,2,3,3,3,4,4,4 is 33

If we add a single 2323 to that, the mean becomes 55, which is larger than any of the other values.

Since in k-means, you’ll be taking the mean a lot, you wind up with a lot of outlier-sensitive calculations.

That’s why we have the k-medians algorithm. It just uses the median rather than the mean and is less sensitive to outliers.