How to identify outliers in Machine Learning?

An outlier is a data point that is noticeably different from the rest. They represent errors in measurement, bad data collection, or simply show variables not considered when collecting the data.

Following are the Outliers Detection methods:

  • Sorting datasheet to find others: Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your data sheet for each variable and then look for unusually high or low values.

  • Graphing Your Data: Boxplots, histograms, and scatterplots can highlight outliers. Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers.

  • Using Z-scores: Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.

  • Hypothesis Tests: You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. In this post, I demonstrate Grubbs’ test, which tests the following hypotheses:

  1. Null: All values in the sample were drawn from a single population that follows the same normal distribution.
  2. Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.