Outlier Treatment
- Mean/Median or random Imputation
- Trimming
- Top, Bottom and Zero Coding
- Discretization
Mean/Median or random Imputation:
If we have reason to believe that the outliers are due to mechanical error or problems during measurement, then they are similar in nature to missing data, and any method used for missing-data imputation can be used to replace them. The number of outliers is small (otherwise they wouldn’t be called outliers), so it’s reasonable to use mean, median, or random imputation to replace them.
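As a minimal sketch of median imputation, the snippet below flags outliers and replaces them with the median of the remaining values. The text does not say how outliers are detected, so the 1.5 × IQR rule used here is an assumption, and the names `iqr_bounds`, `impute_outliers_with_median`, and `readings` are hypothetical.

```python
from statistics import median, quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (an assumed detection rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def impute_outliers_with_median(values, k=1.5):
    """Replace values outside the fences with the median of the inliers."""
    lower, upper = iqr_bounds(values, k)
    inliers = [v for v in values if lower <= v <= upper]
    fill = median(inliers)
    return [v if lower <= v <= upper else fill for v in values]


# Illustrative data: 250 looks like a measurement error.
readings = [10, 12, 11, 13, 12, 11, 250, 12]
cleaned = impute_outliers_with_median(readings)
print(cleaned)
```

Swapping `median` for `mean`, or drawing the fill value at random from the inliers, would give the mean and random variants of the same idea.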
Trimming:
In this method, we discard the outliers completely. That is, we eliminate the data points that are considered outliers. Trimming is a good and fast approach when it does not remove a large number of values from the dataset.
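Trimming could be sketched as a simple filter. As before, the 1.5 × IQR fences are an assumed detection rule rather than something the text prescribes, and `trim_outliers` and `readings` are hypothetical names.

```python
from statistics import quantiles


def trim_outliers(values, k=1.5):
    """Drop every value outside the k * IQR fences (an assumed detection rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Illustrative data: 250 is removed, everything else is kept.
readings = [10, 12, 11, 13, 12, 11, 250, 12]
trimmed = trim_outliers(readings)
print(trimmed)
print(f"removed {len(readings) - len(trimmed)} of {len(readings)} values")
```

Printing how many values were removed is a quick check that trimming is not discarding a large share of the dataset.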
Top / bottom / zero Coding:
Top coding means capping the maximum of the distribution at an arbitrarily set value. A top-coded variable is one for which data points above an upper bound are censored. With top coding, an outlier is capped at a certain maximum value and so looks like many other observations.
Bottom coding is analogous but applies to the left side of the distribution: all values below a certain threshold are capped to that threshold. If the threshold is zero, this is known as zero-coding. For example, variables like “age” or “earnings” cannot take negative values, so it’s reasonable to cap the lowest value at zero.
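Top, bottom, and zero coding can all be expressed with one small capping helper. This is only a sketch: the function name `cap`, the sample `earnings` values, and the cap of 10000 are hypothetical choices for illustration.

```python
def cap(values, lower=None, upper=None):
    """Cap values below `lower` and/or above `upper` to those thresholds."""
    capped = []
    for v in values:
        if lower is not None and v < lower:
            v = lower  # bottom coding (zero-coding when lower == 0)
        if upper is not None and v > upper:
            v = upper  # top coding
        capped.append(v)
    return capped


# Illustrative data with one impossible negative value and one extreme value.
earnings = [-50, 1200, 3400, 2100, 98000]

top_coded = cap(earnings, upper=10000)
zero_coded = cap(earnings, lower=0)
both = cap(earnings, lower=0, upper=10000)

print(top_coded)
print(zero_coded)
print(both)
```

Unlike trimming, capping keeps every observation, so the dataset size is unchanged.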
Discretization:
Discretization is the process of transforming continuous variables into discrete variables by creating a set of contiguous intervals that spans the range of the variable’s values. Thus, these outlier observations no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval/bucket.
There are several approaches to transforming continuous variables into discrete ones. This process is also known as binning, with each interval being one bin.
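One possible approach, sketched below, is binning with hand-picked interval edges where the first and last bins are open-ended, so extreme values land in the same bin as the other tail values. The edges `[20, 40, 60]`, the sample `ages`, and the name `discretize` are assumptions made for illustration.

```python
from bisect import bisect_right


def discretize(values, edges):
    """Map each value to the index of its bin.

    With sorted edges [e0, e1, ..., ek], bin 0 is (-inf, e0),
    bin i is [e(i-1), e(i)), and bin k + 1 is [ek, +inf).
    """
    return [bisect_right(edges, v) for v in values]


# Illustrative data: 110 is an extreme value.
ages = [18, 25, 37, 42, 58, 61, 75, 110]
edges = [20, 40, 60]  # hypothetical, hand-picked bin edges

bins = discretize(ages, edges)
print(bins)
```

Here 110 falls into the same bin as 61 and 75, so it no longer stands apart from the rest of the upper tail.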