Data pre-processing techniques
- Winsorize (cap at threshold).
- Transform to reduce skew (using Box-Cox or similar).
- Remove outliers if you’re certain they are anomalies or measurement errors.
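The capping and transformation options above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names `winsorize` and `box_cox`, the 5th/95th percentile thresholds, the fixed `lam` value, and the sample data are all assumptions made for the example. In practice the Box-Cox parameter is usually estimated from the data rather than fixed by hand.

```python
import math
import statistics


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values at the given percentiles (thresholds are assumed defaults)."""
    # With n=100, quantiles() returns the 99 percentile cut points.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Values outside [lo, hi] are pulled in to the nearest bound, not dropped.
    return [min(max(v, lo), hi) for v in values]


def box_cox(values, lam=0.0):
    """Box-Cox transform with a fixed lambda; lam=0 is the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical sample with one extreme value.
data = [12.0, 13.5, 11.8, 12.9, 13.1, 12.4, 95.0]

capped = winsorize(data)          # extreme value is capped, sample size unchanged
logged = box_cox(data, lam=0.0)   # log transform compresses the long right tail
```

Winsorizing keeps every observation and only limits its magnitude, which is why it is often preferred over removal when it is unclear whether an extreme value is a genuine error.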
One of the most important steps in data pre-processing is outlier detection and treatment. Many machine learning algorithms are sensitive to the range and distribution of their input data. Outliers can mislead the training process, resulting in longer training times and less accurate models. Common methods for dealing with them include:
- Univariate method: This method looks for data points with extreme values in a single variable.
- Multivariate method: This method looks for unusual combinations of values across all the variables.
- Minkowski error: Rather than detecting outliers, this error measure reduces the contribution of potential outliers to the training process.
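As a rough sketch of two of the methods listed above, the following standard-library Python flags univariate outliers and computes a Minkowski-style error. Several details are assumptions rather than something the text specifies: the univariate check uses Tukey's fences with the conventional factor `k=1.5`, the Minkowski exponent defaults to `p=1.5`, the error is averaged over the number of samples, and the function names and sample data are hypothetical. The multivariate method is not sketched here.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Univariate check: return values outside Tukey's fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


def minkowski_error(y_true, y_pred, p=1.5):
    """Mean of |error|**p. With p=2 this equals the mean squared error;
    an exponent below 2 gives large errors (likely outliers) less weight."""
    return sum(abs(t - y) ** p for t, y in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sample with one extreme value.
data = [12.0, 13.5, 11.8, 12.9, 13.1, 12.4, 95.0]
flagged = iqr_outliers(data)

# Hypothetical targets and predictions with one very large error.
y_true = [1.0, 2.0, 3.0, 100.0]
y_pred = [1.0, 2.0, 3.0, 0.0]
mse = minkowski_error(y_true, y_pred, p=2)      # dominated by the single large error
mink = minkowski_error(y_true, y_pred, p=1.5)   # same error contributes far less
```

Comparing `mse` and `mink` on the same predictions shows the intended effect: lowering the exponent shrinks the share of the loss that comes from the one extreme point, so it pulls less on the fitted model.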