Outliers can be thought of as rare instances with extreme values that lie far from the majority of the data points. They might creep in through measurement errors or data corruption, or they might be genuine observations.
Outliers are usually removed so that they don’t hamper the learning of the model. However, outliers might contain the most crucial information in your data, and removing them blindly can be a serious mistake. How outliers should be treated depends on the domain and the dataset.
For Gaussian data, one way to check for outliers is to measure how many standard deviations a data point lies from the mean. Three standard deviations is a common cut-off for identifying outliers, but the right threshold again depends on the domain.
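A minimal sketch of this rule, assuming NumPy and a synthetic 1-D array with two injected extreme values:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 1000), [8.0, -9.0]])  # toy data with two injected outliers

z_scores = (x - x.mean()) / x.std()   # standard deviations from the mean
outliers = x[np.abs(z_scores) > 3]    # 3-sigma cut-off; tune per domain
print(outliers)                       # flags the injected extreme values
```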
Another way to detect outliers in non-Gaussian data is the Interquartile Range (IQR), the difference between the 75th and the 25th percentiles of the data. Data points more than 1.5 IQRs below the 25th percentile or above the 75th percentile are usually considered outliers.
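A similar sketch for the IQR fences, again on made-up skewed data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.exponential(2.0, 1000), [40.0, 55.0]])  # skewed data plus two extremes

q1, q3 = np.percentile(x, [25, 75])            # 25th and 75th percentiles
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences
outliers = x[(x < lower) | (x > upper)]
print(outliers)  # the injected extremes, plus any natural tail points beyond the fences
```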
Alternatively, the model itself can be made more robust to outliers, for example by using a robust loss such as Mean Absolute Error or the Huber loss instead of Mean Squared Error. Moreover, algorithms like decision trees are inherently more robust to outliers.
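As one illustrative option (not the only robust approach), scikit-learn’s HuberRegressor can be compared against an ordinary least-squares fit on synthetic data where a few targets are deliberately corrupted:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, 100)  # true slope is 2.0
y[:5] += 30                                    # corrupt a few targets with large outliers

ols = LinearRegression().fit(X, y)    # squared-error fit, pulled toward the outliers
huber = HuberRegressor().fit(X, y)    # Huber loss downweights large residuals
print(ols.coef_, huber.coef_)         # the Huber slope stays closer to 2.0
```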
#datascience #machinelearning