Should outliers be removed before correlation analysis?

Outliers will have an impact on the correlation coefficient. For example, say we have ‘patient temperature’ (x) and a ‘critical illness score’ (y) that measures how critical the patient is. Now consider the following values of x and y:
x : [96, 98, 99, 101, 110, 58, 65]
y : [0.1, 0.05, 0.3, 0.9, 0.4, 0.8, 0.95]

Clearly, when the temperature is 101 F the patient is likely to be more critical. The readings 110, 58 and 65, however, are bad measurements where the temperature was probably not recorded correctly. If all the data points are kept, temperature will no longer appear correlated with patient criticality.
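
Here is a minimal sketch of that effect using NumPy and SciPy (my own choice of tools, not part of the original answer); the 95–105 F plausibility window used to drop the suspect readings is just for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data from the example above: temperature (x) and criticality score (y)
x = np.array([96, 98, 99, 101, 110, 58, 65])
y = np.array([0.1, 0.05, 0.3, 0.9, 0.4, 0.8, 0.95])

# Pearson correlation on all readings, including the suspect ones
r_all, _ = pearsonr(x, y)

# Drop the suspect readings (110, 58, 65) and recompute
mask = (x >= 95) & (x <= 105)   # crude plausibility window, for illustration only
r_clean, _ = pearsonr(x[mask], y[mask])

print(f"r with outliers:    {r_all:.2f}")    # negative on this toy data
print(f"r without outliers: {r_clean:.2f}")  # strongly positive
```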

The impact is especially large when the outliers are far from the magnitude of the good data.

So yes, consider removing outliers before running the correlation analysis.

A basic assumption of (Pearson) correlation is that there are not many outliers. In my opinion you can follow two routes: a) detect outliers using casewise diagnostics and delete them (but you have to justify this), or b) use a non-parametric alternative (Spearman’s rank-order correlation or Kendall’s tau correlation). Both are less sensitive to outliers.
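
A minimal sketch of how the two rank-based alternatives are computed with SciPy, reusing the toy temperature data from the question (rank-based measures depend only on the ordering of the values, so extreme magnitudes carry less weight than in Pearson’s r):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([96, 98, 99, 101, 110, 58, 65])
y = np.array([0.1, 0.05, 0.3, 0.9, 0.4, 0.8, 0.95])

# Pearson uses the raw magnitudes; Spearman and Kendall use only the ranks
r, _   = pearsonr(x, y)
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print(f"Pearson r    : {r:.2f}")
print(f"Spearman rho : {rho:.2f}")
print(f"Kendall tau  : {tau:.2f}")
```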

I personally do not see a big problem with removing outliers from a data set. In biology, for example, there are usually some outliers; this is one reason (among others) why we use replicates. What is important is to remove outliers correctly. Dixon’s test can be used for this if your data are normally distributed. There is also Grubbs’ test, but it detects only one outlier per application, which makes it risky to use on large data sets.
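
Since Grubbs’ test was mentioned, here is a hedged sketch of how it can be coded by hand with SciPy (the `grubbs_test` helper is my own name, not a standard library function). It assumes approximately normal data and, as noted above, tests for only one outlier per application.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (is_outlier, index_of_most_extreme_value).
    Assumes the data are approximately normally distributed.
    """
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)

    # Test statistic: largest absolute deviation from the mean, in SD units
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd

    # Critical value derived from the t distribution (two-sided, level alpha)
    t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g > g_crit, idx

temps = [96, 98, 99, 101, 58]
flagged, i = grubbs_test(temps)
print(f"Most extreme value: {temps[i]}, flagged as outlier: {flagged}")
```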

