Is it possible to capture the correlation between continuous and categorical variable?
There are three big-picture methods to understand if a continuous and categorical are significantly correlated — point biserial correlation, logistic regression, and Kruskal Wallis H Test.
Point biserial Correlation
The point biserial correlation coefficient is a special case of Pearson’s correlation coefficient. I am not going to go in the mathematical details of how it is calculated, but you can read more about it here. I will highlight three important points to keep in mind though:
- Similar to the Pearson coefficient, the point biserial correlation can range from -1 to +1.
- The point biserial calculation assumes that the continuous variable is normally distributed and homoscedastic.
- If the dichotomous variable is artificially binarized, i.e. there is likely continuous data underlying it, biserial correlation is a more apt measurement of similarity. There is a simple formula to calculate the biserial correlation from point biserial correlation, but nonetheless this is an important point to keep in mind.
The idea behind using logistic regression to understand correlation between variables is actually quite straightforward and follows as such: If there is a relationship between the categorical and continuous variable, we should be able to construct an accurate predictor of the categorical variable from the continuous variable. If the resulting classifier has a high degree of fit, is accurate, sensitive, and specific we can conclude the two variables share a relationship and are indeed correlated.
There are a number of positive things about this approach. Logistic regression does not make many of the key assumptions of linear regression and other models that are based on least squares algorithms — particularly regarding linearity, normality, homoscedasticity, and measurement level. However, I should note here logistic regression does assume that there is a linear relationship between the predictors and the logit of the outcome variable and often times not only is this assumption invalid, it is also not straightforward to verify. I would advise the user to keep this in mind before using logistic regression. On the positive side, since we only have one feature for prediction, there is no problem of multicollinearity similar to other applications of logistic regression.
Kruskal-Wallis H Test (Or parametric forms such as t-test or ANOVA)
— Estimate variance explained in continuous variable using the discrete variable
The final family of methods to estimate association between a continuous and discrete variable rely on estimating the variance of the continuous variable, which can be explained through the categorical variable. There are many ways to do this. A simple approach could be to group the continuous variable using the categorical variable, measure the variance in each group and comparing it to the overall variance of the continuous variable. If the variance after grouping falls down significantly, it means that the categorical variable can explain most of the variance of the continuous variable and so the two variables likely have a strong association. If the variables have no correlation, then the variance in the groups is expected to be similar to the original variance.
Another approach which is more statistically robust and supported by a lot of theoretical work is using one-way ANOVA or its non-parametric forms such as the Kruskal-Wallis H test. A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates another sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups stochastic dominance obtains. For analyzing the specific sample pairs for stochastic dominance in post hoc testing, Dunn’s test, pairwise Mann-Whitney tests without Bonferroni correction, or the more powerful but less well-known Conover–Iman test are appropriate or t-tests when you use an ANOVA…might be worth calling that out. Since it is a non-parametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. I should point out that though ANOVA or Kruskal-Wallis test can tell us about statistical significance between two variables, it is not exactly clear how these tests would be converted into an effect size or a number which describes the strength of association.