What is Multicollinearity? How do you detect Multicollinearity?

Multicollinearity: This phenomenon exists when the independent variables are moderately or highly correlated with one another. In a model with correlated variables, it becomes a tough task to figure out the true relationship of each predictor with the response variable. In other words, it becomes difficult to work out which variables are actually contributing to predicting the response.

Another point: in the presence of correlated predictors, the standard errors of the coefficient estimates tend to increase. And with large standard errors, the confidence intervals become wider, leading to less precise estimates of the slope parameters.

Also, when predictors are correlated, the estimated regression coefficient of a correlated variable depends on which other predictors are included in the model. If this happens, you'll end up with an incorrect conclusion that a variable strongly (or weakly) affects the target variable, since even dropping one correlated variable from the model changes the estimated coefficients of the others. That's not good!

How to check: You can use a scatter plot to visualize the correlation among variables. You can also use the Variance Inflation Factor (VIF). A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. Above all, a simple correlation table should also serve the purpose.
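Both checks can be sketched with plain NumPy on simulated data (the `vif` helper and the data here are illustrative, not from the original answer): VIF for predictor j is 1 / (1 − R²_j), where R²_j comes from regressing column j on all the other columns.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)   # highly correlated with x1
x3 = rng.normal(size=200)                      # independent of the others
X = np.column_stack([x1, x2, x3])

print(np.round(np.corrcoef(X, rowvar=False), 2))  # the correlation table
print(np.round(vif(X), 1))  # x1 and x2 get very large VIFs, x3 stays near 1
```

In practice you would more likely use `variance_inflation_factor` from statsmodels or `DataFrame.corr()` in pandas; the hand-rolled version above just makes the definition explicit.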

Multicollinearity is also present when, among more than two variables, there is enough joint correlation that one (or more) of them can be dropped with little or no effect on the explanatory power of the remaining ones. This can happen even when no two of them have a very high pairwise correlation, but it has the same effect as when there are just two highly correlated variables: using all of them gives little benefit over using a subset.

This is a significant problem in multiple regression when doing a "stepwise regression" ANOVA analysis, because the variables entered later account for less and less of the total sum of squares of the data. So the order of the variables/steps makes a major difference to the interpretation of the effect of each variable.

One way of avoiding this problem is to run multiple analyses, with each variable in turn being the first variable entered in the ANOVA.
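The order-dependence can be sketched with sequential (Type I) sums of squares, computed here from scratch with NumPy on simulated data (the `seq_ss` helper and the data are illustrative, not from the original answer): with correlated predictors, the SS credited to a variable depends heavily on whether it enters first or second, even though the total explained SS is identical.

```python
import numpy as np

def seq_ss(cols, y):
    """Sequential (Type I) sum of squares: SS added as each column enters the model."""
    n = len(y)
    A = np.ones((n, 1))                      # start from the intercept-only model
    prev_rss = ((y - y.mean()) ** 2).sum()
    ss = []
    for col in cols:
        A = np.column_stack([A, col])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = ((y - A @ beta) ** 2).sum()
        ss.append(prev_rss - rss)            # reduction in RSS from adding this column
        prev_rss = rss
    return ss

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)     # correlated with x1
y = x1 + x2 + rng.normal(size=n)

ss_12 = seq_ss([x1, x2], y)  # x1 entered first
ss_21 = seq_ss([x2, x1], y)  # x2 entered first
print([round(s, 1) for s in ss_12])
print([round(s, 1) for s in ss_21])
# x1's sequential SS is far larger when it enters first than when it enters second,
# yet the total explained SS is the same either way.
```

Running the analysis once per ordering, as suggested above, amounts to comparing rows like `ss_12` and `ss_21` for every variable.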

Another way is to do a Principal Component Analysis (PCA) to get fewer variables which are completely uncorrelated (and so the order in which they are entered has no effect). The downside of this (IMHO very good) approach is that the new variables may be hard to interpret.
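A minimal PCA sketch, using only NumPy's SVD on simulated data (illustrative, not from the original answer), shows the key property: the component scores are mutually uncorrelated, so their entry order is irrelevant.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T   # principal component scores

# The components are uncorrelated (off-diagonals are zero up to float precision).
print(np.round(np.corrcoef(scores, rowvar=False), 6))
# Share of variance explained by each component:
print(np.round(s**2 / (s**2).sum(), 2))
```

In practice `sklearn.decomposition.PCA` does the same thing; the interpretation downside mentioned above is that each component is a mixture of the original variables (the rows of `Vt`), not a single measured quantity.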