What is collinearity and what to do with it? How to remove multicollinearity?
A common way to identify multicollinearity is to calculate the Variance Inflation Factor (VIF) for every independent variable in the dataset.
The VIF tells us how well an independent variable can be predicted from the other independent variables. Let’s understand this with the help of an example.
Suppose we have 9 independent variables. To calculate the VIF of variable V1, we isolate V1 and treat it as the target variable, while all the other variables are treated as predictor variables.
We train a regression model on these predictor variables and record the corresponding R² value.
Using this R² value, we compute the VIF with the following formula:
$$\mathrm{VIF} = \frac{1}{1 - R^2}$$
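The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes NumPy is available, uses ordinary least squares with an intercept as the regression model, and applies the standard definition VIF = 1 / (1 - R²). The function name `vif` and the synthetic variables `v1`, `v2`, `v3` are made up for the example.

```python
import numpy as np

def vif(X, j):
    """VIF of column j of X, found by regressing it on the remaining columns."""
    y = X[:, j]                        # the isolated variable becomes the target
    others = np.delete(X, j, axis=1)   # all other variables act as predictors
    # Add an intercept column so the auxiliary regression has a constant term.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

# Synthetic data (an assumption for illustration): v1 is mostly determined by v2,
# while v3 is independent of both.
rng = np.random.default_rng(0)
n = 500
v2 = rng.normal(size=n)
v3 = rng.normal(size=n)
v1 = 2.0 * v2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([v1, v2, v3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

In this setup the first two columns should show a high VIF and the third a VIF close to 1. Note that a variable that is an exact linear combination of the others has R² = 1, so the formula divides by zero.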
The formula shows that as the R² value increases toward 1, the VIF value also increases, and it does so without bound. A higher R² value signifies that:
“the target independent variable is very well explained by the other independent variables”
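To make the relationship concrete, the short sketch below evaluates the standard formula VIF = 1 / (1 - R²) for a few R² values. The helper name `vif_from_r2` is chosen only for this example.

```python
def vif_from_r2(r2):
    """Convert the R² of the auxiliary regression into a VIF."""
    return 1.0 / (1.0 - r2)

for r2 in (0.0, 0.5, 0.8, 0.9, 0.99):
    print(f"R2 = {r2:.2f} -> VIF = {vif_from_r2(r2):.1f}")
```

An R² of 0 gives the minimum VIF of 1, while R² values of 0.8 and 0.9 correspond to VIF values of about 5 and 10.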
Now what should be the VIF threshold value to decide whether the variable should be removed or not?
A VIF as small as possible is desirable, but an overly strict threshold can cause many significant independent variables to be removed from the dataset. Therefore VIF = 5, which corresponds to R² = 0.8, is often taken as the threshold: any independent variable with a VIF greater than 5 is removed. The ideal threshold value, however, depends on the problem at hand.
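One way to apply such a threshold is sketched below. It is an assumption about procedure rather than something the text prescribes: instead of dropping every variable above the threshold at once, it removes the variable with the largest VIF, recomputes the VIFs, and repeats, because removing one variable usually lowers the VIF of the variables it was correlated with. The names `vif`, `drop_high_vif`, and the synthetic columns are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

def vif(X, j):
    """VIF of column j, from an OLS regression on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return ss_tot / ss_res  # algebraically equal to 1 / (1 - R²)

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic data (illustrative only): V1 and V2 are strongly related, V3 is not.
rng = np.random.default_rng(0)
n = 500
v2 = rng.normal(size=n)
v3 = rng.normal(size=n)
v1 = 2.0 * v2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([v1, v2, v3])

X_reduced, kept = drop_high_vif(X, ["V1", "V2", "V3"], threshold=5.0)
print(kept)
```

With this data only one of V1 and V2 needs to go: once it is dropped, the remaining variables are no longer well explained by each other. Dropping both at once would discard more information than necessary.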