An important assumption of multiple linear regression is that there is no exact linear relationship among the regressors. If that assumption is violated, it's simply impossible to estimate the regression.
Consider this simple dataset:
| | X1 | X2 | Y |
|---|---|---|---|
| | 1 | 3 | 0.040 |
| | 2 | 5 | 0.045 |
| | 3 | 7 | 0.061 |
| | 4 | 9 | 0.066 |
| | 5 | 11 | 0.072 |
Is $$ Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + \epsilon_i $$ something you think you can test?
No, you can't actually. There's a perfect linear relationship between _X_1 and _X_2.
Right.
It might not be obvious right away, but here
$$\displaystyle X_2 = 2X_1 + 1$$.
That perfect linear relationship will either produce an error in a multiple regression, or the software will simply pick one of the variables and omit it.
Another way of seeing this is to calculate the correlation between _X_1 and _X_2. You'll find that it's exactly +1.0, so the multiple regression can't be done.
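If you'd like to verify this yourself, here is a minimal sketch in Python, assuming numpy is available, using the data from the table above. It shows the +1.0 correlation and the rank-deficient design matrix that makes the coefficients impossible to estimate:

```python
import numpy as np

# Data from the first table above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3, 5, 7, 9, 11], dtype=float)

# The correlation between the two regressors is exactly +1.0
print(np.corrcoef(x1, x2)[0, 1])

# The design matrix [1, X1, X2] is rank deficient (rank 2, not 3),
# so (X'X) can't be inverted and the OLS coefficients aren't unique
X = np.column_stack([np.ones_like(x1), x1, x2])
print(np.linalg.matrix_rank(X))
```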
But suppose you look back at the data, and you find that some rounding was done. The _X_2 values are a little different:
| | X1 | X2 | Y |
|---|---|---|---|
| | 1 | 3.01 | 0.040 |
| | 2 | 5.02 | 0.045 |
| | 3 | 6.99 | 0.061 |
| | 4 | 8.98 | 0.066 |
| | 5 | 11.01 | 0.072 |
Will a regression work now?
Here is the output:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.1834 | 0.0806 | 2.2768 | 0.1505 |
| X1 | 0.3079 | 0.1584 | 1.9428 | 0.1915 |
| X2 | -0.1500 | 0.0794 | -1.8892 | 0.1995 |
| R2 = 0.985 | | | | |
Wow. Look at that R-squared value. That's amazing!
Right.
The correlation between _X_1 and _X_2 is now 0.999988, so you'll get an output. But don't celebrate just yet.
Incorrect.
There's no perfect correlation between _X_1 and _X_2 anymore. The correlation is now 0.999988, so you'll get an output. But don't celebrate just yet.
How would you characterize the ability of these two regressors to explain variation in the dependent variable?
Well, no. It's true that both regressors appear insignificant. But you can't get such a high R-squared without the regressors doing their job.
Actually, their "togetherness" is the problem here. Consider that the two regressors have a correlation that is almost perfect.
Exactly!
The high R-squared means that, together, they're definitely doing a good job. But the output gives you no idea how each regressor is doing by itself. Both appear insignificant right now, and that's the problem of __multicollinearity__: an almost perfect linear relationship between two or more independent variables. It produces outputs like this one, where the model as a whole seems to be doing well, but the individual regressors show little significance.
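Here's a minimal sketch, assuming the statsmodels library is available, that refits the multiple regression on the rounded data above and computes variance inflation factors (VIFs), a common diagnostic for multicollinearity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Rounded data from the second table above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3.01, 5.02, 6.99, 8.98, 11.01])
y = np.array([0.040, 0.045, 0.061, 0.066, 0.072])

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.rsquared)   # high R-squared, despite...
print(fit.pvalues)    # ...large p-values on both slope coefficients

# With a regressor correlation of 0.999988, the VIFs are enormous
# (a common rule of thumb flags anything above 10)
for i in (1, 2):
    print(variance_inflation_factor(X, i))
```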
In fact, here is the output from two separate regressions, using each independent variable by itself. Using just _X_1:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.0313 | 0.0034 | 9.0951 | 0.0028 |
| X1 | 0.0085 | 0.0010 | 8.1918 | 0.0038 |
| R2 = 0.957 | | | | |
And just _X_2:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.0270 | 0.0040 | 6.7579 | 0.0066 |
| X2 | 0.0043 | 0.0005 | 8.0381 | 0.0040 |
| R2 = 0.956 | | | | |
Given the very close connection between these two regressors, what do you expect would happen to the R-squared and the remaining coefficient _p_-value if two regressions were performed, one for each of the two independent variables by themselves?
Yes!
Not quite. Remember that there is a lot of significance to be found here, as long as it is properly uncovered.
No. Consider that, in this case, there isn't a lot of information being lost if a regressor is removed. The R-squared shouldn't drop too far.
The R-squared in each case still shows more than 95% of the variation being explained. The _p_-value drops like a rock to a very low level in each case, and the coefficient on _X_2 even switches sign, demonstrating just how unstable the estimates are when multicollinearity is present.
The significance was there all along, but it was hidden by the multicollinearity. In more complex and less obvious situations, you'll notice the same thing: strong overall model significance, but suspiciously weak individual regressors.
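If you want to reproduce those two one-variable regressions yourself, here is a minimal sketch, again assuming statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Rounded data from the tables above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3.01, 5.02, 6.99, 8.98, 11.01])
y = np.array([0.040, 0.045, 0.061, 0.066, 0.072])

for name, x in (("X1", x1), ("X2", x2)):
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    # Each one-regressor model keeps R-squared above 0.95,
    # and the slope's p-value is now far below 0.05
    print(name, round(fit.rsquared, 3), round(fit.pvalues[1], 4))
```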
To sum up:
[[summary]]
No, that wouldn't happen. There isn't a lot of information being lost if a regressor is removed in this case. The R-squared shouldn't drop too far.
Sure
Not a chance
Yes
Not likely
Neither explains the variation
They explain it well, but only together
They each explain the variation, and to about the same degree
R-squared would drop _slightly_, and the _p_-value of the remaining regressor would be _far_ lower
R-squared would drop _slightly_, and the _p_-value of the remaining regressor would be _slightly_ lower
R-squared would drop _significantly_, and the _p_-value of the remaining regressor would be _far_ lower
R-squared would drop _significantly_, and the _p_-value of the remaining regressor would be _slightly_ lower
Continue
Continue
Continue
Continue
Continue
Continue