An important assumption of multiple linear regression is that there is no exact linear relationship among the regressors. If that assumption is violated, it's simply impossible to estimate the regression.
Consider this simple dataset:
| | X1 | X2 | Y |
|---|---|---|---|
| | 1 | 3 | 0.040 |
| | 2 | 5 | 0.045 |
| | 3 | 7 | 0.061 |
| | 4 | 9 | 0.066 |
| | 5 | 11 | 0.072 |
Is $$ Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + \epsilon_i $$ something you think you can test?
No, you can't actually. There's a perfect linear relationship between _X_1 and _X_2.
Right.
It might not be obvious right away, but here
$$\displaystyle X_2 = 2X_1 + 1$$.
That perfect linear relationship will either produce an error in a multiple regression, or the software will simply pick one of the variables and omit it.
Another way of seeing this is to calculate the correlation between _X_1 and _X_2. You'll find that it's exactly +1.0, so the multiple regression can't be done.
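If you'd like to verify this yourself, here is a minimal sketch in Python, assuming numpy is available, using the data from the table above. It shows the +1.0 correlation and the rank-deficient design matrix that makes the coefficients impossible to estimate:

```python
import numpy as np

# Data from the first table above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3, 5, 7, 9, 11], dtype=float)

# The correlation between the two regressors is exactly +1.0
print(np.corrcoef(x1, x2)[0, 1])

# The design matrix [1, X1, X2] is rank deficient (rank 2, not 3),
# so (X'X) can't be inverted and the OLS coefficients aren't unique
X = np.column_stack([np.ones_like(x1), x1, x2])
print(np.linalg.matrix_rank(X))
```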
But suppose you look back at the data, and you find that some rounding was done. The _X_2 values are a little different:
| | X1 | X2 | Y |
|---|---|---|---|
| | 1 | 3.01 | 0.040 |
| | 2 | 5.02 | 0.045 |
| | 3 | 6.99 | 0.061 |
| | 4 | 8.98 | 0.066 |
| | 5 | 11.01 | 0.072 |
Will a regression work now?
Here is the output:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.1834 | 0.0806 | 2.2768 | 0.1505 |
| X1 | 0.3079 | 0.1584 | 1.9428 | 0.1915 |
| X2 | -0.1500 | 0.0794 | -1.8892 | 0.1995 |
| R2 = 0.985 | | | | |
Wow. Look at that R-squared value. That's amazing!
Right.
The correlation between _X_1 and _X_2 is now 0.999988, so you'll get an output. But don't celebrate just yet.
Incorrect.
There's no perfect correlation between _X_1 and _X_2 anymore. The correlation is now 0.999988, so you'll get an output. But don't celebrate just yet.
How would you characterize the ability of these two regressors to explain variation in the dependent variable?
Well, no. It's true that both regressors appear insignificant. But you can't get such a high R-squared without the regressors doing their job.
Actually, their "togetherness" is the problem here. Consider that the two regressors have a correlation that is almost perfect.
Exactly!
The high R-squared means that, together, they're definitely doing a good job. But the output gives you no idea how each regressor is doing by itself. Both appear insignificant right now, and that's the problem of __multicollinearity__: an almost perfect linear relationship between two or more independent variables. It produces outputs like this one, where the model as a whole seems to be doing well, but the individual regressors show little significance.
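Here's a minimal sketch, assuming the statsmodels library is available, that refits the multiple regression on the rounded data above and computes variance inflation factors (VIFs), a common diagnostic for multicollinearity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Rounded data from the second table above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3.01, 5.02, 6.99, 8.98, 11.01])
y = np.array([0.040, 0.045, 0.061, 0.066, 0.072])

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.rsquared)   # high R-squared, despite...
print(fit.pvalues)    # ...large p-values on both slope coefficients

# With a regressor correlation of 0.999988, the VIFs are enormous
# (a common rule of thumb flags anything above 10)
for i in (1, 2):
    print(variance_inflation_factor(X, i))
```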
In fact, here is the output from two separate regressions, using each independent variable by itself. Using just _X_1:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.0313 | 0.0034 | 9.0951 | 0.0028 |
| X1 | 0.0085 | 0.0010 | 8.1918 | 0.0038 |
| R2 = 0.957 | | | | |
And just _X_2:
| $$ \, $$ | Coefficient | Standard Error | _t_-statistic | _p_-value |
|---|---|---|---|---|
| Intercept | 0.0270 | 0.0040 | 6.7579 | 0.0066 |
| X2 | 0.0043 | 0.0005 | 8.0381 | 0.0040 |
| R2 = 0.956 | | | | |
Given the very close connection between these two regressors, what do you expect would happen to the R-squared and the remaining coefficient _p_-value if two regressions were performed, one for each of the two independent variables by themselves?
Yes!
Not quite. Remember that there is a lot of significance to be found here, as long as it is properly uncovered.
No. Consider that, in this case, there isn't a lot of information being lost if a regressor is removed. The R-squared shouldn't drop too far.
The R-squared in each case still shows more than 95% of the variation being explained. The _p_-value drops like a rock to a very low level in each case, and the coefficient on _X_2 even switches sign, demonstrating just how unstable the estimates are when multicollinearity is present.
The significance was there all along, but it was hidden by the multicollinearity. In more complex and less obvious situations, you'll notice the same thing: strong overall model significance, but suspiciously weak individual regressors.
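If you want to reproduce those two one-variable regressions yourself, here is a minimal sketch, again assuming statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Rounded data from the tables above
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.array([3.01, 5.02, 6.99, 8.98, 11.01])
y = np.array([0.040, 0.045, 0.061, 0.066, 0.072])

for name, x in (("X1", x1), ("X2", x2)):
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    # Each one-regressor model keeps R-squared above 0.95,
    # and the slope's p-value is now far below 0.05
    print(name, round(fit.rsquared, 3), round(fit.pvalues[1], 4))
```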
To sum up:
[[summary]]
No, that wouldn't happen. There isn't a lot of information being lost if a regressor is removed in this case. The R-squared shouldn't drop too far.
Sure
Not a chance
Yes
Not likely
Neither explains the variation
They explain it well, but only together
They each explain the variation, and to about the same degree
R-squared would drop _slightly_, and the _p_-value of the remaining regressor would be _far_ lower
R-squared would drop _slightly_, and the _p_-value of the remaining regressor would be _slightly_ lower
R-squared would drop _significantly_, and the _p_-value of the remaining regressor would be _far_ lower
R-squared would drop _significantly_, and the _p_-value of the remaining regressor would be _slightly_ lower
Continue
Continue
Continue
Continue
Continue
Continue