Assumptions Underlying Multiple Linear Regression

The aim of multiple linear regression is much like that of single-variable linear regression: find the best coefficient estimates you can. Most assumptions are the same, too. If you're using linear regression, you need a true linear relationship between the dependent variable and each of the regressors (Assumption #1: Linearity). The error term is still expected to be centered around zero and normally distributed (Assumption #2: Normality), to have the same variance everywhere (Assumption #3: Homoskedasticity), and to be uncorrelated across observations (Assumption #4: Independence of Errors).
Now there is an extra assumption about the regressors to consider, since there is more than one (Assumption #5: No multicollinearity). To illustrate, go back to algebra class for a moment. If you were given the line $$\displaystyle y=2x+1$$, you could graph it. Given a second line, you could then find the point where the two intersect. Maybe. But what if $$\displaystyle 2y=4x+2$$ were the second line given to you? You might notice that every term in the first line was doubled to make this second equation. Could you find a solution, a single point of intersection, of these two lines?
Exactly! Good answer.
No, you couldn't find a single point of intersection here. No one could: the two equations describe the same line, so they intersect everywhere rather than at one unique point.
While it's true that there are two equations, there aren't two _different_ equations. One is just a multiple of the other, so either one can be removed without losing any information.
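
If it helps to see this in code, here's a minimal sketch (Python with NumPy is just one way to check it): write both lines as a linear system and ask for the single point of intersection.

```python
import numpy as np

# The two lines, rewritten in the form a*x + b*y = c:
#   y  = 2x + 1   ->  -2x + 1y = 1
#   2y = 4x + 2   ->  -4x + 2y = 2   (every term doubled)
A = np.array([[-2.0, 1.0],
              [-4.0, 2.0]])
c = np.array([1.0, 2.0])

print(np.linalg.matrix_rank(A))  # 1, not 2: the second row adds no new information

try:
    np.linalg.solve(A, c)  # ask for the unique intersection point
except np.linalg.LinAlgError as err:
    print(err)  # "Singular matrix" -- no unique solution exists
```

The solver fails for exactly the reason above: with no new information in the second equation, there is no single intersection point to find.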
In multiple linear regression, the estimation simply cannot run if such an exact linear relationship exists among the regressors. Some software packages will just give you an error message, and others will solve the problem for you. What do you think is a simple solution?
Well, no. That would be like removing both lines. You'd have nothing left and would lose information.
Yes! Just as with two lines offering the same information, you remove one. Again, many software packages do this automatically, since output can't be produced while this problem exists. After all, if an independent variable is a perfect combination of other variables, then it isn't really independent at all.
No, that wouldn't work. This would be a gross oversimplification that would provide a meaningless output.
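
Here's the same failure sketched inside a regression (Python with NumPy; the made-up data and variable names are mine, purely for illustration). With one regressor a perfect multiple of another, the matrix $$X^TX$$ used to estimate the coefficients is singular, and removing one of the pair fixes it:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = 2.0 * x1  # a perfect linear function of x1: not really "independent" at all

# Design matrix with an intercept column plus both regressors
X = np.column_stack([np.ones(20), x1, x2])
print(np.linalg.matrix_rank(X.T @ X))  # 2, but the matrix is 3x3 -> singular, so OLS can't run

# The simple solution: remove one of the linearly related regressors
X_fixed = np.column_stack([np.ones(20), x1])
print(np.linalg.matrix_rank(X_fixed.T @ X_fixed))  # 2 = full rank, so estimation works
```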
Once you have a model that is consistent with all necessary assumptions, you have estimated coefficients that you can use to make predictions. Of course, you have to remember that each estimated parameter is just an estimate, and building a prediction interval with multiple regression isn't simple, but at least you can produce a point estimate. For example, suppose your model of next-period stock index returns was $$\displaystyle \hat{Y} = 0.025 + 0.03X_1 - 0.18X_2 + 1.58X_3$$, and you observe the following values for your independent variables:

| $$X_1$$ | $$X_2$$ | $$X_3$$ |
|---|---|---|
| 0.50 | 0.62 | 0.09 |

What is your best estimate of the next period's stock index return?
Not quite. It's possible that all terms were added rather than the $$X_2$$ term subtracted.
That's right! Just plug everything in. $$\displaystyle \hat{Y} = 0.025 + 0.03(0.50) - 0.18(0.62) + 1.58(0.09) = 0.0706 = 7.06 \% $$
That's not it. It's possible that the intercept term was ignored.
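
For completeness, the same plug-in calculation as a tiny code sketch (Python with NumPy; the layout is mine, not part of the lesson):

```python
import numpy as np

# Estimated coefficients: intercept first, then the slopes on X1, X2, X3
b = np.array([0.025, 0.03, -0.18, 1.58])

# Observed values, with a leading 1 to pair with the intercept
x = np.array([1.0, 0.50, 0.62, 0.09])

y_hat = b @ x          # 0.025 + 0.03(0.50) - 0.18(0.62) + 1.58(0.09)
print(f"{y_hat:.2%}")  # 7.06%
```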
Now suppose that this estimate is fairly accurate. It may be that all of those $$X$$ values are close to where they normally are, and where they were when the parameters were estimated. But what if there is some shock? $$X_1$$ is almost always between 0.35 and 0.65, as it was for the observations used in estimating the model. But suppose next week it suddenly spikes to 1.1. Why might the estimate of next-period stock index returns be weaker?
No, that's not necessarily a problem. Some returns are high, and if a good model can predict that, then everything is OK.
Not exactly. The other regressors aren't required to move with the same magnitude, or even at all. That's not necessary for estimation here.
Right! This value is completely outside the range the model was given, so it's not reasonable to assume the model can use it well in predicting returns. Maybe it can, maybe it can't. But values outside the range used for estimation really need to be treated with caution.
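
One practical guard is to flag any new observation that falls outside the range used for estimation, as in this sketch (Python; only the 0.35–0.65 range for $$X_1$$ comes from the example above — the ranges for $$X_2$$ and $$X_3$$ are assumed purely for illustration):

```python
# Ranges of each regressor seen during estimation. X1's range is from the
# example above; the X2 and X3 ranges are hypothetical placeholders.
TRAINING_RANGES = {"X1": (0.35, 0.65), "X2": (0.40, 0.80), "X3": (0.00, 0.20)}

def out_of_range(new_obs):
    """Return the regressors whose new values fall outside the estimation range."""
    return [name for name, value in new_obs.items()
            if not TRAINING_RANGES[name][0] <= value <= TRAINING_RANGES[name][1]]

print(out_of_range({"X1": 1.10, "X2": 0.62, "X3": 0.09}))  # ['X1'] -> treat with caution
```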
In summary: [[summary]]
No, since there is no new information in the second line
Absolutely, since there are two equations and two unknowns
Remove both linearly related regressors
Remove one of the linearly related regressors
Assume the same coefficient for all regressors
4.56%
7.06%
29.38%
The estimated return will be much higher than average
The other regressors may not be high enough to be comparable
The model was never given an observation like this to use in estimation