Monday, July 6, 2015

Regression with Multicollinearity Yields Multiple Sets of Equally Good Coefficients

The multiple regression equation represents the linear combination of the predictors with the smallest mean-squared error. That linear combination is a factorization of the predictors with the factors equal to the regression weights. You may see the words "factorization" and "decomposition" interchanged, but do not be fooled: they refer to the same thing. The QR decomposition, or factorization, is the default computational method for the linear model function lm() in R. We start our linear modeling by attempting to minimize least-squares error, and we find that a matrix computation accomplishes this task quickly and accurately. Regression is not unique in this respect, for matrix factorization is a computational approach that you will rediscover over and over again as you add R packages to your library (see Two Purposes for Matrix Factorization).
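To make the link between least squares and QR concrete, here is a minimal sketch in plain Python. It uses modified Gram-Schmidt as a simple stand-in, which is not the compiled routine that lm() calls, and the function names and toy data are made up for illustration.

```python
def qr_gram_schmidt(X):
    """Thin QR of X (a list of rows) via modified Gram-Schmidt, so X = Q R.

    Q is returned as a list of orthonormal columns, R as an upper-triangular
    list of rows. No pivoting: a perfectly collinear column would make
    R[j][j] zero and the division below would fail.
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    Q = []
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            # Remove the part of column j that lies along earlier directions
            R[k][j] = sum(q * vi for q, vi in zip(Q[k], v))
            v = [vi - R[k][j] * q for vi, q in zip(v, Q[k])]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5
        Q.append([vi / R[j][j] for vi in v])
    return Q, R


def least_squares(X, y):
    """Least-squares coefficients from back-substitution on R b = Q'y."""
    Q, R = qr_gram_schmidt(X)
    p = len(R)
    qty = [sum(q * yi for q, yi in zip(Q[k], y)) for k in range(p)]
    b = [0.0] * p
    for j in reversed(range(p)):
        s = sum(R[j][k] * b[k] for k in range(j + 1, p))
        b[j] = (qty[j] - s) / R[j][j]
    return b


# Toy data (made up): an intercept column plus one predictor, with y = 1 + 2x
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
coef = least_squares(X, y)
print([round(c, 6) for c in coef])
```

The diagonal of R is where collinearity shows up: the more a predictor can be reproduced from the columns before it, the closer its diagonal entry gets to zero and the less stable the corresponding coefficient becomes.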

Now, multicollinearity may make more sense. Multicollinearity occurs when I have included in the regression equation several predictors that share common variation, so that I can predict any one of those predictors from some linear combination of the other predictors (see tolerance in this link). In such a case, it no longer matters what weights I give individual predictors, for I get approximately the same results regardless. That is, there are many predictor factorizations yielding approximately the same predictive accuracy. The simplest illustration is two highly correlated predictors, for which we obtain nearly equally good predictions using any one predictor alone or any weighted average of the two predictors together. "It don't make no nevermind," for the best solution with the least squares coefficients is not much better than the second best solution or possibly even the 100th best solution. Here, the "best" solution is defined only for this particular dataset, before we ever begin to talk about cross-validation.
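The two-predictor illustration can be sketched with simulated data in plain Python. Everything here is an assumption made for the example: the sample size, the noise levels, and the names x1, x2, and z (a shared source of variation that makes the predictors highly correlated).

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

n = 500
# Hypothetical data: x1 and x2 are both noisy copies of the same source z
z = [random.gauss(0, 1) for _ in range(n)]
x1 = [zi + random.gauss(0, 0.1) for zi in z]
x2 = [zi + random.gauss(0, 0.1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]


def mse(w1, w2):
    """Mean-squared error of the prediction w1*x1 + w2*x2."""
    return sum((yi - w1 * a - w2 * b) ** 2 for yi, a, b in zip(y, x1, x2)) / n


# Very different weightings: one predictor alone, the other alone, and mixtures
candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.8, 0.2), (0.2, 0.8)]
errors = {w: mse(*w) for w in candidates}
for w, e in errors.items():
    print(w, round(e, 3))
```

The printed errors should differ only slightly from one weighting to the next, even though the weights themselves tell very different stories about which predictor "matters."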

On the other hand, when all the predictors are mutually independent, we can speak unambiguously about the partitioning of R-squared. Each independent variable makes its unique contribution, and we can simply add their impacts, for the total is truly the sum of the parts. This is the case with orthogonal experimental designs where one calculates the relative contribution of each factor, as one does in rating-based conjoint analysis where the effects are linear and additive. However, one needs to be careful when generalizing from rating-based to choice-based conjoint models. Choice is not a linear function of utility, so the impact on share from changing any predictor depends on the values of all the predictors, including the predictor being manipulated. Said differently, the slope of the logistic function is not constant but varies with the values of the predictors.
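The point about the logistic slope can be checked with a few lines of plain Python. This is a sketch for a binary logit only; the function names and the default coefficient beta=1.0 are assumptions made for the example.

```python
import math


def share(utility):
    """Binary logit choice probability for a given utility."""
    return 1.0 / (1.0 + math.exp(-utility))


def slope(utility, beta=1.0):
    """Change in share per unit change in a predictor with coefficient beta."""
    p = share(utility)
    return beta * p * (1.0 - p)


# The same coefficient produces different impacts at different utility levels
for u in (-3.0, 0.0, 3.0):
    print(u, round(share(u), 3), round(slope(u), 3))
```

The slope is largest where share is near one half and shrinks toward zero as share approaches either extreme, so the same change in a predictor moves share by different amounts depending on where the other predictors have already placed the utility.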

We will ignore nonlinearity in this post and concentrate on non-additivity. Our concern will be the ambiguity that enters when the predictors are correlated (see my earlier post on The Relative Importance of Predictors for a more complete presentation).

The effects of collinearity are obvious from the formula calculating R-squared from the cells of the correlation matrix between y and the separate x variables. With two predictors, as shown below by the subscripts 1 and 2, we see that R-squared is a complex interplay of the separate correlations of each predictor with y and the interrelationships among the predictors. Of course, everything simplifies when the predictors are independent with r(1,2)=0 and the numerator reducing to the sum of the squared correlations of each predictor with y divided by a denominator equal to one.
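For reference, the standard two-predictor result is written out here. The subscript notation is an assumption chosen to match the description: r_{y1} and r_{y2} are the correlations of each predictor with y, and r_{12} is the correlation between the two predictors, written r(1,2) in the text.

```latex
R^2 = \frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}
```

Setting r_{12} = 0 removes the cross-product term and turns the denominator into one, leaving R^2 = r_{y1}^2 + r_{y2}^2.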

The formulas for the regression coefficients mirror the same "adjustment" process. If the correlation between the first predictor and y represents the total effect of the first variable on y, then the beta weight shows the direct effect of the first variable after removing its indirect path through the second predictor. Again, when the predictors are uncorrelated, the beta weight equals the correlation with y.
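In standard notation, the standardized (beta) weights for two predictors can be written as follows. The symbols are assumptions chosen to match the text: r_{y1} and r_{y2} are the correlations of each predictor with y, and r_{12} is the correlation between the predictors.

```latex
\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2},
\qquad
\beta_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2},
\qquad
R^2 = \beta_1\, r_{y1} + \beta_2\, r_{y2}
```

The numerator of each beta subtracts the path that runs through the other predictor, and with r_{12} = 0 each beta reduces to the predictor's own correlation with y.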

We speak of this adjustment as controlling for the other variables in the regression equation. Since we have only two independent variables, we can talk of the effect of variable 1 on y controlling for variable 2. Such a practice seems to imply that the meaning of variable 1 has not been altered by controlling for variable 2. We can be more specific by letting variable 1 be a person's weight, variable 2 be a person's height and the dependent variable be some measure of health. What is the contribution of weight to health controlling for height? Wait a second, weight controlling for height is not the same variable as weight. We have a term for that new variable; we call it obesity. Simply stated, the meaning of a term changes as we move from the marginal (correlations) to conditional (partial correlations) representations.
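A small simulation in plain Python shows that "weight controlling for height" really is a different variable from weight. The heights and weights below are invented numbers with assumed means and spreads, and the helper names are hypothetical.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable


def mean(v):
    return sum(v) / len(v)


def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def residualize(target, control):
    """Residuals from the simple regression of target on control."""
    mt, mc = mean(target), mean(control)
    slope = (sum((c - mc) * (t - mt) for c, t in zip(control, target))
             / sum((c - mc) ** 2 for c in control))
    return [t - mt - slope * (c - mc) for t, c in zip(target, control)]


n = 1000
height = [random.gauss(170, 10) for _ in range(n)]                   # cm, made up
weight = [70 + 0.9 * (h - 170) + random.gauss(0, 8) for h in height]  # kg, made up

# "Weight controlling for height": what is left of weight once height is removed
weight_given_height = residualize(weight, height)

print("weight vs height:          ", round(corr(weight, height), 3))
print("adjusted weight vs height: ", round(corr(weight_given_height, height), 3))
print("adjusted weight vs weight: ", round(corr(weight_given_height, weight), 3))
```

The adjusted variable is uncorrelated with height by construction and only partly correlated with raw weight, which is why its coefficient answers a question about something closer to obesity than to weight itself.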

None of this is an issue when our goal is solely prediction. Yet, the human desire to control and explain is great, and it is difficult to resist the temptation to jump from association to causal inference. The key is not to accept the data as given but to search for a representation that enables us to estimate additive effects. One alternative treats observed variables as the bases for latent variable regression in structural equation modeling. Another approach, nonnegative matrix factorization (NMF), yields a representation in terms of building blocks that can additively be combined to form relatively complex structures. The model does not need to be formulated as a matrix factorization problem in order for these computational procedures to yield solutions.


  1. Which "none" are you speaking of when you say 'none of this is an issue for prediction'? I would argue multicollinearity remains an issue in prediction because its presence forces the new data to have more of the same structure than if you drop the collinear variables. The risks in deploying a model with multicollinearity are higher, so it's better to remove the variables with low information gain.

    1. Any reader interested in the distinction between prediction and explanation should read the paper by Galit Shmueli. The article “To Explain or To Predict?” and an accompanying video can be found at her website,