Unfortunately, we are now measuring something different. After row-centering, individuals with high product involvement who place considerable importance on all the purchase criteria have the same rating profiles as more casual users who seldom attend to any of the details. In addition, by forcing the mean for every consumer to equal zero, we have created a linear dependency among the p variables. That is, we started with p separate ratings that were free to vary and added the restriction that the p deviation scores sum to zero. We lose one degree of freedom when we compute scores as deviations about the mean (just as we lose one df and divide by n-1 rather than n in the denominator of the standard deviation). The result is a singular correlation matrix that can no longer be inverted.
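To make that linear dependency concrete, here is a small sketch of my own (three made-up ratings, not data from any study): once each row is centered, any one deviation score is completely determined by the others.

# three hypothetical ratings from two respondents
ratings <- rbind(c(7, 5, 3),
                 c(9, 9, 6))

# subtract each respondent's own mean (row-centering)
centered <- ratings - apply(ratings, 1, mean)
rowSums(centered)                  # every row now sums to zero

# so the third deviation score is just the negative
# sum of the first two; one degree of freedom is gone
-(centered[, 1] + centered[, 2])
centered[, 3]                      # identical values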
Seeing is believing
The most straightforward way to show the effects of row-centering is to generate some multivariate normal data without any correlation among the variables, calculate the deviation scores about each row mean, and examine any impact on the correlation matrix. I would suggest that you copy the following R code and replicate the analysis.
library(MASS)

# p is the number of variables
p <- 11

# simulates 1000 rows with
# means = 5 and std deviations = 1.5
x <- mvrnorm(n = 1000, rep(5, p), diag((1.5^2), p))
summary(x)
apply(x, 2, sd)

# calculate correlation matrix
R <- cor(x)

# correlations after columns centered
# i.e., column means now 0's not 5's
x2 <- scale(x, scale = FALSE)
summary(x2)
apply(x2, 2, sd)
R2 <- cor(x2)
round(R2 - R, 8)   # identical matrices

# checks correlation matrix singularity
solve(R)

# row center the ratings
x_rowcenter <- x - apply(x, 1, mean)
RC <- cor(x_rowcenter)
round(RC - R, 8)   # uniformly negative

# row-centered correlations singular
solve(RC)

# original row means normally distributed
hist(round(apply(x, 1, mean), 5))

# row-centered row means = 0
table(round(apply(x_rowcenter, 1, mean), 5))

# mean lower triangular correlation
mean(R[lower.tri(R, diag = FALSE)])
mean(RC[lower.tri(RC, diag = FALSE)])

# average correlation = -1/(p-1)
# for independent variables
# where p = # columns
-1/(p - 1)
I have set the number of variables p to 11, but you can change that to any number. The last line of the R code tells us that the average correlation in a set of p independent variables will equal -1/(p-1) after row-centering. The R code enables you to test that formula by manipulating p. In the end, you will discover that the impact of row-centering is greatest when there are only a few uncorrelated variables. Of course, we do not anticipate independent measures, so it might be better to think in terms of underlying dimensions rather than number of columns (e.g., the number of principal components suggested by the scree test). If your 20 ratings tap only 3 underlying dimensions, then the p in our formula might be closer to 3 than to 20.
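If you would like to see how this depends on p, the following variation on the code above (my own addition, not part of the original analysis) repeats the simulation for several values of p and compares the average row-centered correlation with -1/(p-1).

library(MASS)

# average off-diagonal correlation after row-centering
# p independent ratings (means 5, sd 1.5, as above)
avg_rowcentered_cor <- function(p, n = 1000) {
  x <- mvrnorm(n, rep(5, p), diag(1.5^2, p))
  xc <- x - apply(x, 1, mean)
  RC <- cor(xc)
  mean(RC[lower.tri(RC)])
}

set.seed(1234)
for (p in c(3, 5, 11, 20)) {
  cat("p =", p,
      "  simulated:", round(avg_rowcentered_cor(p), 3),
      "  -1/(p-1):", round(-1/(p - 1), 3), "\n")
}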
At this point, you should be asking how we can run a regression or factor analysis when the correlation matrix is singular. Well, sometimes we will get a warning, depending on how much rounding error the package tolerates. Certainly, the solve function was up to the task of flagging the singularity. My experience has been that factor analysis packages tend to be more forgiving than multiple regression. Regardless, a set of variables constrained to sum to a constant cannot be treated as if the measures were free to vary (e.g., market share, forced choice, rankings, and compositional data).
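As a rough illustration of that difference in tolerance (a sketch I have added, using principal components as a stand-in for factoring), lm() is forced to drop one aliased predictor when all the row-centered ratings enter a regression, while prcomp() runs without complaint and simply reports a final eigenvalue of essentially zero.

library(MASS)

set.seed(5678)
p <- 11
x <- mvrnorm(1000, rep(5, p), diag(1.5^2, p))
x_rowcenter <- x - apply(x, 1, mean)
y <- rnorm(1000)                   # an arbitrary outcome

# regression: the last predictor is aliased (coefficient = NA)
fit <- lm(y ~ x_rowcenter)
coef(fit)

# principal components: runs fine, last variance is essentially zero
pca <- prcomp(x_rowcenter)
round(pca$sdev^2, 6)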
R deals with such constrained data through the compositions package. The composition of an entity is represented by the percentages of the elements it contains; the racial or religious composition of a city, for example, can be the same for cities of very different sizes. It does not matter whether the constrained sum is zero, one, or a hundred. Forced-choice tasks, such as MaxDiff (Sawtooth's MaxDiff = 100*p), and ranking tasks (the ranks always sum to the same constant) are likewise forced to add to a constant.
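For readers who want to experiment, below is a minimal sketch of how such data might be handled with the compositions package (assuming the package is installed; acomp() and clr() are, as I understand its interface, the composition constructor and the centered log-ratio transform). Note that the clr scores, like row-centered ratings, sum to zero within each row, so the same singularity applies.

# install.packages("compositions")   # if not already installed
library(compositions)

# hypothetical market shares for three brands in four markets
shares <- rbind(c(0.50, 0.30, 0.20),
                c(0.40, 0.40, 0.20),
                c(0.25, 0.25, 0.50),
                c(0.60, 0.10, 0.30))

comp <- acomp(shares)        # treat each row as a composition
clr_scores <- clr(comp)      # centered log-ratio transform

# each row of clr scores sums to zero, just like row-centered ratings
rowSums(unclass(clr_scores))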
I have attempted to describe some of the limitations associated with such approaches in an earlier warning about MaxDiff. Clearly, the data analysis becomes more complicated when we place a priori restrictions on the combined values of a set of variables, which suggests that we should be certain our research question actually requires us to force a choice or to row-center.