Monday, May 11, 2015

Centering and Standardizing: Don't Confuse Your Rows with Your Columns

R uses the generic scale( ) function to center and standardize variables in the columns of data matrices. The argument center=TRUE subtracts the column mean from each score in that column, and the argument scale=TRUE divides by the column standard deviation (TRUE are the defaults for both arguments). For instance, weight and height come in different units that can be compared more easily when transformed into standardized deviations. Since such a linear transformation does not alter the correlations among the variables, it is often recommended so that the relative effects of variables measured on different scales can be evaluated. However, this is not the case with the rows.

A concrete example will help. We ask a group of consumers to indicate the importance of several purchase criteria using a scale from 0=not at all important to 10=extremely important. We note that consumers tend to use only a restricted range of the scale with some rating all the items uniformly higher or lower. It is not uncommon to interpret this effect as a measurement bias, a preference to use different portions of the scale. Consequently, we decide to "correct for scale usage" by calculating deviation scores. First, we compute the mean score for each respondent across all the purchase criteria ratings, and then we subtract that mean from each rating in that row so that we have deviation scores. The mean of each consumer or row is now zero. Treating our transformed scores like any other data, we run a factor analysis using our "unbiased" deviation ratings.

Unfortunately, we are now measuring something different. After row-centering, individuals with high product involvement who place considerable importance on all the purchase criteria have the same rating profiles as those more casual users who seldom attend to any of the details. In addition, by forcing the mean for every consumer to equal zero, we have created a linear dependency among the p variables. That is, we started with p separate ratings that were free to vary and added the restriction that the p variables sum to zero. We lose one degree of freedom when we compute scores that are deviations about the mean (as we lose one df in the denominator for the standard deviation and divide by n-1 rather than n). The result is a singular correlation matrix that can no longer be inverted.

Seeing is believing

The most straightforward way to show the effects of row-centering is to generate some multivariate normal data without any correlation among the variables, calculate the deviation scores about each row mean, and examine any impact on the correlation matrix. I would suggest that you copy the following R code and replicate the analysis.

# p is the number of variables
# simulates 1000 rows with
# means=5 and std deviations=1.5
apply(x, 2, sd)
# calculate correlation matrix
# correlations after columns centered
# i.e., column means now 0's not 5's
x2<-scale(x, scale=FALSE)
apply(x2, 2, sd)
round(R2-R,8) # identical matrices
# checks correlation matrix singularity
# row center the ratings
x_rowcenter<-x-apply(x, 1, mean)
round(RC-R,8) # uniformly negative
# row-centered correlations singular
# orginal row means normally distributed
# row-centered row means = 0
# mean lower triangular correlation
mean(R[lower.tri(R, diag = FALSE)])
mean(RC[lower.tri(R, diag = FALSE)])
# average correlation = -1/(p-1)
# for independent variables
# where p = # columns

[Created by Pretty R at]

I have set the number of variables p to be 11, but you can change that to any number. The last line of the R code tells us that the average correlation in a set of p independent variables will equal -1/(p-1) after row-centering. The R code enables you to test that formula by manipulating p. In the end, you will discover that the impact of row-centering is greatest with the fewest number of uncorrelated variables. Of course, we do not anticipate independent measures so that it might be better to think in terms of underlying dimensions rather than number of columns (e.g., the number of principal components suggested by the scree test). If your 20 ratings tap only 3 underlying dimensions, then the p in our formula might be closer to 3 than 20.

At this point, you should be asking how we can run regression or factor analysis when the correlation matrix is singular. Well, sometimes we will get a warning, depending on the rounding error allowed by the package. Certainly, the solve function was up to the task. My experience has been that factor analysis packages tend to be more forgiving than multiple regression. Regardless, a set of variables that are constrained to sum to any constant cannot be treated as if the measures were free to vary (e.g. market share, forced choice, rankings, and compositional data).

R deals with such constrained data using the compositions package. The composition of an entity is represented by the percentages of the elements it contains. The racial or religious composition of a city can be the same for cities of very different sizes. However, it does not matter whether the sum is zero, one or a hundred. Forced choice, such as MaxDiff (Sawtooth's MaxDiff=100*p), and ranking tasks (sum of the ranks) are also forced to add to a constant.

I have attempted to describe some of the limitations associated with such approaches in an earlier warning about MaxDiff. Clearly, the data analysis becomes more complicated when we place a priori restrictions on the combined values of a set of variables, suggesting that we may want to be certain that our work requires us to force a choice or row center.

No comments:

Post a Comment