Monday, April 8, 2013

Halo Effects vs. Intention-Laden Ratings: Separating Baby and Bathwater

Are halo effects real or illusory?  Much has been written arguing that rating scales contain extensive amounts of measurement bias.  Some tell us to avoid ratings altogether (What do customers really want?).  Others warn against using rating scales without major adjustments (e.g., overcoming scale usage heterogeneity with the R package bayesm).  Obviously, by invoking the baby-and-bathwater idiom, I believe that there may be something "real" in those halo effects - real in the sense that a tendency to rate all the items higher or all the items lower may tell us much about one's intentions toward the entity being rated.

A Concept from Performance Appraisal

The concept of a halo effect flows from a dualism inherent in some theories of human perception.  There is a world independent of the observer, and there are reports about that world from human informants.  It is not an accident that the first use of the term "halo effect" came from human resources.  Personnel decisions are supposedly based on merit.  An employee's performance is satisfactory or it is not.  We do not collect supervisor ratings because we care about the feelings of the supervisor.  The supervisor is merely the measuring instrument, and in such cases, halo effects seem to be a form of measurement bias.  To be clear, halo effects are only bias when the informant is the measurement instrument and nothing more.  If we cared about the feelings of our observers, then removing the "halo effect" would constitute throwing the baby out with the bathwater.

Why Intention-Laden Ratings?

The phrase in the title, intention-laden ratings, comes from N. R. Hanson's work on theory-laden observation.  According to Hanson, when Galileo looked at the moon, he did not see mere discontinuities on the lunar surface.  He saw craters; he saw the quick and violent origins of these formations.  Similarly, the phrase "intention-laden ratings" is meant to suggest that ratings are not simply descriptive. Observations are not theory-free, and ratings are not intention-free. Perception serves action, and ratings reflect intentions to act. Failure to understand this process dooms us to self-defeating attempts to control for halo effects when what is called "halo" is the very construct that we are trying to measure.

We seem to forget that we are not reading numbers off a ruler. People provide ratings, and people have intentions. I intend to continue using R for all my statistical analysis. I climbed up that steep learning curve, and I am not going back to SPSS or SAS. Still, I would not give top-box ratings to every query about R. R has its own unique pattern of strengths and weaknesses, its own "signature." There may be considerable consensus among R users because of common experiences learning and using R and because we all belong to an R community that shares and communicates a common view. Moreover, I would expect that those who have not made a commitment to R see much the same relative pattern of pros and cons. They would, however, tend to give lower ratings across all the items.  What some might call "halo" in brand ratings can more accurately be described as intention-laden ratings, in particular, a commitment or intention to continue using the brand.  If it helps, think of intention-laden ratings as a form of cognitive dissonance reduction: having committed to R, I rate it in ways consistent with that commitment.

R Enables Us to Provide an Example

An example might help explain my point.  Suppose that we collected five-point satisfaction ratings from 100 individuals who responded "yes" when asked if they use R for at least some portion of their statistical computing.  Let us say that we asked for nine ratings tapping the features that one might mention when recommending R or when suggesting some other statistical software.  Here is a summary of the nine ratings.

Number of Respondents Giving Each Score (1 to 5)
var  mean    sd    1    2    3    4    5
  1  4.59  0.89    2    2    9    9   78
  2  4.46  1.08    4    4    9    8   75
  3  4.57  0.84    1    2   11   11   75
  4  3.79  1.30    6   12   25   11   46
  5  3.84  1.24    7    4   31   14   44
  6  3.64  1.35   12    5   28   17   38
  7  3.52  1.42   10   19   18   15   38
  8  3.60  1.37   12   10   19   24   35
  9  3.53  1.32   11    9   28   20   32

This is what you generally find from the users of a brand.  These users are satisfied with R, as you can see from the large numbers in the top-box column (labeled "5").  R seems to be doing particularly well on the first three rating items.  However, there seem to be some users who are less happy with the features tapped by the later ratings, especially the last three or four items.  One should also note that the range of the means is somewhat limited (4.59 - 3.52, about one point on a five-point scale).  I draw your attention to this because many do not realize that this one-point difference in the means represents more than a doubling of the percentage of top-box ratings (78/38 = more than 2).
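If you would like to verify those two observations yourself, a couple of lines of base R will do it.  This is just a sketch; it assumes the nine ratings sit in an object named ratings, matching the script listed at the end of this post.

 #column means and the percentage of respondents giving the top-box score of 5
 item_means <- colMeans(ratings)
 top_box_pct <- 100 * colMeans(ratings == 5)
 round(rbind(mean = item_means, top.box = top_box_pct), 2)
 #a roughly one-point spread in the means corresponds to roughly a doubling of the top-box percentage (78% vs. 38%)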

I have not provided labels or names for the nine ratings.  Can you?  What are the attributes getting the highest ratings?  Is it cost, availability, or graphics?   It certainly is not any of the features mentioned by Patrick Burns in his R Inferno.  No one is suggesting that those incomprehensible error messages are a plus for R.  At the other end of the list, the lowest ratings might reflect ease of learning, documentation, or the absence of a GUI.  Of course, none of this is unique to R.  What are the pros and cons of using SPSS or SAS or any other statistical software?

Forget software for a moment: what does a smartphone do well, and where does one experience problems?  Ask users why they use an iPhone, and ask nonusers why they do not.  Now ask both buyers and rejecters to rate all the attributes just mentioned.  The ratings from both groups will follow the same rank order; however, the buyers will give uniformly higher ratings across all the attributes.  Every product has its own unique pattern of good and bad points.  It is all on the web in product reviews and social media.

Now, where's the halo effect?  We will need to look at the interconnections among the ratings, for example, the correlation matrix.

var    1    2    3    4    5    6    7    8    9
  1 1.00 0.36 0.41 0.54 0.21 0.28 0.27 0.37 0.33
  2 0.36 1.00 0.36 0.31 0.24 0.45 0.36 0.30 0.35
  3 0.41 0.36 1.00 0.27 0.20 0.31 0.20 0.26 0.32
  4 0.54 0.31 0.27 1.00 0.44 0.33 0.29 0.49 0.53
  5 0.21 0.24 0.20 0.44 1.00 0.27 0.26 0.30 0.45
  6 0.28 0.45 0.31 0.33 0.27 1.00 0.47 0.49 0.54
  7 0.27 0.36 0.20 0.29 0.26 0.47 1.00 0.38 0.51
  8 0.37 0.30 0.26 0.49 0.30 0.49 0.38 1.00 0.41
  9 0.33 0.35 0.32 0.53 0.45 0.54 0.51 0.41 1.00

What should you look for in this correlation matrix?  First, the correlations are all positive, which suggests that there might be a strong first principal component.  However, the correlations are not uniform; they range from 0.20 to 0.54.  The correlations among the last four ratings seem somewhat higher than the others.  Still, we should not forget that these are the ratings with some of the largest standard deviations.  When we examined the frequency distributions earlier, we saw the sizable top-box percentages for the first few ratings.  Such restriction of range can attenuate correlations.  That is, the correlations might have been more uniform had we not encountered such severe truncation at the upper end of the scale for the first few ratings.
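Here is a small, self-contained illustration of that attenuation (it has nothing to do with the nine ratings above): two pairs of five-point items share the same underlying correlation, but the pair squashed against the top of the scale shows a noticeably smaller observed correlation.

 #illustration only: ceiling effects attenuate observed correlations
 set.seed(123)
 n <- 1000
 latent <- rnorm(n)
 x <- latent + rnorm(n)   #two noisy copies of the same latent variable
 y <- latent + rnorm(n)
 easy_cuts <- c(-Inf, -2.5, -2.0, -1.5, -1.0, Inf)   #most responses land in the top box
 hard_cuts <- c(-Inf, -1.0, -0.3,  0.3,  1.0, Inf)   #responses spread across the scale
 cor(cut(x, hard_cuts, labels = FALSE), cut(y, hard_cuts, labels = FALSE))   #close to the uncut correlation
 cor(cut(x, easy_cuts, labels = FALSE), cut(y, easy_cuts, labels = FALSE))   #smaller because most responses are 5s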

A factor analysis might help us uncover the structure underlying these correlations.  We start by looking at the first principal component.  It accounts for 43% of the total variation.  A strong first principal component reflects our finding that all the correlations were positive and many were sizable (between 0.20 and 0.54).  As the percentage of the total variation accounted for by the first principal component grows, we will become more convinced that the perceptions of R can be explained by a single evaluative judgment. 
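The 43% figure can be checked directly from the eigenvalues of the correlation matrix.  A minimal sketch in base R, again assuming the ratings object from the script at the end:

 #proportion of total variation captured by each principal component
 ev <- eigen(cor(ratings))$values
 round(100 * ev / sum(ev), 1)   #the first value is the share claimed by the first principal component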

Although I have used the term "factor analysis," the pattern matrix below with standardized loadings came from a principal component analysis with a varimax rotation of the first three principal components.

    PC1  PC3   PC2
1  0.04 0.39  0.75
2  0.57 0.00  0.52
3  0.20 0.04  0.78
4  0.12 0.78  0.38
5  0.19 0.76 -0.02
6  0.79 0.19  0.20
7  0.78 0.21  0.04
8  0.44 0.48  0.25
9  0.58 0.56  0.14

                PC1  PC3  PC2
SS loadings    2.18 1.97 1.71
Proportion Var 0.24 0.22 0.19
Cumulative Var 0.24 0.46 0.65

The first three principal components account for 65% of the variation.  We seem to be able to separate the nine ratings into three overlapping groups:  the first three ratings (1-3), the next two ratings (4-5), and the last four ratings (6-9).  However, there are a number of ratings with loadings on two varimax-rotated principal components (e.g., variables 2, 8, and 9).  So, what do we have here?  We can argue for the existence of three separate and orthogonal components by discounting the dual loadings of these three ratings as poorly written items.  For instance, we might claim that if the items had been written more clearly, they would have loaded on only one of the rotated principal components.  Still, the factor structure is not definitive and requires a considerable amount of "explaining away" in order to maintain a three-factor interpretation.
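For reference, the rotated loadings shown above come from a single call to the psych package (varimax is the default rotation for principal); the same call appears in the full script at the end of the post.

 #three varimax-rotated principal components
 library(psych)
 principal(ratings, nfactors = 3, rotate = "varimax")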

So far, we have run the traditional analyses.  Let us finish with a bifactor rotation that seeks to fit a factor model with both a general halo effect and specific orthogonal factors.  You can think of it as removing the general factor and then running a second factor analysis on whatever correlation remains.  You can find the details in a previous post.  You should note in the figure below that all nine ratings have general factor loadings from g.  In addition, we have the contribution of specific orthogonal factors F1*, F2*, and F3*.  All nine ratings have some correlation due to a common general factor g, and some ratings have even higher correlations because of the specific factors.  We still do not know if that general factor represents something real and important (baby) or can be attributed to measurement bias (bathwater).  All we have is the correlation matrix, and the correlation matrix alone will not resolve such indeterminacy.
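The bifactor solution and its diagram come from the omega function in the psych package.  A minimal version of the call, which also appears in the script at the end, looks like this:

 #bifactor solution: a general factor g plus specific factors F1*, F2*, and F3*
 bifactor <- omega(ratings, nfactors = 3, plot = FALSE)
 bifactor                            #loadings on g and on the specific factors
 omega.diagram(bifactor, main = "")  #draws the bifactor path diagram described in the text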



The "Truth" is Revealed

All of the R code needed to replicate this analysis is listed at the end of this post.  I have held back one detail until now: the data are not real but were simulated using a function from the R package ltm.  I used a graded response model to generate the nine ratings.  However, instead of describing how the data were generated, I will first fit a graded response model to the nine ratings.  Once you see the "results," it will be easier to understand how the data were created.

I introduced the graded response model in an earlier post.  It is relatively straightforward.  We start with an underlying dimension along which respondents can be arrayed.  This is an individual difference dimension, so you should be thinking about seriation, ordination, ranking, or just lining up single file.  For example, I could ask my 100 respondents to stand in a line from shortest to tallest or from least smart to most smart.  Height is an individual difference dimension, and so is intelligence.  But height can explain behavior, such as Einstein's inability to reach a book on the upper shelf of his bookcase.  Saying that Einstein was not smart enough to solve a physics problem explains nothing, because the individual difference dimension is constructed from the very behaviors it is being asked to explain.  [For a more complete treatment, see Borsboom, The Theoretical Status of Latent Variables, 2003.]

Failure to understand this distinction will result in confusion.  Individual difference dimensions are constructed to differentiate among individuals, and the individuals located at different points along the continuum are often very different from each other.  Consider the spelling bee.  The knowledge and skill needed to spell easy words are very different from the knowledge and skill needed to spell difficult words.  Yet we can array individuals along a dimension of spelling ability, understanding that very different cognitive processing is being used by spellers of different ability at different points along the scale.  Spelling is just one example.  You can substitute your favorite comparative dimension and ask whether the same criteria are used to differentiate among individuals at the lower and upper ends of the measure (e.g., physical fitness, cooking skills, or programming ability).

Returning to R, individuals with more negative opinions of R have different knowledge and usage experiences than those having favorable impressions.  Thus, if I wanted to differentiate among all respondents regardless of how much they liked or disliked R, I would need to include both some very "easy" items (e.g., downloading and installing) and an array of more "difficult" ratings (e.g., debugging and interpreting error messages).  When I say easy, I mean easy for R to achieve high levels of satisfaction among users.  If a respondent does not believe that R is easy to download and install, they do not have a favorable impression of the software.

This is a latent trait or item response model.  Individuals can be arrayed along a single dimension, and ratings items can be placed along that same dimension.  Actually, it is not the item that is placed along the continuum, but the separate scale levels.  According to the model, an individual with a certain level of R favorableness reads an item and decides which score best reflects their assessment using the five-point scale.  The pro-R individual might look at the top-box and say "Yes, that describes me" or "No, a four is more accurate."  It is all easier to understand if we look at some coefficients.
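The table below is simply the printed output from fitting a graded response model with the ltm package, using the same call that appears in the script at the end of the post.

 #fit a graded response model to the nine five-point ratings
 library(ltm)
 model <- grm(ratings)
 model   #prints the extremity (Extrmt1-Extrmt4) and discrimination (Dscrmn) estimates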

        Extrmt1  Extrmt2  Extrmt3  Extrmt4  Dscrmn
Item 1    -2.47    -2.14    -1.46    -1.03    2.32
Item 2    -2.24    -1.80    -1.24    -0.91    2.06
Item 3    -3.34    -2.68    -1.58    -1.02    1.73
Item 4    -2.02    -1.22    -0.32     0.06    1.96
Item 5    -2.51    -2.10    -0.43     0.18    1.27
Item 6    -1.60    -1.31    -0.20     0.41    1.85
Item 7    -2.00    -0.91    -0.16     0.44    1.48
Item 8    -1.62    -1.10    -0.37     0.50    1.79
Item 9    -1.57    -1.10    -0.07     0.61    2.11

All these coefficients (except Dscrmn, the discrimination index; see the previous post for details) can be interpreted as z-scores with mean = 0 and standard deviation = 1.  They are the cutoffs for assigning scores to the rating items.  If I had a very negative impression of R, say two standard deviations below the mean, what ratings would I give?  Item #1 requires a z-score below -2.47 before a "1" is assigned.  I would not give a "2" because that score is assigned only if the z-score falls between -2.47 and -2.14.  So, I would give a "3" to Item #1 because -2 falls between -2.14 and -1.46.  Item 2 gets a "2," and so on.  The last few items are all given ones.

To test your understanding, how would a person with a z-score of +1 rate the nine items?   It should be clear that such an individual would give fives to all nine items for a total score of 45.  And that is what 10% of our respondents did.  Our nine items were not very "difficult" because even an average person with z = 0 would have given three 5's and six 4's.  We seem to have low standards and some serious grade inflation.  It is easy for R to do well when these nine items are the criteria.  We need some more difficult items, meaning items to which fewer respondents will give top-box ratings.  As in spelling bees, the difficult items that differentiate users at the upper end may look very different from the items that differentiate users in the middle of the scale.
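If you want to check this scoring logic yourself, the short helper below is my own illustration (it is not part of the ltm package).  It treats the extremity parameters as hard cutoffs, exactly as the last two paragraphs do, and returns the rating implied by a given z-score.

 #hypothetical helper: deterministic category assignment from the extremity parameters
 #(the fitted model is probabilistic; this simply mirrors the cutoff reasoning in the text)
 implied_rating <- function(z, cutoffs) 1 + sum(z > cutoffs)
 item1 <- c(-2.47, -2.14, -1.46, -1.03)   #extremity parameters copied from the table above
 item9 <- c(-1.57, -1.10, -0.07,  0.61)
 implied_rating(-2, item1)   #3, as argued for the respondent two standard deviations below the mean
 implied_rating(-2, item9)   #1
 implied_rating( 1, item9)   #5, the person one standard deviation above the mean tops out every item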

Where Did the Factors Come From?

You might be wondering where the three factors came from if the data were generated using a single dimension.  First, the factor structure is not the simple structure shown in textbook discussions of varimax rotations; one must ignore the fact that one-third of the items load on multiple factors.  Second, although the match is far from perfect, there is a correspondence between the factors and differences in item difficulty.  Correlations are sensitive to skewness in the distributions of the variables, so one can mistakenly conclude that one has found separate factors when the differences among the correlations are actually the result of differences in item means creating skewed distributions.

One last point needs to be made.  I wanted this data set to be realistic, so I did not assume that all customers were homogeneous.  It is more likely that the customer base for any product is composed of different customer types:  the hostages who are unhappy but cannot switch, the loyals who find special joy in the product, and the largest group, customers who are satisfied enough that they are neither actively looking for alternatives nor acting as brand advocates.

The rmvordlogis function that generated the 100 respondents takes cutpoints for the items in an argument called thetas and a set of latent trait scores for respondents in an argument called z.vals.  You can see in the code below that my z is a mixture of three different normal distributions on the underlying latent trait.  The first group, z1, with 65 of the 100 respondents, represents the normally distributed satisfied customer with mean = 0 and standard deviation = 1.  Then I added 5 respondents in z2 who were hostages with mean = -2 and a smaller standard deviation of 0.5.  Finally, I included 30 loyal customers in z3 with mean = 2 and SD = 0.5.  This was done in order to produce a more realistic distribution that is shifted toward the positive end of the latent trait.  Only when the switching costs are excessive do we find customer bases with sizable percentages of unhappy customers.
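If you run the script below, a one-line histogram (base R, purely optional) shows the shape of this mixed latent trait: a large satisfied group centered at zero, a small cluster of hostages near -2, and a bump of loyal customers near +2.

 hist(z, breaks = 20, xlab = "latent trait z", main = "Three customer segments on one latent trait")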

Saving the Baby

One needs to be careful when generalizing from performance appraisals, which seek an unbiased observer, to ratings of brand performance, where the so-called "bias" is the very thing being measured.  Controlling for halo effects might make sense when we wish to exclude the observer's feelings.  Brand ratings are not one of those cases.  The general factor is not measurement bias (the halo effect) but brand equity, and that is what we wish to measure.  If we begin with a clear conception of the underlying construct, we will not make this mistake.

Measurement starts with a statistical model seeking to capture the data generation process.  As we have seen, three forces are at work generating brand ratings:  the product, the user, and the environment.  First, the product is real, with affordances and constraints.  Every product has its own profile of strengths and weaknesses that exists independent of user perception.  Second, customers may reach different overall product evaluations since each has his or her own needs and usage experiences.  However, such differences can be accommodated by shifting the overall level of all the ratings; there is no need to change the relative rankings of the features.  The product remains more Feature A than Feature B even when both features receive lower ratings.  Third, all of this is shared through messaging from the brand, reviews in the press, and talk on social media.  The result is a common view of the strengths and weaknesses of each product, which I have called the "brand signature."  Not everyone will give the same ratings, but those ratings will follow the same pattern of higher and lower relative scores.

Finally, I generated a data set using the graded response model and showed how it produces results not unlike what we usually see when we collect brand ratings.  It is the case that the product features or services that are easy for a brand to provide differ in kind from those that are difficult for a brand to provide.  For example, most would agree that R scores well on availability.  There are no payments that need to be authorized.  It is easy to download the base package and install updates or additional packages.  But the documentation is not quite at the level of an SPSS or a SAS.  Yet documentation and availability are not two separate individual difference dimensions, because availability will always be rated higher than documentation. Your score on the single latent trait and the relative position of the items on that same scale are sufficient to predict all your ratings. Documentation and availability denote different product aspects that function together as a single entity to differentiate satisfied from unsatisfied users.

Individual difference dimensions are heterogeneous because they attempt to differentiate at different ends of the scale. It is easy to misinterpret these mean level differences as if they were different factors or latent variables. We want to avoid making this mistake. The graded response model is the correct model specification for such individual difference dimensions.


 # Need two R packages
 library(psych)
 library(ltm)
  
 #use a function from ltm to generate ordinal responses
 #each thetas[[i]] holds an item's four cutpoints followed by its discrimination (1.2 for every item)
 thetas<-vector("list", 9)
 thetas[[1]]<-c(-4.0, -3.1, -2.1, -1.1, 1.2)
 thetas[[2]]<-c(-4.1, -3.2, -2.2, -1.2, 1.2)
 thetas[[3]]<-c(-3.9, -2.9, -1.9, -1.0, 1.2)
 thetas[[4]]<-c(-2.9, -1.9, 0.1, 1.1, 1.2)
 thetas[[5]]<-c(-2.8, -1.8, 0.2, 1.2, 1.2)
 thetas[[6]]<-c(-2.7, -1.7, 0.3, 1.3, 1.2)
 thetas[[7]]<-c(-2.1, -1.1, 0.1, 1.5, 1.2)
 thetas[[8]]<-c(-2.2, -1.2, 0.2, 1.6, 1.2)
 thetas[[9]]<-c(-2.3, -1.3, 0.3, 1.7, 1.2)
  
 #set seed for replication
 set.seed(4972)
  
 #latent trait is a mixture of three customer segments
 z1<-rnorm(65)           #65 satisfied customers: mean=0, sd=1
 z2<-rnorm(5, -2, 0.5)   #5 hostages: mean=-2, sd=0.5
 z3<-rnorm(30, 2, 0.5)   #30 loyal customers: mean=2, sd=0.5
 z<-c(z1,z2,z3)
  
 ratings<-rmvordlogis(100, thetas, model="grm", IRT=FALSE, z.vals=z)
  
 #descriptive statistics and correlation matrix
 describe(ratings)
 round(cor(ratings),2)
  
 #factor analyses: bifactor (omega) and varimax-rotated principal components
 bifactor<-omega(ratings, nfactors=3, plot=FALSE)
 bifactor
 omega.diagram(bifactor, main="")
 principal(ratings, nfactors=3)
  
 #graded response model
 descript(ratings)
 model<-grm(ratings)
 model  
