Thursday, April 25, 2013

A Call for Context-Aware Measurement

Context awareness seems to be everywhere, and everyone seems to be saying that context matters.  Gartner foresees "a game-changing opportunity" in what it calls context-aware computing.  The title of their report states it best: "Context Shapes Demand at Moments of Truth."  Their reasoning is straightforward.  They assume that what you want (your preferences) depends on the consumption context.  Once they know your context (e.g., from your mobile devices or on-site surveillance), they can feed you information, advertising, promotions and anything else that would help sell their products or services.  This is another sign of the growing acceptance that context plays a determinant role in consumer decision making.

If who, what, where, when and why are so important, how do we explain context-free measurement in marketing research?  For example, "What Do Customers Really Want?" is a Harvard Business Review article outlining "best practices" for the measurement of preference.  According to these Bain and Company researchers, rating scales are "blunt instruments" unable to distinguish between "nice to have" and "gotta have."  Thus, instead of rating scales, customers are asked to make a series of trade-offs among different sets of four restaurant attributes, such as:
  • Food served hot and on time,
  • New specials weekly,
  • Healthful menu options, and
  • Portions are just right.
The authors came to the conclusion that the most important attribute for their restaurant client was food served hot and on time.  Obviously, no one was ordering a cool crisp salad for lunch or dinner.  In fact, the meal occasion was never mentioned.  Shouldn't it make a difference whether the meal is breakfast, lunch, dinner, or a late night snack?  Does it matter if the meal is with family, friends, or business clients?  What if the meal was a date, a birthday, or just because you have nothing to eat in the house or don't feel like cooking?  How can our measurements be context aware when our questions contain no contextual information?

Our response scale is no better.  There is no contextual information in an importance rating.  Nor is there any contextual information in the MaxDiff or best-worst scaling that was advocated in the above article.  Let us start by adding context to the response scale.  What would be the impact of serving "portions that were just right"?  In order for a respondent to be able to answer this question, we need to add a context and make the attribute more specific so that the respondent has something concrete to consider.  For instance, what customer behaviors do we seek to change by offering small portion sizes at cheaper prices?  Is it something that customers would not care about, would be nice to have, or would it break the tie between two otherwise equal restaurants and get them to select one restaurant over another?  This is an example of context-aware measurement.  We have a response scale with several marketplace alternatives that encourage our respondents to think about their likely response within a realistic purchase context.

It's Simpler If You Make It More Complex

Before I move on to the measurement model and its implementation in R, I recommend that you visit the Hartman Group's website called the World of Occasions.  Here you will find restaurants as one of many possible eating occasions.  Each occasion is defined by who, when, where, what and why.  You should note what might seem to be a paradoxical effect.  The Hartman Group appears to have made things more complex with their multifaceted world of occasions.  However, once an occasion has been identified, the consumer uses the context to help them reach their purchase decisions.  This is a form of situated cognition.  Consumers simplify their decision tasks by using the occasion to structure the problem space.  Instead of needing to reassess preference anew in each situation, preference is tied to the occasion so that identifying the context activates the preference structure and simplifies the choice process.  In fact, once the context is seen as a particular type of occasion, it evokes its own action possibilities, which is a property of all perceived affordances.

The Rating Scale Isn't Broken - You Just Don't Know How to Use It

We are ready for the last component, the context-aware measurement model, which we will estimate using the R statistical programming language.
Context-free measurement provides few specifics and asks the respondent to fill in the blanks.  The respondent is free to use any criteria they wish to decide which feature is the most or least important.  Similarly, although the endpoints may be marked with words such as "not at all" or "extremely," importance itself is left undefined when a rating scale is used.   As a result, respondents will give us responses when we ask them to select the best and the worst among alternatives with no clear external referents.  What are "healthful menu options" or "new specials weekly"?  My favorite is "portions are just right."  Who doesn't want portions to be just right?  It is not uncommon for respondents to report that attributes such as color or price are unimportant until they are shown specific colors ("I will not buy it in that color") or actual price points ("I'm not paying that much").  Without a realistic context, the entire data collection process becomes an exercise in semantic reasoning yielding little or no actionable findings.

Context-aware measurement resolves the ambiguity by using behavioral anchors.  But we lose the "flexibility" of context-free measurement.  Context is added by using feature descriptions with high imagery and specificity.  As we saw earlier, "portions are just right" leaves too much to the imagination.  We need to get much more specific in order for respondents to tell us how they are likely to behave in a realistic context, which means we need a lot more items and some way to manage respondent burden.

We are using the respondent as the measuring instrument: input = product or service features and output = behavioral response.  Each contextualized feature or service is evaluated by the respondent and assigned a score using the behavioral anchors.  We believe that the "pull" or "attractiveness" of the features and services can be measured along a continuum with thresholds separating the different scale values.  For example, a respondent is shown Attribute A and imagines their likely response using their thresholds:  Do I care about Attribute A?  Would I pick one over the other because of Attribute A?  Would I pay more to get Attribute A?  All this is shown in the figure below with the change in color indicating the threshold that needs to be exceeded before the next response category is picked.
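
The threshold idea can be sketched in a few lines of base R.  The function below writes the probability of each behavioral response category as differences of cumulative logistic curves, as in the graded response model; the threshold and discrimination values are purely illustrative, not estimates.

```r
# Probability of each behavioral category given an attribute's "pull,"
# using cumulative logistic curves (graded response model form).
category_probs <- function(pull, thresholds, discrimination = 1.5) {
  p_above <- plogis(discrimination * (pull - thresholds))  # P(response beyond each threshold)
  c(1 - p_above[1], -diff(p_above), p_above[length(p_above)])
}

# Three illustrative thresholds separating "don't care" / "nice to have" /
# "tie-breaker" / "would pay more"
th <- c(-1, 0, 1)
round(category_probs(pull = 0.5, thresholds = th), 2)  # four probabilities summing to 1
```

A feature with more pull shifts probability mass toward the higher behavioral categories; the thresholds mark where the next category becomes more likely.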

The theoretical foundation for this measurement procedure is the belief that it reflects the simplification processes that consumers actually use to make purchases.  That is, consumers compare very different features and services by placing them all on the same preference scale.  In order to decide that "you want a fine dining experience, but it costs too much," you must be able to place your desire for a fine dining experience on the same scale as the cost of the meal.  It is essential that we provide a marketplace justification for our measurement technique.  Otherwise, we find ourselves playing games with attribute comparisons that have little or no real world implications.

However, we still have a self-report.  The customer is not actually experiencing the contextualized feature or service.  As Gilbert and Wilson note, "Just as retrospection refers to our ability to re-experience the past, prospection refers to our ability to 'pre-experience' the future by simulating it in our minds."  Thus, respondents read "seldom more than one or two cars ahead of you in the drive-thru" and note their reactions upon imagining themselves in that situation.  Then, they translate that reaction into one of the behavioral categories.  Gilbert and Wilson list all the errors associated with this form of mental simulation.  "The devil is in the details" is one way to summarize their more comprehensive discussion.  For instance, it makes a difference if the two cars ahead of you are filled with people who seem to be having some difficulty deciding what they want to eat.  But it is unlikely that this is what you thought about when you first read the description.

Two Forces at Work Creating a Single Evaluative Dimension

Now, we are ready to search for underlying structure.  Our goal is learning what customers really want, and we already know a great deal from observing successful offerings in the marketplace.  Products and services evolve over time to satisfy customer demands.  Customers learn what they want as they consume what is available.  This is a simultaneous evolution as brands improve their offerings to better satisfy their customers' demands that are changing at the same time in response to product improvements.  When an equilibrium is reached, we have a relatively stable configuration  taking the form of an underlying continuum of product and services ranging from the basics that most want to the high-end that fewer and fewer customers seek.  This is a shared understanding between providers and consumers.  It is communicated by brand messaging and repeated in the press and through word of mouth.

As consumers become familiar with the product categories (e.g., fast food, casual dining, fine restaurants, buffets, pubs, cafes, diners, drive-ins and dives), they learn about this product and service continuum and determine where they fall along this dimension.  That is, consumers learn what is available and decide how much they want given what is available.  I have used the phrase "decide how much they want" because providers have deliberately configured their product and service offerings so that there is always a continuum from basic to premium (good, better, best).  A basic product with only those features and services defining the product category is offered at the lowest price to appeal to the low-end user.  Additional features are included at some cost in a cumulative fashion to attract the more demanding users.  Finally, you will find the premium offering targeting the highest end of the market.

If we are careful when we specify the restaurant type and the eating occasion, we can anticipate uncovering a single continuum along which all the contextualized features and services can be arrayed.  Obviously, consumers are looking for different benefits when they go to an all-you-can-eat buffet rather than an upscale restaurant.  However, is restaurant type alone sufficient to yield a single dimension?  For instance, is everyone who decides to go to an all-you-can-eat buffet looking for about the same features and services?  Can I array all the contextualized features and services along a single dimension in terms of their pull or attraction?  The answer is no if customers are looking for different products and services when they visit the buffet for lunch rather than dinner.  The answer is also no when older patrons are seeking different benefits (the rationale for early bird specials).

Our first test might be to assess the fit of the graded response model.  The graded response model (grm) is one of many models from item response theory covered by the R package ltm.  It is used with ordinal scales such as the behavioral anchors tapping pull or attraction.  The assumption is that respondents possess different levels of the underlying trait being measured by the items.  In this case the items are descriptions of contextualized products and services.  Thus, respondents differ in how much they want or demand with some seeking only basic products and services and others desiring much more than the basics.  The score that an individual assigns to a specific item depends on the relationship between the individual's demand level and the item's location on the same scale.  Each item will have its own set of thresholds.  Basic products and services that everyone expects to be delivered will receive higher ratings from everyone but only the highest ratings from the more demanding.  On the other hand, those products and services that only the most demanding expect or want will be rated higher only by individuals demanding more.
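
A fit of the graded response model with ltm might look like the sketch below.  It assumes the contextualized ratings sit in a data frame called ratings (a hypothetical name), one ordinal item per column.

```r
# Minimal grm() sketch with the ltm package; 'ratings' is a hypothetical
# data frame of ordinal responses, one contextualized item per column.
library(ltm)

fit <- grm(ratings)       # unconstrained: each item gets its own discrimination
summary(fit)              # thresholds (Extrmt1-Extrmt4) and Dscrmn per item
plot(fit)                 # item response category characteristic curves
head(factor.scores(fit)$score.dat)  # respondents' latent demand estimates
```

The item thresholds tell us where each contextualized feature sits on the demand continuum, and the factor scores place the respondents on the same scale.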

The R package lordif (logistic ordinal regression differential item functioning) provides the test and identifies individual items that function differently for different groups of respondents.  We are not testing whether different groups are more or less demanding.  Mean level differences are not a problem.  Differential item functioning occurs when the item parameters have different values for different groups after controlling for latent demand.  For example, if the lunch crowd was just less demanding, we would not have differential item functioning since all the ratings would be lower.  It is only when lunch and dinner customers want different things that we are forced to separate the two groups.  Differential item functioning will check if time of day (lunch, early bird, dinner) or other observed variables matter.  We stop once we find a single dimension with ordinal responses explained by the graded response model.
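
A DIF check with lordif might be sketched as follows, where 'ratings' (the item response matrix) and 'daypart' (a lunch / early bird / dinner grouping factor) are hypothetical names standing in for your own data.

```r
# Differential item functioning sketch with the lordif package;
# 'ratings' and 'daypart' are hypothetical objects from your own study.
library(lordif)

dif <- lordif(ratings, daypart, criterion = "Chisqr", alpha = 0.01)
summary(dif)  # flags items whose parameters differ across groups
              # after controlling for the latent demand level
```

Items that are flagged want different things said about them at lunch than at dinner; unflagged items differ across groups only in overall level, which is not DIF.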

Exploiting the Single Dimension from Market Structure and Consumer Simplification Strategies

It is not difficult to imagine having to ask many contextualized product and service items.  As an example, if we think about convenience of a fast food restaurant, it is easy to generate a long list of questions that make fast food quick and easy: location relative to home or work, ease of entry/exit, drive-thru, parking, wait to order, ease of ordering, wait for food preparation, payment alternatives and time, difficulty eating there or in your car, and many more at increasing levels of specificity.  Although this level of detail is needed if remedial actions are to be taken based on the data collected, we are reluctant to ask each respondent to complete such a long list of ratings.  Obviously, fatigue and completion rates are concerns.  In addition, long rating batteries tend to create their own "microclimates" and limit the generalizability of the findings.

In my previous post, Reducing Respondent Burden:  Item Sampling, I demonstrated how the ltm R package can be used when one has "planned missing data."  I can estimate a respondent's value on the underlying latent trait using only a subset of all the items.  The items that each respondent is asked to rate can be a random sample (item sampling) or chosen sequentially based on previous responses (computerized adaptive testing).  In the previous post, it was shown how the grm() function in the ltm package provides both detailed diagnostic information about all the items and accurate estimates of individual respondents' latent traits even when only half of the items have been randomly sampled.
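
The item-sampling idea can be sketched as below: grm() tolerates NAs, so each respondent rates only a random half of the items and can still be placed on the latent trait.  Here 'ratings' is again a hypothetical complete response matrix used as the sampling frame.

```r
# Item sampling sketch: knock out a random half of each respondent's
# items, then fit grm() on the planned missing data ('ratings' is a
# hypothetical matrix of ordinal responses).
library(ltm)

sampled <- as.matrix(ratings)
for (i in seq_len(nrow(sampled))) {
  skipped <- sample(ncol(sampled), ncol(sampled) %/% 2)
  sampled[i, skipped] <- NA          # this respondent never saw these items
}

fit   <- grm(as.data.frame(sampled))
theta <- factor.scores(fit, resp.patterns = sampled)  # trait estimates despite the holes
```

Each respondent answers half as many questions, yet every item still gets rated by roughly half the sample, so the item parameters remain estimable.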

We have taken advantage of the constraints imposed by the marketplace to identify a single underlying preference scale along which both consumers and the products/services they want can be aligned.  Our client now knows which actions will give them a competitive edge and which will yield additional revenues.  Moreover, we have learned the location of all the respondents along the same continuum.  We can use this latent trait as we would any other individual difference measure, for example, we can track changes over time or enter it as a variable in a regression equation.

Monday, April 8, 2013

Halo Effects vs. Intention-Laden Ratings: Separating Baby and Bathwater

Are halo effects real or illusory?  Much has been written arguing that rating scales contain extensive amounts of measurement bias.  Some tell us to avoid ratings altogether (What do customers really want?).  Others warn against the use of ratings scales without major adjustments (e.g., overcoming scale usage heterogeneity with the R package bayesm).  Obviously, by including the baby and bathwater idiom, I believe that there may be something "real" in those halo effects - real in the sense that a tendency to rate all the items higher or all the items lower may tell us much about one's intentions toward the entity being rated.

A Concept from Performance Appraisal

The concept of a halo effect flows from a dualism inherent in some theories of human perception.  There is a world independent of the observer, and there are reports about that world from human informants.  It is not an accident that the first use of the term "halo effect" came from human resources.  Personnel decisions are supposedly based on merit.  An employee's performance is satisfactory or it is not.  We do not collect supervisor ratings because we care about the feelings of the supervisor.  The supervisor is merely the measuring instrument, and in such cases, halo effects seem to be a form of measurement bias.  To be clear, halo effects are only bias when the informant is the measurement instrument and nothing more.  If we cared about the feelings of our observers, then removing the "halo effect" would constitute throwing the baby out with the bathwater.

Why Intention-Laden Ratings?

The title, Intention-Laden Ratings, comes from N.R. Hanson's work on theory-laden observation.  According to Hanson, when Galileo looked at the moon, he did not see discontinuities on the lunar surface.  He saw craters; he saw the quick and violent origins of these formations.  Similarly, the phrase "intention-laden ratings" is meant to suggest that ratings are not simply descriptive. Observations are not theory-free, and ratings are not intention-free. Perception serves action, and ratings reflect intentions to act. Failure to understand this process dooms us to self-defeating attempts to control for halo effects when what is called "halo" is the very construct that we are trying to measure.

We seem to forget that we are not reading off numbers from a ruler. People provide ratings, and people have intentions. I intend to continue using R for all my statistical analysis. I climbed up that steep learning curve, and I am not going back to SPSS or SAS. Still, I would not give top-box ratings to every query about R. R has its own unique pattern of strengths and weaknesses, its own "signature." There may be considerable consensus among R users due to common experiences learning and using R and because we all belong to an R community that shares and communicates a common view. Moreover, I would expect that those who have not made a commitment to R see the same relative pattern of pros and cons. They would, however, tend to give lower ratings across all the items.  What some might call "halo" in brand ratings can more accurately be described as intention-laden ratings, in particular, a commitment or intention to continued brand usage.  If it helps, think of intention-laden ratings as a form of cognitive dissonance.

R Enables Us to Provide an Example

An example might help explain my point.  Suppose that we collected five-point satisfaction ratings from 100 individuals who responded "yes" when asked if they use R for at least some portion of their statistical computing.  Let us say that we asked for nine ratings tapping the features that one might mention when recommending R or when suggesting some other statistical software.  Here is a summary of the nine ratings.

Number of Respondents Giving Each Score 1-5
var mean sd 1 2 3 4 5
1 4.59 0.89 2 2 9 9 78
2 4.46 1.08 4 4 9 8 75
3 4.57 0.84 1 2 11 11 75
4 3.79 1.30 6 12 25 11 46
5 3.84 1.24 7 4 31 14 44
6 3.64 1.35 12 5 28 17 38
7 3.52 1.42 10 19 18 15 38
8 3.60 1.37 12 10 19 24 35
9 3.53 1.32 11 9 28 20 32

This is what you generally find from users of a brand.  These users are satisfied with R, as you can see from the large numbers under the top-box column (labeled "5").  R seems to be doing particularly well on the first three rating items.  However, there seem to be some users who are less happy with the features tapped by the later ratings, especially the last three or four items.  One should note that the range of the means is somewhat limited (4.59 - 3.52 = about 1 point on a five-point scale).  I draw your attention to this because many do not realize that this one point difference in the means represents more than a doubling of the percentage in top-box ratings (78/38 = more than 2).

I have not provided labels or names for the nine ratings.  Can you?  What are the attributes getting the highest ratings?  Is it cost, availability, or graphics?   It certainly is not any of the features mentioned by Patrick Burns in his R Inferno.  No one is suggesting that those incomprehensible error messages are a plus for R.  At a more general level, the lowest ratings might be ease of learning, documentation, or the absence of a GUI.  Of course, none of this is unique to R.  What are the pros and cons of using SPSS or SAS or any other statistical software?

Forget software: what does a smartphone do well, and when does one experience problems?  Ask users why they use an iPhone, and ask nonusers why they do not use an iPhone.  Now ask both buyers and rejecters to rate all the attributes just mentioned.  The ratings from both groups will follow the same rank order; however, the buyers will give uniformly higher ratings across all the attributes.  Every product has its own unique pattern of good and bad points.  It is all on the web in product reviews and social media.

Now, where's the halo effect?  We will need to look at the interconnections among the ratings, for example, the correlation matrix.

var 1 2 3 4 5 6 7 8 9
1 1.00 0.36 0.41 0.54 0.21 0.28 0.27 0.37 0.33
2 0.36 1.00 0.36 0.31 0.24 0.45 0.36 0.30 0.35
3 0.41 0.36 1.00 0.27 0.20 0.31 0.20 0.26 0.32
4 0.54 0.31 0.27 1.00 0.44 0.33 0.29 0.49 0.53
5 0.21 0.24 0.20 0.44 1.00 0.27 0.26 0.30 0.45
6 0.28 0.45 0.31 0.33 0.27 1.00 0.47 0.49 0.54
7 0.27 0.36 0.20 0.29 0.26 0.47 1.00 0.38 0.51
8 0.37 0.30 0.26 0.49 0.30 0.49 0.38 1.00 0.41
9 0.33 0.35 0.32 0.53 0.45 0.54 0.51 0.41 1.00

What should you look for in this correlation matrix?  First, the correlations are all positive; therefore, we suspect that there might be a strong first principal component.  However, the correlations are not uniform; they range from 0.20 to 0.54.  The correlations among the last four ratings seem somewhat higher than the others.  Still, we do not wish to forget that these are the ratings with some of the largest standard deviations.  When we examined the frequency distributions earlier, we saw the sizable top-box percentages for the first few ratings.  Such restriction of range can attenuate correlations.  That is, the correlations might have been more uniform had we not encountered such severe truncation at the higher levels for the first few ratings.

A factor analysis might help us uncover the structure underlying these correlations.  We start by looking at the first principal component.  It accounts for 43% of the total variation.  A strong first principal component reflects our finding that all the correlations were positive and many were sizable (between 0.20 and 0.54).  As the percentage of the total variation accounted for by the first principal component grows, we will become more convinced that the perceptions of R can be explained by a single evaluative judgment. 

Although I have used the term "factor analysis," the pattern matrix below with standardized loadings came from a principal component analysis where we used a varimax rotation of the first three principal components.

PC1 PC3 PC2
1 0.04 0.39 0.75
2 0.57 0.00 0.52
3 0.20 0.04 0.78
4 0.12 0.78 0.38
5 0.19 0.76 -0.02
6 0.79 0.19 0.20
7 0.78 0.21 0.04
8 0.44 0.48 0.25
9 0.58 0.56 0.14
PC1 PC3 PC2
SS loadings 2.18 1.97 1.71
Proportion Var 0.24 0.22 0.19
Cumulative Var 0.24 0.46 0.65

The first three principal components account for 65% of the variation.  We seem to be able to separate the nine ratings into three overlapping groups:  the first three ratings (1-3), the next two ratings (4-5), and the last four ratings (6-9).  However, there are a number of ratings with loadings on two varimax-rotated principal components (e.g., variables 2, 8, and 9).  So, what do we have here?  We can argue for the existence of three separate and orthogonal components by discounting the dual loading of our three ratings as poorly written items.  For instance, we might claim that if the items had been written more clearly, they would have loaded on only one of the rotated principal components.  The factor structure is not definitive and requires a considerable amount of "explaining away" in order to maintain a three-factor interpretation.
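
This analysis can be reproduced in base R alone from the published correlation matrix (no packages needed; varimax() ships with the stats package).

```r
# The 9 x 9 correlation matrix reported above, entered by row.
R_cor <- matrix(c(
  1.00, 0.36, 0.41, 0.54, 0.21, 0.28, 0.27, 0.37, 0.33,
  0.36, 1.00, 0.36, 0.31, 0.24, 0.45, 0.36, 0.30, 0.35,
  0.41, 0.36, 1.00, 0.27, 0.20, 0.31, 0.20, 0.26, 0.32,
  0.54, 0.31, 0.27, 1.00, 0.44, 0.33, 0.29, 0.49, 0.53,
  0.21, 0.24, 0.20, 0.44, 1.00, 0.27, 0.26, 0.30, 0.45,
  0.28, 0.45, 0.31, 0.33, 0.27, 1.00, 0.47, 0.49, 0.54,
  0.27, 0.36, 0.20, 0.29, 0.26, 0.47, 1.00, 0.38, 0.51,
  0.37, 0.30, 0.26, 0.49, 0.30, 0.49, 0.38, 1.00, 0.41,
  0.33, 0.35, 0.32, 0.53, 0.45, 0.54, 0.51, 0.41, 1.00), 9, 9)

eig <- eigen(R_cor)
eig$values[1] / sum(eig$values)                 # first PC's share of total variance (~43%)

loadings <- eig$vectors[, 1:3] %*% diag(sqrt(eig$values[1:3]))
rotated  <- varimax(loadings)
round(unclass(rotated$loadings), 2)             # rotated pattern matrix, as in the table
round(colSums(unclass(rotated$loadings)^2), 2)  # SS loadings per component
```

Because varimax is an orthogonal rotation, each item's communality is unchanged; only the allocation of loadings across the three components moves.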

So far, we have run the traditional analyses.  Let us finish with a bifactor rotation that seeks to fit a factor model with both a general halo effect and specific orthogonal factors.  You can think of it as removing the general factor and then running a second factor analysis on whatever correlation remains.  You can find the details in a previous post.  You should note in the figure below that all nine ratings have general factor loadings from g.  In addition, we have the contribution of specific orthogonal factors F1*, F2*, and F3*.  All nine ratings have some correlation due to a common general factor g, and some ratings have even higher correlations because of the specific factors.  We still do not know if that general factor represents something real and important (baby) or can be attributed to measurement bias (bathwater).  All we have is the correlation matrix, and the correlation matrix alone will not resolve such indeterminacy.



The "Truth" is Revealed

I have not mentioned that all the R code needed to replicate this analysis is listed at the end of this post.  This was temporarily "hidden" from you since the data are not real but simulated using a function from the R package ltm.  I used a graded response model to generate the nine ratings.  However, instead of describing how the data were generated, I will first fit a graded response model to the nine ratings.  Once you see the "results," it will be easier to understand how the data were created.

I introduced the graded response model in an earlier post.  It is relatively straightforward.  We start with an underlying dimension along which respondents can be arrayed.  This is an individual difference dimension, so you should be thinking about seriation, ordination, ranking, or just lining up single file.  For example, I could ask my 100 respondents to stand in a line from shortest to tallest or least smart to most smart.  Height is an individual difference dimension, and so is intelligence.  But height explains behavior, such as Einstein's inability to reach a book on the upper shelf of his bookcase.  Saying that Einstein was not smart enough to solve a physics problem explains nothing.  [For a more complete treatment, see Borsboom, The Theoretical Status of Latent Variables, 2003.]

Failure to understand this distinction will result in confusion.  Individual difference dimensions are constructed to differentiate among individuals who often are very different from each other at different locations along the continuum.  Consider the spelling bee.  The knowledge and the skill needed to spell easy words is very different from the knowledge and skill needed to spell difficult words.  Yet, we can array individuals along a dimension of spelling ability, understanding that very different cognitive processing is being used by different ability spellers at different points along the scale.  Spelling is just one example.  You can substitute your favorite comparative dimension and ask if the same criteria are used to differentiate among individuals at the lower and upper ends of the measure (e.g., physical fitness, cooking skills, or programming ability).

Returning to R, individuals with more negative opinions of R have different knowledge and usage experiences than those having favorable impressions.  Thus, if I wanted to differentiate among all respondents regardless of how much they liked or disliked R, I would need to include both some very "easy" items (e.g., downloading and installing) and an array of more "difficult" ratings (e.g., debugging and interpreting error messages).  When I say easy, I mean easy for R to achieve high levels of satisfaction among users.  If a respondent does not believe that R is easy to download and install, they do not have a favorable impression of the software.

This is a latent trait or item response model.  Individuals can be arrayed along a single dimension, and ratings items can be placed along that same dimension.  Actually, it is not the item that is placed along the continuum, but the separate scale levels.  According to the model, an individual with a certain level of R favorableness reads an item and decides which score best reflects their assessment using the five-point scale.  The pro-R individual might look at the top-box and say "Yes, that describes me" or "No, a four is more accurate."  It is all easier to understand if we look at some coefficients.

Extrmt1 Extrmt2 Extrmt3 Extrmt4 Dscrmn
Item 1 -2.47 -2.14 -1.46 -1.03 2.32
Item 2 -2.24 -1.80 -1.24 -0.91 2.06
Item 3 -3.34 -2.68 -1.58 -1.02 1.73
Item 4 -2.02 -1.22 -0.32 0.06 1.96
Item 5 -2.51 -2.10 -0.43 0.18 1.27
Item 6 -1.60 -1.31 -0.20 0.41 1.85
Item 7 -2.00 -0.91 -0.16 0.44 1.48
Item 8 -1.62 -1.10 -0.37 0.50 1.79
Item 9 -1.57 -1.10 -0.07 0.61 2.11

All these coefficients (except Dscrmn=discrimination index, see previous post for details) can be interpreted as z-scores with mean=0 and standard deviation=1.  They are the cutoffs for assigning scores to the rating items.  If I had a very negative impression of R, say two standard deviations below the mean, what ratings would I give?  Item #1 requires a z-score below -2.47 before a "1" was assigned.  I would not give a "2" because that score would be assigned only if the z-score were between -2.47 and -2.14.  So, I would give a "3" to Item #1.  Item 2 gets a "2" and so on.  The last items are all given ones.

To test your understanding, how would a person with a z-score of +1 rate the nine items?   It should be clear that such an individual would give all fives to all nine items for a total score of 45.  And that is what 10% of our respondents did.  Our nine items were not very "difficult" because even an average person with z=0 would have given three 5's and six 4's.  We seem to have low standards and some serious grade inflation.  It is easy for R to do well when these nine items are the criteria.  We need some more difficult items, meaning items that fewer respondents will give top-box ratings.  Like spelling bees, the difficult items that differentiate users at the upper end may look very different from those items that differentiate users in the middle of the scale.
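
The scoring walk-through can be checked in a few lines of base R: given the estimated cutoffs, a deterministic reading assigns the category whose interval contains the respondent's z-score (findInterval counts the cutoffs at or below z).

```r
# Cutoffs (Extrmt1-Extrmt4) from the fitted graded response model table.
cutoffs <- rbind(
  c(-2.47, -2.14, -1.46, -1.03),   # Item 1
  c(-2.24, -1.80, -1.24, -0.91),   # Item 2
  c(-3.34, -2.68, -1.58, -1.02),   # Item 3
  c(-2.02, -1.22, -0.32,  0.06),   # Item 4
  c(-2.51, -2.10, -0.43,  0.18),   # Item 5
  c(-1.60, -1.31, -0.20,  0.41),   # Item 6
  c(-2.00, -0.91, -0.16,  0.44),   # Item 7
  c(-1.62, -1.10, -0.37,  0.50),   # Item 8
  c(-1.57, -1.10, -0.07,  0.61)    # Item 9
)
rate <- function(z, cuts) apply(cuts, 1, function(k) findInterval(z, k) + 1)

rate(-2, cutoffs)  # the very negative respondent: Item 1 -> 3, Item 2 -> 2, Item 9 -> 1
rate( 0, cutoffs)  # the average respondent: three 5's and six 4's
rate( 1, cutoffs)  # z = +1: all fives, total score 45
```

Of course, the model itself is probabilistic; this deterministic reading just shows which category each z-score makes most likely.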

Where Did the Factors Come From?

You might be wondering where the three factors came from if the data were generated using a single dimension.  First, the factor structure is not the simple structure shown in textbook discussions of varimax rotations.  One must ignore the fact that one-third of the items load on multiple factors.  Second, although the match is far from perfect, there is a correspondence between the factors and differences in item difficulty.  Correlations are sensitive to skewness in the distributions of the variables so that one can mistakenly conclude that they have found factors when the differences in correlations are the result of differences in item means creating skewed distributions.

One last point needs to be made.  I wanted this data set to be realistic, so I did not assume that all customers were homogeneous.  Instead, it is more likely that the customer base for any product is composed of different customer types:  the hostages who are unhappy but cannot switch, the loyals who find special joy in the product, and the largest group of customers who are satisfied so that they are neither actively looking for alternatives nor are they brand advocates.

The rmvordlogis function that generated the 100 respondents takes cutpoints for the items in an argument called thetas and a set of latent trait scores for the respondents in an argument called z.vals.  You can see in the code below that my z is a mixture of three normal distributions on the underlying latent trait.  The first group, z1, with 65 of the 100 respondents, represents the normally distributed satisfied customers with mean=0 and standard deviation=1.  Then I added 5 respondents in z2, the hostages, with mean=-2 and a smaller standard deviation of 0.5.  Finally, I included 30 loyal customers in z3 with mean=2 and SD=0.5.  This was done in order to produce a more realistic, skewed distribution on the latent trait.  Only when the switching costs are excessive do we find customer bases with sizable percentages of unhappy customers.

Saving the Baby

One needs to be careful when generalizing from performance appraisals, which seek an unbiased observer, to ratings of brand performance, where the "bias" is the very entity being measured.  Controlling for halo effects makes sense when we wish to exclude the observer's feelings, but brand ratings are not one of those cases.  The general factor is not measurement bias (the halo effect) but brand equity, and that is what we wish to measure.  If we begin with a clear conception of the underlying construct, we will not make this mistake.

Measurement starts with a statistical model seeking to capture the data generation process.  As we have seen, three forces are at work generating brand ratings:  the product, the user, and the environment.  First, the product is real, with affordances and constraints.  Every product has its own profile of strengths and weaknesses that exists independent of user perception.  Second, customers may reach different overall product evaluations since each has his or her own needs and usage experiences.  However, such differences can be accommodated by shifting the overall level of all the ratings; there is no need to change the relative rankings of the features.  The product remains stronger on Feature A than on Feature B even when both features receive lower ratings.  Third, all of this is shared through messaging by the brand, reviews in the press, and talk on social media.  The result is a common view of the strengths and weaknesses of each product, which I have called the "brand signature."  Not everyone will give the same ratings, but those ratings will follow the same pattern of higher and lower relative scores.

Finally, I generated a data set using the graded response model and showed how it produces results not unlike what is usually seen when we collect brand ratings.  The product features or services that are easy for a brand to provide differ in kind from those that are difficult for it to provide.  For example, most would agree that R scores well on availability.  There are no payments that need to be authorized.  It is easy to download the base package and install updates or additional packages.  But the documentation is not quite at the level of an SPSS or a SAS.  Yet, documentation and availability are not two separate individual difference dimensions because availability will always be rated higher than documentation.  Your score on the single latent trait and the relative positions of the items on the same scale are sufficient to predict all your ratings.  Documentation and availability denote different product aspects that function together as a single entity to differentiate satisfied from unsatisfied users.
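Because a single trait plus the item positions predicts everything, the easier feature is expected to be rated higher at every trait level.  A sketch of the expected rating curves under the graded response model, using illustrative parameters echoing the simulation code below (the "availability" and "documentation" labels are mine, not estimated values):

```r
# Expected 1-5 rating under the graded response model:
# E[X | z] = 1 + sum_k P(X >= k | z), where P(X >= k | z) = plogis(a * (z - b_k))
a <- 1.2
b_easy <- c(-4.0, -3.1, -2.1, -1.1)  # an "availability"-like item (illustrative)
b_hard <- c(-2.3, -1.3, 0.3, 1.7)    # a "documentation"-like item (illustrative)

expected_rating <- function(z, b) 1 + sum(plogis(a * (z - b)))

z <- seq(-3, 3, by = 1)
easy <- sapply(z, expected_rating, b = b_easy)
hard <- sapply(z, expected_rating, b = b_hard)
all(easy > hard)  # the easier item outscores the harder one at every z
```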

Individual difference dimensions are heterogeneous because they attempt to differentiate at different ends of the scale. It is easy to misinterpret these mean level differences as if they were different factors or latent variables. We want to avoid making this mistake. The graded response model is the correct model specification for such individual difference dimensions.


 # Need two R packages
 library(psych)
 library(ltm)
  
 # Item parameters: four cutpoints per item plus a discrimination of 1.2
 thetas <- list()
 thetas[[1]] <- c(-4.0, -3.1, -2.1, -1.1, 1.2)
 thetas[[2]] <- c(-4.1, -3.2, -2.2, -1.2, 1.2)
 thetas[[3]] <- c(-3.9, -2.9, -1.9, -1.0, 1.2)
 thetas[[4]] <- c(-2.9, -1.9, 0.1, 1.1, 1.2)
 thetas[[5]] <- c(-2.8, -1.8, 0.2, 1.2, 1.2)
 thetas[[6]] <- c(-2.7, -1.7, 0.3, 1.3, 1.2)
 thetas[[7]] <- c(-2.1, -1.1, 0.1, 1.5, 1.2)
 thetas[[8]] <- c(-2.2, -1.2, 0.2, 1.6, 1.2)
 thetas[[9]] <- c(-2.3, -1.3, 0.3, 1.7, 1.2)
  
 # Set seed for replication
 set.seed(4972)
  
 # Latent trait as a mixture: satisfied (65), hostages (5), loyals (30)
 z1 <- rnorm(65)
 z2 <- rnorm(5, -2, 0.5)
 z3 <- rnorm(30, 2, 0.5)
 z <- c(z1, z2, z3)
  
 # Generate 100 respondents' ratings from the graded response model
 ratings <- rmvordlogis(100, thetas, model = "grm", IRT = FALSE, z.vals = z)
  
 # Descriptive statistics and correlations
 describe(ratings)
 round(cor(ratings), 2)
  
 # Factor analyses: bifactor model (omega) and principal components
 bifactor <- omega(ratings, nfactors = 3, plot = FALSE)
 bifactor
 omega.diagram(bifactor, main = "")
 principal(ratings, nfactors = 3)
  
 # Graded response model
 descript(ratings)
 model <- grm(ratings)
 model