
Tuesday, January 8, 2013

Item Response Modeling of Customer Satisfaction: The Graded Response Model


After several previous posts introducing item response theory (IRT), we are finally ready for the analysis of a customer satisfaction data set using a rating scale.  IRT can be multidimensional, and R is fortunate to have its own package, mirt, with excellent documentation (R. Philip Chalmers).  But the presence of a strong first principal component in customer satisfaction ratings is such a common finding that we will confine ourselves to a single dimension, unless our data analysis forces us into multidimensional space.  And this is where we will begin this post: by testing the assumption of unidimensionality.
Next, we must run the analysis and interpret the resulting estimates.  Again, R is fortunate in that Dimitris Rizopoulos has provided the ltm package.  We will spend some time discussing the results because it takes a couple of examples before it becomes clear that rating scales are ordinal, that each item can measure the same latent trait differently, and that different items differentiate differently at different locations along the individual difference dimension.  That is correct, I did say “different items differentiate differently at different locations along the individual difference dimension.”

Let me explain.  We are measuring differences among customers in their levels of satisfaction.  Customer satisfaction is not like water in a reservoir, although we often talk about satisfaction using reservoir or stockpiling metaphors.  The stuff that brands provide to avoid low levels of customer satisfaction is not the stuff that brands provide to achieve high levels of customer satisfaction.  Or to put it differently, that which upsets us and creates dissatisfaction is not the same as that which delights us and generates high positive ratings.  Consequently, items measuring the basics will differentiate among customers at the lower end of the satisfaction continuum, and items that tap features or services that exceed our expectations will do the same at the upper end.  No one item will cover the entire range equally well, which is why we have a scale with multiple items, as we shall see.

Overview of the Customer Satisfaction Data Set
Suppose that we are given a data set with over 4000 respondents who completed a customer satisfaction rating scale after taking a flight on a major airline.  The scale contained 12 ratings on a five-point scale from 1=very dissatisfied to 5=very satisfied.  The 12 ratings can be separated into three different components covering the ticket purchase (e.g., online booking and seat selection), the flight itself (e.g., seat comfort, food/drink, and on-time arrival/departure), and the service provided by employees (e.g., flight attendants and staff at ticket window or gate).

In the table below, I have labeled the variables using their component names.  You should note that all the ratings tend to be concentrated in the upper two or three categories (negative skew), but this is especially true for the Purchase and Service ratings, which have the highest means.  This is a common finding, as I have noted in an earlier post.
 

Proportion Rating Each Category on the 5-Point Scale, with Descriptive Statistics

              1      2      3      4      5    mean    sd   skew
Purchase_1  0.005  0.007  0.079  0.309  0.599   4.49  0.72  -1.51
Purchase_2  0.004  0.015  0.201  0.377  0.404   4.16  0.82  -0.63
Purchase_3  0.007  0.016  0.137  0.355  0.485   4.30  0.82  -1.09
Flight_1    0.025  0.050  0.205  0.389  0.330   3.95  0.98  -0.86
Flight_2    0.022  0.055  0.270  0.403  0.251   3.81  0.95  -0.60
Flight_3    0.024  0.053  0.305  0.393  0.224   3.74  0.94  -0.52
Flight_4    0.006  0.022  0.191  0.439  0.342   4.09  0.81  -0.66
Flight_5    0.048  0.074  0.279  0.370  0.229   3.66  1.06  -0.63
Flight_6    0.082  0.151  0.339  0.259  0.169   3.28  1.16  -0.23
Service_1   0.002  0.008  0.101  0.413  0.475   4.35  0.71  -0.91
Service_2   0.004  0.013  0.091  0.389  0.503   4.37  0.74  -1.17
Service_3   0.009  0.018  0.147  0.422  0.405   4.19  0.82  -0.97
 

And How Many Dimensions Do You See in the Data?
We can see the three components in the correlation matrix below.  The Service ratings form the most coherent cluster, followed by Purchase and possibly Flight.  If one were looking for factors, it seems that three could be extracted.  That is, the three Service items seem to “hang” together in the lower right-hand corner.  Perhaps one could argue for a similar clustering among the three Purchase ratings in the upper left-hand corner.  Yet the six Flight variables might give us pause because they are not that highly interrelated.  But they do have lower correlations with the Purchase ratings, so maybe Flight will load on a separate factor given the appropriate rotation.  On the other hand, if one were seeking a single underlying dimension, one could point to the uniformly positive correlations among all the ratings, which do not fall far from their average value of 0.39.  Previously, we have referred to this pattern of correlations as a positive manifold.
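The matrix itself is one line of R, assuming, as in the appendix, that the 12 ratings sit in a data frame named data:

round(cor(data), 2)   # correlation matrix for the 12 ratings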

             P_1   P_2   P_3   F_1   F_2   F_3   F_4   F_5   F_6   S_1   S_2   S_3
Purchase_1    .   0.43  0.46  0.34  0.30  0.33  0.39  0.28  0.25  0.39  0.40  0.37
Purchase_2  0.43    .   0.46  0.37  0.43  0.33  0.36  0.31  0.34  0.43  0.40  0.42
Purchase_3  0.46  0.46    .   0.29  0.36  0.34  0.41  0.29  0.30  0.38  0.44  0.45
Flight_1    0.34  0.37  0.29    .   0.37  0.43  0.37  0.45  0.35  0.44  0.44  0.40
Flight_2    0.30  0.43  0.36  0.37    .   0.42  0.38  0.35  0.45  0.40  0.39  0.36
Flight_3    0.33  0.33  0.34  0.43  0.42    .   0.52  0.38  0.40  0.33  0.38  0.35
Flight_4    0.39  0.36  0.41  0.37  0.38  0.52    .   0.35  0.35  0.35  0.37  0.37
Flight_5    0.28  0.31  0.29  0.45  0.35  0.38  0.35    .   0.38  0.37  0.39  0.40
Flight_6    0.25  0.34  0.30  0.35  0.45  0.40  0.35  0.38    .   0.40  0.38  0.35
Service_1   0.39  0.43  0.38  0.44  0.40  0.33  0.35  0.37  0.40    .   0.66  0.51
Service_2   0.40  0.40  0.44  0.44  0.39  0.38  0.37  0.39  0.38  0.66    .   0.70
Service_3   0.37  0.42  0.45  0.40  0.36  0.35  0.37  0.40  0.35  0.51  0.70    .

 
Do we have evidence for one factor or three?  Perhaps the scree plot can help.  Here we see a substantial first principal component accounting for 44% of the total variation, five times larger than the second component.  And then we have second and third principal components with values near one.  Are these simply scree (debris at the base of a hill) or additional factors needing only the proper rotation to show themselves?
The bifactor model helps to clarify the structure underlying these correlations.  The uniformly positive correlation among all the ratings shows up below in the sizeable factor loadings on g (the general factor).  The three components, on the other hand, appear as specific factors with smaller loadings.  And how should we handle Service_2, the only item with a specific factor loading higher than 0.4?  Service_2 does have the highest correlations across all the items, yet it follows the same relative pattern as the other service measures.  The answer is that Service_2 asks about service at a very general level, so that at least some customers are thinking not only about service but about their satisfaction with the entire flight.  We need to be careful not to include such higher-order summary ratings when all the other items are more concrete.  That is, all the items should be written at the same level of generality.
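For readers following along in R, both the scree plot and the bifactor model come from the psych package.  A minimal sketch, again assuming the 12 ratings are in a data frame named data:

library(psych)

scree(data, factors = FALSE)                                # scree plot of the principal components
round(prcomp(data, scale. = TRUE)$sdev^2 / ncol(data), 2)   # proportion of variance per component
omega(data)                                                 # bifactor model: g loadings plus specific factors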
[Figure: bifactor model with a general factor g and three specific factors]
IRT would prefer the representation in the bifactor model and would like to treat the three specific components as “nuisance” factors.  That is, IRT focuses on the general factor as a single underlying latent trait (the foreground) and ignores the less dramatic separation among the three components (the background).  Of course, there are alternative factor representations that will reproduce the correlation matrix equally well.  Such are the indeterminacies inherent in factor rotations.

The Graded Response Model (grm)
This section will attempt a minimalist account of fitting the graded response model to these 12 satisfaction ratings.  As with all item response models, the observed item response is a function of the latent trait.  In this case, however, we have a rating scale rather than a binary yes/no or correct/incorrect response; we have a graded response between very dissatisfied and very satisfied.  The graded response model assumes only that the observed item response is an ordered categorical variable.  The rating values from one to five indicate order alone and say nothing about the distances between the values.  The rating scale is treated as ordered but not equal-interval.
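A minimal sketch of fitting this model with the ltm package, assuming the 12 ratings are the columns of a data frame named data coded 1 through 5:

library(ltm)

fit <- grm(data)   # fit the graded response model to all 12 ordinal items
fit                # prints the extremity (cutpoint) and discrimination estimates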

As a statistical equation, the graded response model does not make any assertions concerning how the underlying latent variable “produces” a rating for an individual item.  Yet, it will help us to understand how the model works if we speculate about the response generation process.  Relying on a little introspection, for all of us have completed rating scales, we can imagine ourselves reading one of these rating items, for example, Service_2.  What might we be thinking?  “I’m not unhappy with the service I received; in fact, I am OK with the way I was treated.  But I am reluctant to use the top-box rating of five except in those situations where they go beyond my expectation.  So, I will assign Service_2 a rating of four.”  You should note that Service_2 satisfaction is not the same as the unobserved satisfaction associated with the latent trait.  However, Service_2 satisfaction ought to be positively correlated with latent satisfaction.  Otherwise, the Service_2 rating would tell us nothing about the latent trait.

[Figure: item response category characteristic curves for Service_2]
Everything we have just discussed is shown in the figure above.  The x-axis is labeled “Ability” because IRT models saw their first applications in achievement testing.  However, in this example you should read “ability” as the latent trait, here latent customer satisfaction.  The x-axis is measured in standardized units, not unlike a z-score.  Pick a value on the x-axis, for example, the mean of zero, and draw a perpendicular line at that point.  What rating is an average respondent with a latent trait score equal to zero likely to give?  They are not likely to assign a “1” because the curve for 1 has fallen to a probability near zero.  The probabilities for “2” and “3” are also small.  It looks like “4” and “5” are the most likely ratings, with almost equal probabilities of approximately one half.  It should be clear that, unlike our hypothetical rater described in the previous paragraph, we see no reluctance to use the top-box category for this item.
Each item has its own set of response category characteristic curves, like those shown above for Service_2, and each curve represents the relationship between the latent trait and one category of the observed rating.  But what should we be looking for in these curves?  What would a “good” item look like, or more specifically, is Service_2 a good item?  Immediately, we note that the top-box is reached rather “early” along the latent trait.  Everyone above the mean has a greater probability of selecting “5” than any other category.  This is consistent with the frequency table at the beginning of this post.  Half of the respondents gave Service_2 a rating of five, so Service_2 cannot differentiate between respondents at the mean, one standard deviation above the mean, or two standard deviations above the mean.
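Curves like the ones above can be drawn for any single item.  A sketch, assuming the columns follow the order of the tables in this post, so that Flight_6 is the 9th item and Service_2 the 11th:

plot(fit, type = "ICC", items = 11)   # category characteristic curves for Service_2
plot(fit, type = "ICC", items = 9)    # the same curves for Flight_6, discussed next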

Perhaps we should look at a second item to help us understand how to interpret these curves.  Flight_6 might be a good choice because its frequencies are spread more evenly over the five category values: 0.082, 0.151, 0.339, 0.259, and 0.169.

[Figure: item response category characteristic curves for Flight_6]
And that is what we see, but with a lot of overlapping curves.  What do I mean?  Let us draw our perpendicular at zero again and observe that an average respondent is only slightly more likely to assign a “3” than a “4,” with some probability that they would instead give a “2” or a “5.”  We would have liked these curves to be more “peaked” and to overlap less.  Then there would be much less ambiguity concerning which values of the latent trait are associated with each category of the rating scale.  To “grasp” this notion, imagine grabbing the green “3” curve at its highest point and pulling it up until its sides move closer together.  Now a much smaller portion of the latent trait is associated with a rating of 3.  When all the curves are peaked and spread over the range of the latent trait, we have a “good” item.

Comparing the Parameter Estimates from Different Items
Although one should take the time to examine the characteristic curves of each item carefully, there is an easier method for comparing items.  We can inspect the parameter estimates from which these curves were constructed, shown below.


Coefficients:
            Extrmt1  Extrmt2  Extrmt3  Extrmt4  Dscrmn
Purchase_1    -3.72    -3.20    -1.86    -0.39    1.75
Purchase_2    -3.98    -3.00    -1.11     0.29    1.72
Purchase_3    -3.57    -2.85    -1.41     0.01    1.73
Flight_1      -2.87    -2.07    -0.88     0.58    1.64
Flight_2      -3.08    -2.14    -0.64     0.95    1.56
Flight_3      -3.05    -2.15    -0.50     1.11    1.53
Flight_4      -3.94    -2.87    -1.17     0.55    1.58
Flight_5      -2.53    -1.79    -0.46     1.08    1.52
Flight_6      -2.26    -1.21     0.21     1.51    1.35
Service_1     -3.46    -2.76    -1.46     0.02    2.50
Service_2     -2.91    -2.33    -1.39    -0.07    3.13
Service_3     -2.86    -2.32    -1.17     0.24    2.45

(Cutpoints: Extrmt1 = 1 vs. 2-5, Extrmt2 = 1-2 vs. 3-5, Extrmt3 = 1-3 vs. 4-5, Extrmt4 = 1-4 vs. 5)

 
The columns labeled with the prefix “Extrmt” are the extremity parameters, the cutpoints that separate the categories as shown in the note beneath the table (e.g., 1 vs. 2-5).  This might seem confusing at first, so we will walk through it slowly.  The first column, Extrmt1, separates the bottom-box from the top four categories (1 vs. 2-5).  So, for Flight_6, anyone with a latent score of -2.26 has a 50-50 chance of assigning a rating of 1 versus a rating of 2 or higher.  And what is the latent trait score that yields a 50-50 chance of selecting 1 or 2 versus 3, 4, or 5?  Correct, the value is -1.21.  Finally, the latent score at which a respondent has a 50-50 chance of giving Flight_6 a top-box rating is 1.51.

Our item response category characteristic curves for Flight_6 show the 50% inflection point only for the two most extreme categories, ratings of one and five.  The curves for the middle categories are constructed by subtracting adjacent cumulative probabilities; that is, P(rating = k) equals P(rating ≥ k) minus P(rating ≥ k+1).  Before we leave this table, however, you should notice that the cutpoints for Service_2 all fall toward the bottom of the latent trait distribution.  As we noted before, even the cutpoint for the top-box falls just below zero (-0.07) because more than half the respondents gave a rating of five.
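To make the subtraction concrete, here is a hand computation of the five category probabilities for Flight_6 at the mean of the latent trait, using the estimates from the table above.  This is a sketch of the model's arithmetic, not output copied from ltm:

a <- 1.35                                # Dscrmn for Flight_6
b <- c(-2.26, -1.21, 0.21, 1.51)         # Extrmt1 through Extrmt4
theta <- 0                               # an average respondent
cum <- c(1, plogis(a * (theta - b)), 0)  # cumulative P(rating >= k), padded with 1 and 0
prob <- cum[1:5] - cum[2:6]              # P(rating = k) by subtracting adjacent cumulatives
round(prob, 2)                           # roughly 0.05 0.12 0.41 0.31 0.12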
What about the last column with the estimates of the discrimination parameter?  We have already noted that there is a benefit when the characteristic curves for the category levels have high peaks.  The higher the peaks, the less overlap between the category values and the greater the discrimination among the rating scores.  Thus, although the ratings for Flight_6 span the range of the latent trait, its curves are relatively flat and its discrimination is low.  Service_2, on the other hand, has a higher discrimination because its curves are more peaked, even if those curves are concentrated toward the lower end of the latent trait.

 
Item Information

“So far, so good,” as they say.  We have concluded that the items are sufficiently highly correlated to justify the estimation of a single latent trait score.  We have fit the graded response model and estimated our latent trait.  We have examined the item characteristic curves to assess the relationship between the latent trait and each item rating.  We were hoping for items with peaked curves spread across the range of the latent trait.  We were a little disappointed.  Now we will turn to the concept of item information to discover how much we can learn about the latent trait from each item.
We remember that the observed ratings are indicators of the latent variable, and each item provides some information about the underlying latent trait.  The term information is used in IRT to indicate the precision with which the latent trait is measured; the standard error of measurement equals the reciprocal of the square root of information.  Thus, a high information value is associated with a small standard error of measurement.  Unlike classical test theory with its single value of reliability, IRT does not assume that measurement precision is constant across all levels of the latent trait.  The figure below displays how well each item performs as a function of the latent trait.
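The curves come from the same fitted object, and ltm will also integrate the information function over any interval of the latent trait.  A short sketch:

plot(fit, type = "IIC")             # item information curves for all 12 items
plot(fit, type = "IIC", items = 0)  # items = 0 plots the total test information curve
information(fit, range = c(-4, 0))  # information captured below the mean
information(fit, range = c(0, 4))   # information captured above the mean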
[Figure: item information curves for the 12 ratings]
What can we conclude from this figure?  First, as we saw at the beginning of the post when we examined the distributions of the individual items, we have a ceiling effect, with most of our respondents using only the top two categories of the satisfaction scale.  This is what we observe in the item information curves: all of them begin to drop after the mean (ability = 0).  To be clear, the latent trait is estimated using all 12 items, and we get better differentiation from the 12 items together than from any single item by itself.  However, almost 5% of the respondents gave ratings of five to every item and thus obtained the highest possible score.  So we see a ceiling effect for the latent trait as well, although a smaller one than for the individual items.  Still, we could have benefited from including a few items with lower mean scores measuring features or services that are more difficult to deliver.
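That 5% figure is easy to verify from the raw ratings; a one-liner, again assuming the data frame is named data:

mean(apply(data == 5, 1, all))   # proportion of respondents rating every item a 5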
The green curve yielding the most information at low levels of the latent trait is our familiar Service_2.  Along with it are Service_1 (red) and Service_3 (blue).  The three Purchase ratings are labeled 1, 2, and 3.  The six Flight ratings (numbered 4 through 9) are the only curves providing any information in the upper ranges of the latent variable.  All of this is consistent with the item distributions: the means for all the Purchase and Service ratings are above 4.0, while most of the Flight means, though not much lower, fall below 4.0.


Summary
So that is the graded response model for a series of ratings measuring a single underlying dimension.  We wanted to be able to differentiate customers who are delighted from those who are disgusted and everyone in between.  Although we often speak about customer satisfaction as if it were a characteristic of the brand (e.g., #1 in customer satisfaction), it is not a brand attribute.  Customer satisfaction is an individual difference dimension that spans a very wide range.  We need multiple items because different portions of this continuum are defined by different content.  It is failure to deliver the basics that generates dissatisfaction, so we must include ratings tapping the basic features and services.  But it is meeting and exceeding expectations that produces the highest satisfaction levels.  As was clear from the item information curves, we failed to include such harder-to-deliver items in our battery of ratings.


Appendix with R-code

library(psych)
describe(data)                # means and SDs for the data frame holding the 12 ratings
cor(data)                     # correlation matrix
scree(data, factors = FALSE)  # scree plot
omega(data)                   # bifactor model

library(ltm)
descript(data)                # frequency tables for every item
fit <- grm(data)              # fit the graded response model
fit                           # print cutpoints and discrimination parameters

plot(fit)                     # item response category characteristic curves
plot(fit, type = "IIC")       # item information curves

# The next two lines calculate the latent trait scores and assign them to a variable named trait
pattern <- factor.scores(fit, resp.patterns = data)
trait <- pattern$score.dat$z1
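Because measurement precision varies across the latent trait, it may also be worth pulling the standard errors that factor.scores returns alongside the trait estimates; if I recall the ltm output correctly, they sit in a column named se.z1:

se <- pattern$score.dat$se.z1   # standard error of each respondent's trait estimate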

Comments:

  1. Dear Joel,

    I find your posts on Customer Satisfaction (CS) and IRT models very interesting. I'm a researcher currently working on the very same topic; specifically, I'm applying multidimensional IRT (MIRT) models (compensatory, for now) to CS data. A paper is under review that discusses the interpretation and use of 1PL, 2PL, and M2PL models for CS analyses, focusing on item parameters and their possible role in improvement planning from the service provider's side.

    At the moment I'm working on a multidimensional GPCM; do you have any advice or related papers to suggest?

    Kindest regards,

    Federico Andreis, PhD
    Università degli Studi di Milano
    federico.andreis@unimi.it

    1. Thank you for your kind comment. Perhaps I too quickly dismissed MIRT in the first paragraph of my post. However, the reference to Phil Chalmers' mirt R package is my recommendation as the place to start. The generalized partial credit model is one of the IRT models that Phil covers in some detail.

      Have you considered the extended Rasch model (e.g., the eRm R package)? I am making this suggestion because of your interest in the role of item parameters in improving services. I have found that specific factors from the bifactor model can be associated with mean level differences in the item ratings (e.g., the flight attributes tend to be rated lower than the service attributes). In such cases, diagnostic information comes from explaining item difficulty and not from the multidimensionality of the latent trait. A battery of customer satisfaction ratings can be sorted into categories not unlike the cells of an ANOVA design. The eRm package is designed to decompose item difficulty and thus provide diagnostic information.

  2. Thanks for your answer, Joel!

    I have used the eRm package, but that was back when I was working on different topics and just for the sake of double-checking estimation. I will get back to it.

    I've noticed a strong correlation between factor analysis loadings and discrimination parameters in IRT models (and later found confirmation in the literature; see Takane & de Leeuw, 1987, "On the relationship between item response theory and factor analysis of discretized variables"). This helped provide more insight and a point of contact and comparison with classic methods employed in CS.

    I'm studying the mirt package by Phil Chalmers right now to see if it can help speed up estimation (my MH-within-Gibbs algorithms are far less efficient than MH-RM, I fear) and provide a nice way to obtain diagnostics.

    1. Hey Federico,

      The current version of mirt on CRAN is terribly slow at estimating the gpcm; however, the development version on github is approximately 40x faster for the EM and about 15-20x faster for the MH-RM, so I'd recommend installing from source (I won't be releasing a new version till February).

      Also, mirt now has the ability to estimate models similar to those in the eRm package (e.g., the LLTM) for modelling item difficulties directly, while also allowing for item-level slopes and person-level covariates. This is done through the new mixedmirt() function, which at some point will also support multilevel IRT models (hence the inspiration for the name, 'mixed effects mirt'). Hope that helps, and all the best!

      Phil

      P.S. Thanks for the publicity @Joel! Very kind of you.

    2. Thanks a lot, Phil!

      And may I say, it's a pleasure to get in touch with you, I do really appreciate your work!

      I'll check out the github version..

      (How did you get it to be 40x and 15-20x faster? Mere code optimization, or did you rewrite the routines in a lower-level language?)

      /f

      Yes, I moved the computation of the Hessian and gradients into C++ code via Rcpp, which was the main bottleneck. That's actually how a lot of mirt is sped up: about 5% of the package code is written in C++ while the rest is in R for convenience and maintenance. I think it's a great 1-2 punch, and the way Rcpp (and family) is set up is really awesome and convenient to work with in package development. Cheers!

      Phil
