A good illustration is the work done over the last decade by the National Institutes of Health (NIH) on the Patient-Reported Outcomes Measurement Information System (PROMIS). For example, when measuring the upper extremity physical function of children, they get very specific and ask about the occurrence of everyday activities: undoing Velcro, using a mouse with the computer, buttoning shirts, pouring liquids from a pitcher, and cutting paper with scissors. In order to construct these scales and analyze such data, PROMIS has turned to Item Response Theory (IRT). Neither dichotomous (checklist) nor ordinal (frequency of occurrence) responses can be analyzed with statistical procedures, like factor analysis, as if they were continuous.
OK,
so you get the first part of the title, but what about “Developing your
intuition?” This sounds like it belongs in a self-help book. But I have found
that it is very difficult to learn item response theory unless you understand
the motivation behind it. Perhaps it is because IRT is not a single statistical
model, but a family of increasingly complex models and estimation techniques. At
times it seems like one is reading an old encyclopedia entry with heading after
heading dealing with one more complex topic after another. There are
one-parameter, two-parameter, and three-parameter models. Then, there are
polytomous items, and nonparametric models, and linear logistic test models,
and all the different estimation techniques (including Bayesian IRT). And when
you think you have got it, someone tells you about multidimensional IRT. But placing
measurement on a firm foundation is too important for us to wait. So let’s see
if we can outline a framework that makes IRT seem reasonable and within which
we can begin to place all the different models and approaches.
Measurement Scales Derived from Relative Comparisons
Harder
substances scratch softer substances, as every elementary student is taught.
Given an unknown substance, I can evaluate its relative hardness by attempting
to scratch it with an ordered set of standards of increasing hardness (e.g.,
gypsum < quartz < diamond). This is the rationale for the Mohs scale. It yields an
ordinal scale because all I know is relative hardness.
How
about physical fitness? Could we not identify a set of standardized physical tasks
of increasing difficulty and measure a person’s physical fitness as the most
difficult task they were able to pass? This is the idea underlying the Guttman scale. If I can order the tasks, then a person will
pass every task until they fail and then not pass any subsequent tasks. The only problem is that human behavior is
variable, so that it is possible for us to see random variation and inconsistent performance. By replacing the deterministic Guttman scale with a
probabilistic response, we can deal with random variation and focus on the likelihood of passing. This is the approach taken by item response theory.
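To make the contrast concrete, here is a minimal sketch in R. The three task difficulties and the respondent's level are made-up numbers on an arbitrary scale, and the function names are mine, not part of any package. The Guttman rule gives a hard pass/fail; the logistic curve gives a probability of passing that falls as the task gets harder.

```r
# Hypothetical task difficulties on an arbitrary latent scale (not from the post)
difficulty <- c(-1, 0, 1)

# Deterministic Guttman rule: pass every task at or below one's own level
guttman <- function(theta, b) as.numeric(theta >= b)

# Probabilistic alternative: logistic curve centered at the task difficulty
logistic <- function(theta, b) plogis(theta - b)

theta <- 0.5                            # hypothetical respondent
guttman(theta, difficulty)              # 1 1 0
round(logistic(theta, difficulty), 2)   # 0.82 0.62 0.38
```

The respondent is still most likely to pass the easy task and least likely to pass the hard one, but an occasional "inconsistent" response is no longer impossible.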
Ultimately,
the goal is to get both criterion-referenced and norm-referenced measurements.
If we include physical tasks that have real world implications (e.g., walking
up stairs, lifting heavy luggage into an overhead bin, running for a cab), we
will know something about what the person can and cannot do
(criterion-referenced). In addition, we have learned where the person places
relative to others in the sample (norm-referenced).
But
since I am a marketing researcher, perhaps we could substitute brand strength
for physical fitness and talk about brand equity or brand health. Specifically,
a strong brand or a healthy brand should be able to pass several tests. It
should have a favorable image, be in the consideration set during purchase
deliberations, be bought, have satisfied customers, and get recommended by its
users.
It’s All in the Response Pattern Matrix
I
have generated some data consistent with the rank ordering of these five brand
tests in order to show you the R code and the output from an item response
model. I have provided all the R code
needed to generate some data and run the analysis at the end of the post in an
appendix. In addition, here is a link to
a Journal of Statistical Software article by Dimitris Rizopoulos. He works through all the same code when
analyzing five items from the LSAT (Section 3.1). I deliberately created a “matching” example
so that you could see two worked examples from different perspectives. Rizopoulos more fully discusses the code and
the output, while I try to "develop your intuition."
First, let us look at the frequency table for 200 respondents who gave Yes/No answers to the five
brand strength tests. There are five binary variables, so there are 32 possible
response patterns. We only see 21 patterns because the brand tests are not
independent. That is, if we had seen all 32 possible combinations with equal or
balanced cell frequencies, we would have concluded that the 5 tests were independent
and would have stopped our analysis. If it helps, you can think of this response pattern
matrix as a 2x2x2x2x2 factorial design or as a contingency table of the same
dimensions.
Pattern | Favorable | Consider | Purchase | Satisfied | Recommend | Percent | Total Score | Latent Score
1 | 0 | 0 | 0 | 0 | 0 | 18.0% | 0 | -1.07
2 | 0 | 0 | 1 | 0 | 0 | 4.5% | 1 | -0.50
3 | 0 | 1 | 0 | 0 | 0 | 5.0% | 1 | -0.50
4 | 1 | 0 | 0 | 0 | 0 | 13.5% | 1 | -0.50
5 | 0 | 1 | 0 | 1 | 0 | 0.5% | 2 | -0.08
6 | 0 | 1 | 1 | 0 | 0 | 4.0% | 2 | -0.08
7 | 1 | 0 | 0 | 0 | 1 | 1.5% | 2 | -0.08
8 | 1 | 0 | 0 | 1 | 0 | 1.0% | 2 | -0.08
9 | 1 | 0 | 1 | 0 | 0 | 2.5% | 2 | -0.08
10 | 1 | 1 | 0 | 0 | 0 | 5.0% | 2 | -0.08
11 | 0 | 1 | 1 | 1 | 0 | 0.5% | 3 | 0.31
12 | 1 | 0 | 0 | 1 | 1 | 0.5% | 3 | 0.31
13 | 1 | 0 | 1 | 1 | 0 | 3.5% | 3 | 0.31
14 | 1 | 1 | 0 | 0 | 1 | 2.0% | 3 | 0.31
15 | 1 | 1 | 0 | 1 | 0 | 4.0% | 3 | 0.31
16 | 1 | 1 | 1 | 0 | 0 | 6.0% | 3 | 0.31
17 | 1 | 0 | 1 | 1 | 1 | 0.5% | 4 | 0.72
18 | 1 | 1 | 0 | 1 | 1 | 1.5% | 4 | 0.72
19 | 1 | 1 | 1 | 0 | 1 | 3.5% | 4 | 0.72
20 | 1 | 1 | 1 | 1 | 0 | 7.5% | 4 | 0.72
21 | 1 | 1 | 1 | 1 | 1 | 15.0% | 5 | 1.26
Items
vary in their difficulty, and as a result, are best at measuring individual
differences near their location on the scale. On the one hand, easy items with lots
of respondents saying “Yes” separate individuals at the lower end of the scale.
On the other hand, difficult items with lots of respondents saying “No”
separate individuals at the higher end of the scale.
Before
we proceed, we need to assure ourselves that the five tests are all tapping one underlying dimension. Obviously, the tests were selected because we
believed that they were all measures of brand strength. Strong brands deliver
consistent value that customers are willing to pay for. When selecting the
tests, I was thinking of the purchase funnel, a theory about the steps in the
purchase process. Brands with favorable images tend to make it into the
consideration set, but they are not always purchased. Not everyone who buys is
satisfied with their purchase. Even satisfied customers don’t
always recommend. It is this “funneling process” that makes each of the steps
increasingly difficult for the brand to pass.
So,
I have a solid theoretical basis for believing that these tests tap the same
underlying individual difference dimension. Why the stress on the “individual difference”
modifier? The brand strength that we are
concerned with varies over customers. This
can be confusing because at times brand strength is measured over several
different brands and used as a brand characteristic. Perhaps you need to recall your experimental
design class where you studied two types of designs, between-subject and
within-subject, and their combination as mixed designs. There are many research areas where we gather
lots of data from subjects with the intent to learn something about
how the individual operates. In IRT we
have the extended Rasch model (eRm package), where the items are systematically
varied according to a design and the effects of item features estimated. But that is not what we are doing here. We are looking at perceptions of a single brand. Some respondents have a favorable opinion of
the brand, and others do not. Brand
strength is the dimension that differentiates the favorable respondents from the
not favorable respondents.
Now,
what about empirical evidence for the unidimensionality of these five brand
strength tests? Principal component analysis should
help. The first principal component
accounts for over 50% of the total variation and is 3.5 times the size of the
second principal component. There are more tests that the ltm package will run (e.g.
modified parallel analysis), but the magnitude of the first principal component is probably good enough for this example.
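If you want to see where numbers like these come from, the following sketch uses only base R. It does not use my data. Instead it simulates 0/1 answers from a single latent trait with made-up locations and a made-up common slope, then looks at the eigenvalues of the correlation matrix. The seed, sample size, and parameter values are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Simulate 0/1 answers from one latent trait (all values here are illustrative)
n     <- 1000
theta <- rnorm(n)                       # latent brand strength
loc   <- c(-0.6, -0.2, 0.1, 0.5, 0.9)   # hypothetical item locations
slope <- 2                              # hypothetical common slope
p     <- plogis(slope * outer(theta, loc, "-"))
items <- matrix(rbinom(length(p), 1, p), nrow = n)
colnames(items) <- c("Favorable", "Consider", "Purchase", "Satisfied", "Recommend")

# Eigenvalues of the correlation matrix are the principal component variances
ev <- eigen(cor(items))$values
round(ev / sum(ev), 2)   # share of total variation per component
ev[1] / ev[2]            # size of the first component relative to the second
```

When one dimension drives all five answers, the first eigenvalue stands well apart from the rest, which is the pattern described above.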
We
can also get a sense of the data by looking at the above table. It is ordered
by difficulty of the item and by the latent trait score associated with each
response pattern. The last column is the latent trait score, an index of brand strength associated with each response profile. The adjacent column is the
total score calculated as the number of Yes’s to the five brand tests. As one moves from left to right across the table,
the number of Yes’s (=1) decreases. The percentage of Yes’s falls from 67.5% for favorable to
54.5% to 47.5% to 34.5% to 24.5% for recommend. As one moves down the table, brand strength
increases, as does the number of Yes’s in each row.
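The column percentages quoted above can be recovered directly from the response pattern table. The sketch below transcribes the 21 patterns and converts each percent to a count out of 200 respondents; the object names are mine.

```r
# Response patterns and counts transcribed from the table above (200 respondents)
patterns <- matrix(c(
  0,0,0,0,0,  0,0,1,0,0,  0,1,0,0,0,  1,0,0,0,0,  0,1,0,1,0,  0,1,1,0,0,
  1,0,0,0,1,  1,0,0,1,0,  1,0,1,0,0,  1,1,0,0,0,  0,1,1,1,0,  1,0,0,1,1,
  1,0,1,1,0,  1,1,0,0,1,  1,1,0,1,0,  1,1,1,0,0,  1,0,1,1,1,  1,1,0,1,1,
  1,1,1,0,1,  1,1,1,1,0,  1,1,1,1,1),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("Favorable", "Consider", "Purchase", "Satisfied", "Recommend")))
count <- c(36, 9, 10, 27, 1, 8, 3, 2, 5, 10, 1, 1, 7, 4, 8, 12, 1, 3, 7, 15, 30)

# Percent "Yes" per test: weight each pattern (row) by its count
pct_yes <- 100 * colSums(patterns * count) / sum(count)
pct_yes
# Favorable 67.5, Consider 54.5, Purchase 47.5, Satisfied 34.5, Recommend 24.5

# Percent of respondents at each total score (number of Yes's)
pct_total <- 100 * tapply(count, rowSums(patterns), sum) / sum(count)
pct_total
# 0: 18.0, 1: 23.0, 2: 14.5, 3: 16.5, 4: 13.0, 5: 15.0
```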
Allow
me to deal with one side issue. If this were a “true” funnel, I would fix the
order of the questions from favorable image to recommendation and screen
respondents so that no one who did not purchase would be asked about
satisfaction. But I wanted to show the type of response patterns that you tend
to see in survey data. As you can see in the R code at the end of this post, I
generated random data as if this were a checklist with order randomized
separately for each respondent and no branching or screening between items. For
example, look at row 15 with 4% or 8 respondents. These 8 respondents indicate
that they would be satisfied with the brand, but would not purchase it. Is this
an inconsistent response pattern? Is it random variation? Or perhaps the brand
is too expensive for them? I would be satisfied driving a luxury car (check
“Yes” under Satisfied), if I could afford it (check “No” under Purchase). If it is the case that Purchase measures affordability, in addition to brand strength, then we might wish to remove that item from our scale.
The
key to achieving an intuitive sense of what IRT is trying to accomplish is
the ability to “see” the relationship between the response pattern matrix and
the underlying latent trait. Once you get that, it becomes much easier. Items
tap different locations along the latent trait. How do we know that? Fewer
respondents give positive responses to more severe tests of the latent trait.
You need to “picture” the response pattern matrix and see the number of 0’s
increasing and the number of 1’s decreasing. Look at the above table again. I
have used yellow and green coloring so that you get the sense that the green is
the valley toward which “Yes” responses flow.
But
don’t forget that there are two parts: items and persons. Latent traits are
dimensions of individual differences. How are respondents separated by their response
patterns? Can you see it in the response pattern matrix? What happens as you
move down the table? More and more respondents begin to say “Yes” to the more
stringent tests of brand strength. What does it mean for a respondent to
perceive a brand as being a strong brand? They give more Yes’s, true, but lots
of respondents say “Yes” to the easy tests. Only those respondents with the most
positive brand perceptions say “Yes” to the hardest tests for a brand to pass.
Output from an Item Response Theory Analysis
If
you recall, my intent was to develop your intuition and not review all of IRT.
Keeping with that goal, I will only run one type of IRT model and then try
to relate the output to the response pattern matrix.
We
will use one of the many IRT packages in R. I selected the latent trait
model package, ltm, because it is both comprehensive
and relatively easy to use. More importantly, Dimitris Rizopoulos has gone out
of his way to provide extensive support, both in articles and in presentations.
He has written a lot, and it is all worth reading.
We
will run one of the more common IRT models, the two-parameter logistic model.
To understand what the two parameters are, we look at the item characteristic
curves. These are logistic functions, one for each of the 5 brand tests. Each
one shows the likelihood of checking "Yes" as a function of the
respondent's score on the underlying latent variable, called ability for
historical reasons (the first applications were in educational testing).
The
curves follow the same ordering that we saw in the response pattern matrix with
favorable (1) < consider (2) < purchase (3) < satisfied (4) <
recommend (5). They are defined by their location and slope.
Item | Brand Test | Location | Slope
Item 1 | Favorable | -0.57 | 2.36
Item 2 | Consider | -0.15 | 1.98
Item 3 | Purchase | 0.09 | 1.65
Item 4 | Satisfied | 0.46 | 3.30
Item 5 | Recommend | 0.83 | 2.65
The
location estimates nicely separate the five brand tests. They indicate the
score on the latent variable that would yield a 50-50 chance of saying
"Yes" to each item. As we noted when we looked at the response
pattern matrix, these five brand tests do not do a good job of differentiating
respondents at the very top or bottom of the scale. You should remember that
18% gave the lowest possible responses and 15% gave the highest possible
responses. We see this in the location parameters, which range only from -0.57 to
0.83. We would need to add easier items
(location below -0.57) and harder items (location above 0.83) in order to
differentiate among respondents within these two large groups.
I should note that I am interpreting these location parameters as if
they were z-scores with a mean of 0 and a standard deviation of 1. Although one always needs to check the distribution of latent scores in their particular study, this is not a bad “rule of thumb” when you
have many items and a normally distributed latent variable.
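The 50-50 interpretation of the location parameter is easy to check numerically. The sketch below assumes the curves have the usual two-parameter logistic form, with the probability of "Yes" equal to the logistic function of slope times (latent score minus location). That parameterization is my assumption about how the estimates in the table are reported, and the function name icc is made up for this example.

```r
# Two-parameter logistic curve; this location/slope form is an assumption
# about how the estimates in the table above are parameterized
icc <- function(theta, location, slope) plogis(slope * (theta - location))

location <- c(Favorable = -0.57, Consider = -0.15, Purchase = 0.09,
              Satisfied = 0.46, Recommend = 0.83)
slope    <- c(2.36, 1.98, 1.65, 3.30, 2.65)

# At its own location every item has a 50-50 chance of "Yes"
icc(location, location, slope)

# Probability of "Yes" for respondents one unit below and one unit above average
round(sapply(c(low = -1, high = 1), icc, location = location, slope = slope), 2)
```

The second call gives a feel for how little the five tests separate respondents once you move a full unit away from the center of the scale.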
The slope parameters indicate how
quickly the probability of saying "Yes" changes as a function of
changes in the latent trait. Item 4 measuring Satisfaction has the steepest
slope. In fact, its steepness causes the item curves to cross, meaning that
there is a range of the latent trait where the likelihood of saying "Yes" to Satisfied
(4) is greater than the likelihood of saying "Yes" to Purchase (3). This is not a happy result. We would like for the five logistic curves to
be parallel (same slope) or at least for them not to overlap. That is, we would like for Satisfaction to be
a more severe test of brand strength than Purchase for everyone regardless of
their level of brand strength.
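Under the same assumed logistic form (probability of "Yes" equal to the logistic function of slope times latent score minus location), we can work out where the Purchase and Satisfied curves cross. The two curves meet where their linear predictors are equal, so a little algebra gives the crossing point.

```r
# Estimates copied from the two-parameter table above
b <- c(Purchase = 0.09, Satisfied = 0.46)   # locations
a <- c(Purchase = 1.65, Satisfied = 3.30)   # slopes

# The curves meet where a3 * (theta - b3) equals a4 * (theta - b4)
theta_cross <- unname((a["Satisfied"] * b["Satisfied"] - a["Purchase"] * b["Purchase"]) /
                      (a["Satisfied"] - a["Purchase"]))
round(theta_cross, 2)   # 0.83

# Above the crossing point "Satisfied" is the more likely "Yes"
p_above <- plogis(a * (1.2 - b))
p_above
```

So, if this parameterization holds, the reversal is confined to respondents with latent scores above roughly 0.83, near the top of the range that these five tests cover.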
Perhaps these differences in slope are
due to random variation? What if we
constrained the five slopes to equal the same value, would we be able to
reproduce our observed response pattern matrix as well as when we allow the
slopes to be different? This calls for a
likelihood ratio test using the anova() function. With four degrees of
freedom (i.e., five different slope estimates reduced to one common slope), we find a p-value of 0.232, so we cannot reject the hypothesis that the slopes are equal. The
location estimates change little when the slopes are constrained to be equal.
Item | Brand Test | Location | Slope
Item 1 | Favorable | -0.58 | 2.25
Item 2 | Consider | -0.16 | 2.25
Item 3 | Purchase | 0.07 | 2.25
Item 4 | Satisfied | 0.49 | 2.25
Item 5 | Recommend | 0.87 | 2.25
The
common slope seems a reasonable compromise, and the logistic curves are now
parallel.
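For readers who want to connect the reported p-value to the likelihood ratio test itself, here is a small sketch using only base R. It does not refit the models. It simply backs out the chi-square statistic implied by a p-value of 0.232 on four degrees of freedom and compares it with the conventional 5% cutoff, which is my choice of threshold.

```r
# Four extra parameters: five free slopes versus one common slope
df <- 5 - 1

# Chi-square statistic implied by the reported p-value of 0.232
lr_stat <- qchisq(0.232, df = df, lower.tail = FALSE)

# Going the other way recovers the p-value from the statistic
pchisq(lr_stat, df = df, lower.tail = FALSE)   # 0.232

# The statistic falls short of the 5% critical value, so equal slopes are retained
lr_stat < qchisq(0.95, df = df)   # TRUE
```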
Respondents and Items on the Same Scale
Our
goal from the beginning was to find some way to combine the five separate brand
strength tests into a single index.
Looking at the response pattern matrix, we knew that we had only 21 of
the 32 possible combinations of the five dichotomous Yes/No tests. Each of the 21 response profiles could have yielded
a different latent variable score. Had
we rejected the hypothesis that the five slope parameters are equal, we would
have had 21 different latent trait scores.
This is an important point. When
you look back to the response pattern matrix, you see all 21 observed response
profiles along with the total score (number of Yes’s) and the latent trait
score from the IRT model with all the slopes constrained to be equal.
There are duplicate latent scores for different response patterns because all five tests have equal slopes. In fact, there is a unique latent score for each total score. In this particular case, the relationship between the total score and the latent score appears to be linear. However, this will not generally occur, and the relationship between the observed total score and the unobserved latent score is often not linear.
The slope indicates how well the item is able to discriminate among the respondents. A flat slope tells us that the probability of saying "Yes" changes slowly with increases or decreases in the latent variable. A steep slope shows that the likelihood of "Yes" changes quickly over a small interval of the latent trait. The contribution that each item makes to the latent score depends on its slope (see section on scoring in this Wikipedia entry).
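A quick way to see how the slope affects scoring: in the two-parameter logistic model, what the data tell us about a respondent's latent score runs through the slope-weighted sum of their answers rather than the simple count of Yes's. The sketch below takes the slopes from the two-parameter table and two response patterns from the response pattern matrix that share a total score of 2. The object names are mine.

```r
# Slopes from the two-parameter table above
slope <- c(Favorable = 2.36, Consider = 1.98, Purchase = 1.65,
           Satisfied = 3.30, Recommend = 2.65)

# Two response patterns from the table, each with a total score of 2
fav_purchase  <- c(1, 0, 1, 0, 0)
fav_satisfied <- c(1, 0, 0, 1, 0)

# With unequal slopes the two patterns carry different weight
sum(slope * fav_purchase)    # 4.01
sum(slope * fav_satisfied)   # 5.66

# With a common slope (2.25) both reduce to the same multiple of the total score
sum(2.25 * fav_purchase) == sum(2.25 * fav_satisfied)   # TRUE
```

This is why equal slopes produce one latent score per total score, while unequal slopes would give every response pattern its own latent score.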
Respondents get latent scores, and items get latent scores too. Both items and respondents can be placed on the same scale. Had we more items and had those additional items filled in the gaps on both the lower and upper ends of the scale, our latent scores would have looked more like z-scores and more like the distribution of the underlying latent variable.
The table below summarizes these results and shows where the five tests fall in relation to the respondents.
Total Score | Latent Score | Percent | Item Score | Item
0 | -1.07 | 18.0% | |
 | | | -0.58 | Favorable
1 | -0.50 | 23.0% | |
 | | | -0.16 | Consider
2 | -0.08 | 14.5% | |
 | | | 0.07 | Purchase
3 | 0.31 | 16.5% | |
 | | | 0.49 | Satisfied
4 | 0.72 | 13.0% | |
 | | | 0.87 | Recommend
5 | 1.26 | 15.0% | |
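Because respondents and items sit on the same scale, we can ask how likely a respondent at each total score is to say "Yes" to each test. The sketch below copies the latent scores, item locations, and common slope from the tables above. The logistic form (slope times latent score minus location) is again my assumption about the parameterization.

```r
# Values copied from the tables above (common-slope model)
latent <- c("0" = -1.07, "1" = -0.50, "2" = -0.08, "3" = 0.31, "4" = 0.72, "5" = 1.26)
item   <- c(Favorable = -0.58, Consider = -0.16, Purchase = 0.07,
            Satisfied = 0.49, Recommend = 0.87)
slope  <- 2.25

# Rows are total scores, columns are items; the logistic form is an assumption
p_yes <- plogis(slope * outer(latent, item, "-"))
round(p_yes, 2)

# A respondent is more likely than not to pass items located below their score
p_yes > 0.5

# How close to linear is the relationship between total and latent score?
cor(0:5, latent)
```

Reading across a row of the first matrix shows the funnel for one type of respondent. Reading down a column shows one item's curve evaluated at the six latent scores.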
Conclusions: Going beyond the Mathematics
Individual Difference Dimensions. The dimensions that IRT uncovers
are between-person. Different respondents see the same brand differently. We can aggregate respondents and calculate the "brand strength" of many different brands. But now the analysis ignores individual differences among the respondents and focuses on the brands. It is important to maintain the distinction between brand-level and individual-level analyses.
Moreover, it is so easy to forget that what differentiates people at the lower end of the scale may not be what differentiates people at the upper end of the scale. Mathematical proficiency is assessed using arithmetic problems in the lower grades and algebra in the higher grades. As we have seen, different brand tests are necessary to measure low and high levels of brand strength.
Response Pattern Matrix. Although we can look item by item, we learn much more about
individuals when we examine their pattern of responses across an array of items
tapping different portions of the underlying trait. If it helps, think of it as a form of triangulation. We also learn about the items. Sometimes what we learn is that one
or more items simply do not belong in the dimension and need to be removed. Often we learn that we have not adequately measured the entire range of our latent construct and need additional items, especially at the upper and lower ends of the scale.
Of course, the response pattern matrix becomes unwieldy as the number of items increases. For example, with 10 items the number of possible combinations is 1,024, simply too many to be of any help. However, regardless of the IRT model or the number of items, you will be able to interpret the results because you have an intuitive understanding of the connection between the response patterns and the IRT parameters.
Criterion-referenced interpretation. I can do more with your response
profile than just locate your performance in comparison to others, although
such norm-referenced information is important. If I am careful, I can select
brand tests corresponding to milestones that I wish to achieve. Knowing how
favorable my brand perceptions are is valuable on its own because it is
diagnostic: it might explain why the brand is not considered
during purchase. What moves a customer at the low end of the brand strength
scale is likely to be different from what moves a customer at the upper end of
the same scale.
Appendix: R code to generate data and run ltm
# use the orddata package to generate random ordinal data
library(orddata)

# marginal probabilities of No/Yes for each brand test
# (these determine the item locations)
prob <- list(
  c(35, 65) / 100,
  c(45, 55) / 100,
  c(55, 45) / 100,
  c(65, 35) / 100,
  c(75, 25) / 100
)

# loading of each brand test on the single latent dimension
# (these determine the slopes of the logistic curves)
loadings <- matrix(rep(0.6, 5), nrow = 5, ncol = 1)

# creates the correlation matrix used as input
cor_matrix <- loadings %*% t(loadings)
diag(cor_matrix) <- 1

# generates 200 random ordinal observations
ord <- rmvord(n = 200, probs = prob, Cor = cor_matrix)

# eigenvalues from the principal component analysis
library(psych)
principal(ord, nfactors = 1)$values

library(ltm)

# recode the two response categories to 0 (No) and 1 (Yes)
ord <- ord - 1
descript(ord)

# likelihood ratio test: common slope versus item-specific slopes
anova(rasch(ord), ltm(ord ~ z1))

# two-parameter logistic model
fit <- ltm(ord ~ z1)
summary(fit)

# item characteristic curves
plot(fit)

# calculates latent trait scores for each response pattern
pattern <- factor.scores(fit)

# constrains slopes to be equal
fit2 <- rasch(ord)
plot(fit2)
summary(fit2)
scores2 <- factor.scores(fit2)