Thursday, December 19, 2013

Latent Variable Mixture Models (LVMM): Decomposing Heterogeneity into Type and Intensity

Adding features to a product can be costly, so brands have an incentive to include only those features most likely to increase demand. In the last two posts (first link and second link), I have recommended what could be called a "features stress test" that included both a data collection procedure and some suggestions for how to analyze that data.

Although the proposed analysis will work with any rating scale, one should consider replacing the traditional importance measure with behaviorally-anchored categories. That is, we discontinue the importance ratings with their hard-to-know-what-is-meant-by endpoints of very important and very unimportant, and we substitute a sequence of increasingly demanding actions that require consumers to decide how much they are willing to do or sacrifice in order to learn about or obtain a product with the additional feature. For example, as outlined in the first link, respondents choose 1=not interested, 2=nice to have, 3=tie-breaker, or 4=pay more for (an ordinal scale suitable for scaling with item response theory). The modifier "stress" suggests that the possible actions can be made more and more extreme until all but the most desired features fail to pass the test (e.g., "would pay considerably more for" instead of "pay more for"). The resulting data enable us to compare the impact of different features across consumers, provided that the same feature prioritization holds for everyone.

To be clear, our focus is on the features, and consumers are simply the measuring instrument. What is the impact of Feature A on purchase interest? We ask Consumer #1, and then Consumer #2 and so on. Since every consumer rates every feature, we can allow our consumers to hold varying standards of comparison, as long as all the features rise or fall together. Thus, it does not concern us if some of our consumers are more involved in the product category and report uniformly greater interest in all the features. Interestingly, although our focus was the feature, we learn something about consumer heterogeneity, which could be useful in future targeting.

Mixture of Customer Types with Different Feature Priorities

Our problem, of course, is that feature impact depends on consumer needs and desires. We cannot simply assume that there is one common feature prioritization that is shared by all. We may not wish to go so far as to suggest a unique feature customization for every customer, but certainly we are likely to find a few segments wanting different feature configurations. As a result, my sample of respondents is not uniform but consists of a mixture or composite of two or more customer types with different feature priorities.

The last two posts with links given above have provided some detail outlining why I believe that rating scales are ordinal and why polytomous item response theory (IRT) provides useful models of the response generation process. I have tried in those posts to provide a gentler introduction to finite mixtures of IRT models, encouraging a more exploratory two-step process of k-means on person-centered scores followed by graded response modeling for each cluster uncovered.

The claim made by a mixture model is that every respondent belongs to a latent class with its own feature prioritization. Yet, we observe only the feature ratings. However, as I showed in the last post, those ratings contain enough information to identify the respondent's latent class and the profile of feature impact for each of the latent classes. Now for the caveat: the ratings contain sufficient information only if we assume that our data are generated as a finite mixture of some small number of unidimensional polytomous IRT models. Fortunately, we know quite a bit about consumer judgment and decision making so that we have some justification for our assumptions other than that the models seem to fit.

The R package mixRasch Does It Simultaneously

Yes, R can recover the latent class as well as provide person and item estimates with one function called mixRasch (the same name is used for both the function and the package). If my ratings were a binary yes/no or agree/disagree, I would have many more R packages available for the analysis (see Section 2.4 for an overview of mixture IRT in R).

The mixRasch() function is straightforward. You tell it the data, the maximum number of iterations, the number of steps or thresholds, the IRT model, and the number of latent classes:

mixRasch(ratings, max.iter=500, steps=3, model="PCM", n.c=2)

The R code to generate the data mixing two groups of respondents with different feature priorities can be found in the last post. The appendix at the end of this post lists the additional R code needed to run mixRasch. The number of steps or thresholds is one less than the number of categories. We will be using the partial credit model (PCM), which behaves in practice much like the graded response model, although the category contrasts are not the same and there is that constant slope common to Rasch models. Of course, there is a lot more to be concerned about when using mixRasch and joint maximum likelihood estimation, and perhaps there will be time in a later post to discuss all that can go wrong.  For now, we will look at the output to discover if we have learned anything about different types of feature prioritization and varying levels of intensity with which consumers want those features.
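To make the "steps" idea concrete, here is a minimal sketch of the category probabilities for a single item under the textbook partial credit model. The function name pcm_prob and the step difficulties are hypothetical and chosen only for illustration; this is not mixRasch's internal code, and its parameterization may differ.

```r
# Category probabilities for one item under a textbook partial credit model.
# theta: person location; delta: step difficulties, one per threshold
# (so 3 steps for a 4-category rating scale).
pcm_prob <- function(theta, delta) {
  # running sums of (theta - delta_j); the lowest category has an empty sum of 0
  numer <- exp(c(0, cumsum(theta - delta)))
  numer / sum(numer)
}

# hypothetical step difficulties for one 4-category feature
delta <- c(-1, 0, 1)

p_low  <- pcm_prob(-2, delta)  # respondent with little interest
p_mid  <- pcm_prob( 0, delta)  # average respondent
p_high <- pcm_prob( 2, delta)  # highly involved respondent

# rows are respondents, columns are the four ordered categories
round(rbind(p_low, p_mid, p_high), 3)
```

With three steps there are four categories, which is why steps=3 is passed for the 4-point scale. As theta increases, the probability mass moves from the lowest toward the highest category.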

My example uses nine undefined features divided into three sets of three features each.  The first three features have little appeal to anyone in the sample. Consumer heterogeneity is confined to the last six features. The 200 respondents belong to one of two groups:  the first 100 whose preferences follow the ranking of the features from one to nine and the second 100 who preferred most the middle three features. The details can be found in that much-referenced previous post.

Although I deliberately wanted to keep the example abstract, you can personalize it with any product category of your choice. For example, banking customers can be split into online and in-branch types. The in-branch customer wants nearby branches with lots of helpful personnel. The online customer wants free bill-paying and mobile apps. Both types vary in their usage intensity and their product involvement, so that we expect to see differences within each type reflected in the size of the average rating across all the features. If you don't like the banking example, you can substitute gaming or dining or sports or kitchen appliances or just about any product category.

The output from the mixRasch function is a long list with elements needing to be extracted.  First, we want to know the latent class membership for our 200 respondents. It is not all-or-none but a probability of class membership summing to one. For example, the first respondent has a 0.80 probability of belonging to the first latent class and a 0.20 probability of being from the second latent class. This information can be found in the list element $class for every respondent except those giving all the features either the highest or lowest scores (e.g., our respondent in row 155 rated every feature with a four and row 159 contained all ones). If we use the maximum probability to classify respondents into mutually exclusive latent classes, the mixRasch function correctly identifies 84% of the respondents (we only know this because we randomly simulated the data). I should mention that the classification from the mixture Rasch model is not identical to the row-centered k-means from the last post, but there is 92% agreement for this particular example.
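The step from membership probabilities to a hard classification can be sketched with a tiny made-up matrix. Everything below is hypothetical: the probabilities and the "true" segments are invented for illustration, and I am assuming that respondents who cannot be classified appear as rows of NA, which may not be how mixRasch actually stores them.

```r
# made-up membership probabilities for five respondents and two classes
post <- rbind(c(0.80, 0.20),
              c(0.35, 0.65),
              c(0.55, 0.45),
              c(0.10, 0.90),
              c(NA,   NA))    # stand-in for an all-ones or all-fours respondent

# classify by maximum probability, leaving unclassifiable rows as NA
ok <- complete.cases(post)
hard <- rep(NA_integer_, nrow(post))
hard[ok] <- max.col(post[ok, , drop = FALSE], ties.method = "first")

# hypothetical true segments, available only because the data are simulated
truth <- c(1, 2, 2, 2, 1)

# class labels are arbitrary, so score both labelings and keep the better one
agree <- max(mean(hard[ok] == truth[ok]),
             mean((3 - hard[ok]) == truth[ok]))
hard
agree
```

Checking both labelings matters because a mixture model has no reason to call the first simulated segment "class 1".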

Finally, were we successful at recovering the feature prioritizations used to simulate the ratings? In the table below, the D1 column contains the difficulty parameters for the 100 respondents in the first segment. The adjacent column LC1 shows the recovered parameter estimates from the 94 respondents in the first latent class.  Similar results are shown for the second segment and latent class in the D2 and LC2 columns. As you may recall, two respondents giving all ones or all fours could not be classified by the mixRasch function.

        D1     LC1      D2     LC2
n      100      94     100     104
V1    1.50    1.71    1.50    1.76
V2    1.25    2.06    1.25    1.59
V3    1.00    0.99    1.00    1.18
V4    0.25    0.62   -1.50   -1.79
V5    0.00    0.24   -1.25   -1.49
V6   -0.25   -0.42   -1.00   -0.88
V7   -1.00   -1.35    0.25    0.30
V8   -1.25   -1.69    0.00   -0.35
V9   -1.50   -2.16   -0.25   -0.33
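One simple way to summarize the recovery is to correlate the generating difficulties with the estimates. The sketch below just retypes the four columns from the table; the choice of a Pearson correlation as the summary is mine and is only one of several reasonable checks.

```r
# generating difficulties (D1, D2) and recovered estimates (LC1, LC2)
# copied from the table above, features V1 through V9
D1  <- c(1.50, 1.25, 1.00,  0.25,  0.00, -0.25, -1.00, -1.25, -1.50)
LC1 <- c(1.71, 2.06, 0.99,  0.62,  0.24, -0.42, -1.35, -1.69, -2.16)
D2  <- c(1.50, 1.25, 1.00, -1.50, -1.25, -1.00,  0.25,  0.00, -0.25)
LC2 <- c(1.76, 1.59, 1.18, -1.79, -1.49, -0.88,  0.30, -0.35, -0.33)

# agreement between each latent class and its own generating segment
r1 <- cor(D1, LC1)
r2 <- cor(D2, LC2)

# for contrast: latent class 2 against the wrong segment's difficulties
cross <- cor(D1, LC2)

round(c(r1 = r1, r2 = r2, cross = cross), 2)
```

A latent class that tracks its own segment much more closely than the other segment is what we would hope to see if the mixture has separated the two prioritizations.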


What have we learned from this and the last two posts?

A single screening question will tell me if I should include you in my survey of the wine market. Determining if you are a wine enthusiast will require many more questions, and it is likely that you will need to match a pattern of responses before classification is final. Yet, typing alone will not be adequate since systematic variation remains after your classification as a wine enthusiast. It's a matter of degree as one moves from novice to expert, from refrigerator to cellar, and from tasting to wine club host. Our clusters are no longer spherical or elliptical clumps or even regions but elongated networks of ever increasing commitment. As noted in one of my first posts on Archetypal Analysis, the admonition that "one size does not fit all" can be applied to both the need for segmentation and the segmentation process itself.  Customer heterogeneity may be more complex than can be represented by either a latent class or a latent trait alone.

The post was titled "Latent Variable Mixture Models" in an attempt to accurately describe the approach being advanced. The book Advances in Latent Variable Mixture Models was published in 2007, so clearly my title is not original. In addition, a paper with the same name from nursing research provides a readable introduction (e.g., depression is identified by a symptom pattern but differs in intensity from mild to severe). Much of this work uses Mplus instead of R. However, we relied on the R package mixRasch in this post, and R has flexmix, psychomix, mixtools and more that all run some form of mixture modeling. Pursuing this topic would take us some time. So, I am including these references more as a postscript because I wanted to place this post in a broader context without having to explain that broader context.


Appendix with R Code


In order to create the data in ratings, you will need to return to the last post and run portions of the R code listed at the end of that post.

library(mixRasch)
 
# need to set the seed only if
# we want the same result each
# time we run mixRasch
set.seed(20131218)
mR2<-mixRasch(ratings, max.iter=500,
              steps=3, model="PCM", n.c=2)
mR2
 
# shows the list structure 
# containing the output
str(mR2)
 
# latent class membership
# probability and max classification
round(mR2$class,2)
cluster<-max.col(mR2$class)
 
# comparison with simulated data
table(c(rep(1,100),rep(2,100)),cluster)
 
# comparison with row-centered
# kmeans from last post
table(cluster,kcl_rc$cluster)


Sunday, December 15, 2013

The Complexities of Customer Segmentation: Removing Response Intensity to Reveal Response Pattern

At the end of the last post, the reader was left assuming respondent homogeneity without any means for discovering if all of our customers adopted the same feature prioritization. To review, nine features were presented one at a time, and each time respondents reported the likely impact of adding the feature to the current product. Respondents indicated feature impact using a set of ordered behaviorally-anchored categories in order to ground the measurement in a realistic market context. This grounding is essential because feature preference is not retrieved from a table with thousands of decontextualized feature value entries stored in memory. Instead, feature preference is constructed as needed during the purchase process using whatever information is available and whatever memories come to mind at the time. Thus, we want to mimic that purchase process when a potential customer learns about a new feature for the first time. Of course, consumers will still respond if asked about features in the abstract without a purchase context. We are all capable of chitchat, the seemingly endless lists of likes and dislikes that are detached from the real world and intended more for socializing than accurate description.

Moreover, the behaviorally-anchored categories ensure both feature and respondent differentiation since a more extreme behavior can always be added if too many respondents select the highest category. That is, if a high proportion of the sample tell us that they would pay for all the features, we simply increase the severity of the highest category (e.g., would pay considerably more for) or we could add it as an additional category. One can think of this as a feature stress test. We keep increasing the stress until the feature fails to perform. In the end, we are able to achieve the desired differentiation among the features, for only the very best performing features will make it into the highest category. At the same time we are enhancing our ability to differentiate among respondents because only those with the highest demand levels will be selecting the top-box categories.

Customers Wanting Different Features

Now, what if we have customer segments wanting different features? While we are not likely to see complete reversals of feature impact, we often find customers focusing on different features. As a result, many of the features will be rated similarly, but a few features that some customers report as having the greatest influence will be seen as having a lesser impact by others. For instance, some customers attend more to performance, while other customers place some value on performance but are even more responsive to discounts or rewards. However, everyone agrees that the little "extras" have limited worth.

Specifically, in the last post the nine features were arranged in ascending sets of three with difficulty values of {1.50, 1.25, 1.00}, {0.25, 0, -0.25} and {-1.00, -1.25, -1.50}. You might recall that difficulty refers to how hard it is for the feature to have an impact. Higher difficulty is associated with the feature failing the stress test. Therefore, the first feature with a difficulty of 1.50 finds it challenging to boost interest in the product. On the other hand, the difficulty scores for impactful features, such as the last feature, will be negative and large. One interprets the difficulty scale as if it were a z-score because respondents are located on the same scale and their distribution tends toward normal.

Measurement follows from a model of the underlying processes that generate the item responses, which is why it is called item response theory. Features possess "impactability" that is measured on a difficulty scale.  Customers vary in "persuadability" that we measure on the same scale. When the average respondent (with latent trait theta = 0) confronts an average feature (with difficulty d=0), the result is an average observed score for that feature.
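The idea that persuadability and difficulty sit on the same scale is easiest to see in the simplest dichotomous case. The sketch below is a deliberate simplification, a one-parameter logistic (Rasch-type) curve for a yes/no response rather than the polytomous model behind the four-category ratings, and the function name p_impact is hypothetical.

```r
# probability that a feature "has an impact" on a respondent in a
# dichotomous Rasch-type model: only the gap (theta - d) matters
p_impact <- function(theta, d) plogis(theta - d)

p_avg  <- p_impact(theta = 0, d = 0)      # average respondent, average feature
p_hard <- p_impact(theta = 0, d = 1.5)    # difficult feature (like Feature 1)
p_easy <- p_impact(theta = 0, d = -1.5)   # easy feature (like Feature 9)

# a more persuadable respondent facing the same difficult feature
p_keen <- p_impact(theta = 1.5, d = 1.5)

round(c(p_avg = p_avg, p_hard = p_hard, p_easy = p_easy, p_keen = p_keen), 2)
```

Because only the difference theta - d enters the curve, a respondent located 1.5 units above average responds to the hardest feature exactly as the average respondent responds to an average feature.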

I do not know feature impact or customer interest before I collect the rating data. But afterwards, assuming that the features have the same difficulty for everyone, I can use the item scores aggregated across all the respondents to give me an estimate of each feature's impact. Those features with the largest observed impact are the least difficult (large negative values), and those with little effect are the most difficult (large positive values). Then, I can use those difficulty estimates to determine where each respondent is located. Respondents who are impacted by only the most popular features have less interest than respondents responding to the least popular features. How do I know where you fall along the latent persuadability dimension? The features are ordered by their difficulty, as if they were mileposts along the same latent scale that differentiates among respondents. Therefore, your responses to the features tell me your location.

A Simulation to Make the Discussion Concrete

Estimation becomes more complicated when different subgroups want different features. In the item response theory literature, one refers to such an interaction as differential item functioning (DIF). We will avoid this terminology and concentrate on customer segmentation using feature impact as our basis. A simulation will clarify these points by making the discussion more concrete.

A simple example that combines the responses from two segments with the following feature impact weights will be sufficient for our purposes.

              F1     F2     F3     F4     F5     F6     F7     F8     F9
Segment 1   1.50   1.25   1.00   0.25   0.00  -0.25  -1.00  -1.25  -1.50
Segment 2   1.50   1.25   1.00  -1.50  -1.25  -1.00   0.25   0.00  -0.25

Segment 1 shows the same pattern as in the last post with the features in ascending impact and gaps separating the nine into three sets of three features as described above. Segment 2 looks similar to Segment 1, except that the middle three features are the most impactful.  This is a fairly common finding that low scoring features, such as Features 1 through 3, tend to lack impact or importance across the entire sample. On the other hand, when there are feature sets with substantial impact on some segment, those features tend to have some value to everyone. Thus, price is somewhat important to everyone, and really important to a specific segment (which we call the price sensitive). At least, this was my rationale for defining these two segments. Whether you accept my argument or not, I have formed two overlapping segments with some degree of similarity because they share the same weights for the first three features.

Cluster Analyses With and Without Response Intensity

As I show in the R code in the appendix, when you run k-means, you do not recover these two segments. Instead, as shown below with the average ratings across the nine features for each cluster profile, k-means separates the respondents into a "want it all" cluster with higher means across all the features and a "naysayer" cluster with lower means across all the features. What has happened?

Cluster means from a kmeans of the 200 respondents

              F1     F2     F3     F4     F5     F6     F7     F8     F9     N
Cluster 1   2.27   2.24   2.59   3.33   3.39   3.47   3.18   3.53   3.60   121
Cluster 2   1.16   1.41   1.29   2.01   1.94   1.86   1.96   2.03   2.05    79

We have failed to separate the response pattern from the response intensity. We have forgotten that our ratings are a combination of feature impact and respondent persuadability. The difficulties or feature impact scores specify only the response pattern.  Although this is not a deterministic model, in general, we expect to see F1<F2<F3<F4<F5<F6<F7<F8<F9 for Segment #1 and F1<F2<F3<F7<F8<F9<F4<F5<F6 for Segment #2. In addition, we have response intensity. Respondents vary in their persuadability. Some have high product involvement and tend to respond more positively to all the features. Others are more casual users with lower scores over all the features. In this particular case, response intensity dominates the cluster analysis, and we see little evidence of the response patterns that generated the data.

A quick solution is to remove the intensity score, which is the latent variable theta.  We can use the person mean score across all the features as an estimate of theta and cluster using deviation from the person mean.  In the R code I have named this transformed data matrix "ipsative" in order to emphasize that the nine features scores have lost a degree of freedom in the calculation of the row mean and that we have added some negative correlation among the features because they now must sum to zero.  This time, when we run the k-means on the row centered ratings data, we recover our segment response patterns with 84% of the respondents placed in the correct segment. Obviously, the numbering has been reversed so that Cluster #1 is Segment #2 and Cluster #2 is Segment #1.
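A toy example shows what row centering does. The two respondents below are hypothetical: they share the same response pattern, but the second is uniformly one point more enthusiastic. The sweep() call is the same one used in the appendix.

```r
# two made-up respondents with the same pattern but different intensity
toy <- rbind(casual   = c(1, 1, 2, 2, 3, 3),
             involved = c(2, 2, 3, 3, 4, 4))

# subtract each respondent's own mean rating from that respondent's row
centered <- sweep(toy, 1, rowMeans(toy), "-")

centered            # both rows now show the identical deviation pattern
rowSums(centered)   # deviations sum to zero within each row
```

After centering, the two rows are indistinguishable, which is the point: the clustering now sees only the pattern. The sum-to-zero constraint is also why the text speaks of losing a degree of freedom.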

Cluster means from a kmeans of the person-centered data matrix

              F1     F2     F3     F4     F5     F6     F7     F8     F9     N
Cluster 1  -0.78  -0.77  -0.54   0.89   0.74   0.44  -0.24   0.15   0.12    86
Cluster 2  -0.65  -0.53  -0.42  -0.21  -0.08   0.18   0.45   0.58   0.68   114

Clustering using person-centered data may look similar to other methods that correct for halo effects. But that is not what we are doing here.  We are not removing measurement bias or controlling for response styles. Although all ratings contain error, it is more likely that the consistent and substantial correlations observed among such ratings are the manifestation of an underlying trait. When the average ratings co-vary with other measures of product involvement, our conclusion is that we are tapping a latent trait that can be used for targeting and not a response style that needs to be controlled. It would be more accurate to say that we are trying to separate two sources of latent variation, the response pattern (a categorical latent variable) and response intensity (a continuous latent variable).
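The two sources of variation can be mimicked with a stripped-down simulation in base R. This is only a sketch under loud assumptions: the scores are continuous rather than four-category ratings, the two patterns, the intensity spread (sd = 3) and the noise level (sd = 0.2) are values I made up to make the contrast obvious, and agreement() is a hypothetical helper. It is not the IRT simulation used in the appendix.

```r
set.seed(1)
n <- 100
segment <- rep(1:2, each = n)             # categorical latent variable

# made-up response patterns for nine features
pattern <- rbind(c(0, 0, 0, 0, 0, 0, 1, 1, 1),   # segment 1 favors features 7-9
                 c(0, 0, 0, 1, 1, 1, 0, 0, 0))   # segment 2 favors features 4-6

theta <- rnorm(2 * n, mean = 0, sd = 3)   # continuous latent intensity
noise <- matrix(rnorm(2 * n * 9, sd = 0.2), ncol = 9)

# each row = segment pattern + that respondent's intensity + noise
scores <- pattern[segment, ] + theta + noise

# share of respondents placed in the right segment, allowing for label switching
agreement <- function(cl, seg) {
  a <- mean(cl == seg)
  max(a, 1 - a)
}

raw_fit  <- kmeans(scores, centers = 2, nstart = 25)
centered <- sweep(scores, 1, rowMeans(scores), "-")
ctr_fit  <- kmeans(centered, centers = 2, nstart = 25)

raw_agree <- agreement(raw_fit$cluster, segment)
ctr_agree <- agreement(ctr_fit$cluster, segment)
c(raw = raw_agree, centered = ctr_agree)
```

With intensity this strong, k-means on the raw scores mostly splits high from low responders, while the centered scores give back the segments. Real ratings are noisier and bounded, so the recovery will be less clean than in this sketch.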

Why do I claim that response pattern is categorical and not continuous?  Even with only a few features, there are many possible combinations and thus more than enough feature prioritization configurations to form a continuum.  Yet, customers seem to narrow their selection down to only a few feature configurations that are marketed by providers and disseminated in the media. This is why response pattern can be treated as a categorical variable. The consumer adopts one of the feature prioritization patterns and commits to it with some level of intensity. Of course, they are free to construct their own personalized feature prioritization, but consumers tend not to do so. Given the commonalities in usage and situational constraints, along with limited variation in product offerings, consumers can simplify and restrict their feature prioritization to only a few possible configurations. In the above example, one selects an orientation, Features 4, 5 and 6 or Features 7, 8, and 9.  How much one wants either set is a matter of degree.

Maps from Multiple Correspondence Analysis (MCA)

It is instructive to examine the multiple correspondence map that we would have produced had we not run the person-centered cluster analysis, that is, had we mistakenly treated the 200 respondents as belonging to the same population. In the MCA plot for the 36 feature categories displayed below, we continue to see separation between the categories so that every feature shows the same movement from left to right along the arc as one moves from the lowest dark red to the highest dark blue (i.e., from 1 to 4 on the 4-point ordinal scale). However, the features no longer are arrayed in order from Feature 9 through Feature 1, as one would expect given that only 100 of the 200 respondents preferred the features in the original rank ordering from 1 through 9. The first three features remain in order because there is agreement across the two segments, but there is greater overlap among the top six features than that which we observed in the prior post where everyone belonged to Segment 1. Lastly, we see the arc from the previous post that begins in the upper left corner and extends with a bend in the middle until it reaches the upper right corner. We obtain a quadratic curve because the second dimension is a quadratic function of the first dimension.


Finally, I wanted to use the MCA map to display the locations of the respondents and show the differences between the two cluster analyses with and without row centering. First, without row centering the two clusters are separated along the first dimension representing response intensity or the average rating across all nine features. Our first cluster analysis treated the original ratings as numeric and identified two groupings of respondents, the naysayers (red stars) and those wanting it all (black stars). This plot confirms that response intensity dominates when the ratings are not transformed.


Next, we centered our rows by calculating each respondent's deviation score about their mean rating and reran the cluster analysis with these transformed ratings. We were able to recover the two response generation processes that were used to simulate the data. However, we do not see those segments as clusters in the MCA map. In the map below, the red and black stars from the person-centered cluster analysis form overlapping arcs of respondents spread out along the first dimension. MCA does not remove intensity. In fact, response intensity is the latent trait responsible for the arc.


The final plot presents the results from the principal component analysis for the person-centered data. The dimensions are the first two principal components and the arrows represent the projections of the features onto the map. Respondents in the second cluster with higher scores on the last three features are shown in red, and the black stars indicate respondents from the first cluster with preferences for the middle three features. This plot from the principal component analysis on the person-centered ratings demonstrates how deviation scores remove intensity and reveal segment differences in response pattern.


Appendix with R code needed to run the above analyses.
For a description of the code, please see my previous post.

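# psych supplies sim.poly.npl() for simulating polytomous items
library(psych)
# feature difficulties for Segment 1 (d1) and Segment 2 (d2)
d1<-c(1.50,1.25,1.00,.25,0,-.25,
      -1.00,-1.25,-1.50)
d2<-c(1.50,1.25,1.00,-1.50,-1.25,-1.00,
      .25,0,-.25)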
 
set.seed(12413)
bar1<-sim.poly.npl(nvar = 9, n = 100, 
                   low=-1, high=1, a=NULL, 
                   c=0, z=1, d=d1, 
                   mu=0, sd=1.5, cat=4)
bar2<-sim.poly.npl(nvar = 9, n = 100, 
                   low=-1, high=1, a=NULL, 
                   c=0, z=1, d=d2, 
                   mu=0, sd=1.5, cat=4)
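# add 1 so that the ratings run from 1 to 4,
# check the category counts, then stack the two segments
rating1<-data.frame(bar1$items+1)
rating2<-data.frame(bar2$items+1)
apply(rating1,2,table)
apply(rating2,2,table)
ratings<-rbind(rating1,rating2)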
 
# k-means on the raw ratings
kcl<-kmeans(ratings, 2, nstart=25)
kcl
 
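# person-center the ratings by subtracting each row mean,
# confirm that every row now sums to zero, rerun k-means
# and cross-tabulate the clusters with the simulated segments
rowmean<-apply(ratings, 1, mean)
ipsative<-sweep(ratings, 1, rowmean, "-")
round(apply(ipsative,1,sum),8)
kcl_rc<-kmeans(ipsative, 2, nstart=25)
kcl_rc
table(c(rep(1,100),rep(2,100)),
      kcl_rc$cluster)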
 
# MCA needs categorical variables, so convert ratings to factors
F.ratings<-data.frame(ratings)
F.ratings[]<-lapply(F.ratings, factor)
 
library(FactoMineR)
mca<-MCA(F.ratings)
 
categories<-mca$var$coord[,1:2]
categories[,1]<--categories[,1]
categories
 
feature_label<-c(rep(1,4),rep(2,4),rep(3,4),
                 rep(4,4),rep(5,4),rep(6,4),
                 rep(7,4),rep(8,4),rep(9,4))
category_color<-rep(c("darkred","red",
                      "blue","darkblue"),9)
category_size<-rep(c(1.1,1,1,1.1),9)
plot(categories, type="n")
text(categories, labels=feature_label, 
     col=category_color, cex=category_size)
 
mca2<-mca
mca2$var$coord[,1]<--mca$var$coord[,1]
mca2$ind$coord[,1]<--mca$ind$coord[,1]
plot(mca2$ind$coord[,1:2], col=kcl$cluster, pch="*")
plot(mca2$ind$coord[,1:2], col=kcl_rc$cluster, pch="*")
 
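# PCA on the person-centered ratings; the feature arrows are
# stretched by 3.2 so they show up at the scale of the respondents
pca<-PCA(ipsative)
plot(pca$ind$coord[,1:2], col=kcl_rc$cluster, pch="*")
arrows(0, 0, 3.2*pca$var$coord[,1], 
       3.2*pca$var$coord[,2], col = "chocolate", 
       angle = 15, length = 0.1)
text(3.2*pca$var$coord[,1], 3.2*pca$var$coord[,2],
     labels=1:9)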

Tuesday, December 10, 2013

Feature Prioritization: Multiple Correspondence Analysis Reveals Underlying Structure

Measuring the Power of Product Features to Generate Increased Demand

Product management requires more from feature prioritization than a rank ordering. It is simply not enough to know the "best" feature if that best does not generate increased demand. We are not searching for the optimal features in order to design a product that no one will buy. We want to create purchase interest; therefore, that is what we ought to measure.

A grounded theory of measurement mimics the purchase process by asking consumers to imagine themselves buying a product. When we wish to assess the impact of adding features, we think first of a choice model where a complete configuration of product features is systematically manipulated according to an experimental design. However, given that we are likely to be testing a considerable number of separate features rather than configuring a product, our task becomes more of a screening process than a product design. Choice modeling requires more than we want to provide, but we would still like customer input concerning the value of each of many individual features.

We would like to know the likely impact that each feature would have if it were added to the product one at a time. Is the feature so valuable that customers will be willing to pay more? Perhaps this is not the case, but could the additional feature serve as a tie-breaker? That is, given a choice between one offer with the feature and another offer without the feature, would the feature determine the choice? If not, does the feature generate any interest or additional appeal? Or, is the feature simply ignored so that the customer reports "not interested" in this feature?

What I am suggesting is an ordinal, behaviorally-anchored scale with four levels: 1=not interested, 2=nice to have, 3=tie-breaker and 4=pay more for. However, I have no commitment to these or any other specific anchors. They are simply one possibility that seems to capture important milestones in the purchase process. What is important is that we ask about features within a realistic purchase context so that the respondent can easily imagine the likely impact of adding each feature to an existing product.
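In R, such a scale is naturally stored as an ordered factor. The sketch below uses five made-up responses for a single feature; the object names anchors, responses and impact are hypothetical.

```r
# the four behavioral anchors, from least to most demanding
anchors <- c("not interested", "nice to have", "tie-breaker", "pay more for")

# made-up responses from five consumers to one feature
responses <- c(2, 4, 1, 3, 3)

# ordered factor: keeps the labels and the ordering, but claims no equal spacing
impact <- factor(responses, levels = 1:4, labels = anchors, ordered = TRUE)

table(impact)

# ordinal comparisons are allowed: who rated it a tie-breaker or better?
at_least_tiebreaker <- impact >= "tie-breaker"
sum(at_least_tiebreaker)
```

Storing the ratings this way keeps the behavioral meaning attached to each number while still allowing "at least" comparisons.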

Data collection, then, is simply asking consumers who are likely to purchase the product to evaluate each feature separately and place it into one of the ordered behavioral categories. Feature prioritization is achieved by comparing the impact of the features. For example, if management wants to raise its prices, then only those features falling into the "pay more for" category would be considered a positive response. But we can do better than this piecemeal category-specific comparison by studying the pattern of impact across all the features using item response theory (IRT). In addition, it is important to note that in contrast with importance ratings, a behaviorally anchored scale is directly interpretable. Unlike ratings of "somewhat important" or a four on a five-point scale, the product manager has a clear idea of the likely impact of a "tie-breaker" on purchase interest.

It should be noted that the respondent is never asked to rank order the features or select the best and worst from a set of four or five features. Feature prioritization is the job of the product manager and not the consumer. Although there are times when incompatible features require the consumer to make tradeoffs in the marketplace (e.g., price vs. quality), this does not occur for most features and is a task with which consumers have little familiarity. Those of us who like sweet-and-sour food have trouble deciding whether we value sweet more than sour. If the feature is concrete, the consumer can react to the feature concept and report its likely impact on their behavior (as long as we are very careful not to infer too much from self-reports). One could argue that such a task is grounded in consumer experience. On the other hand, what marketplace experience mimics the task of selecting the best or the worst among small groupings of different features? Let the consumer react, and let product management do the feature prioritization.

What is Feature Prioritization?

The product manager needs to make a design decision. With limited resources, what is the first feature that should be added? So why not ask for a complete ranking of all the features? For a single person, this question can be answered with a simple ordering of the features. That is, if my goal were to encourage this one person to become more interested in my product, a rank ordering of the features would provide an answer. Why not repeat the same ranking process for n respondents and calculate some type of aggregate rank?

Unfortunately, rank is not purchase likelihood. Consider the following three respondents and their purchase likelihoods for three versions of the same product differing only in what additional feature is included.

           Purchase Likelihood       Rank Ordering
           f1      f2      f3        f1      f2      f3
r1         0.10    0.20    0.30      3       2       1
r2         0.10    0.20    0.30      3       2       1
r3         0.90    0.50    0.10      1       2       3
Average    0.37    0.30    0.23      2.33    2.00    1.67

What if we had only the rankings and did not know the corresponding purchase interest? Feature 3 is ranked first most often and has the highest average ranking. Yet, unknown to us because we did not collect the data, the highest average purchase likelihood belongs to Feature 1. It appears that we may have ranked prematurely and allowed our desire for feature differentiation to blind us to the unintended consequences.
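The arithmetic behind this comparison is easy to verify. The following is a minimal sketch in base R that rebuilds the three respondents from the table, derives the within-respondent ranks, and compares the two kinds of averages. The object names (likelihood, ranks, and so on) are mine and are not part of the code in the appendix.

```r
# Purchase likelihoods from the table: rows are respondents, columns are features
likelihood <- rbind(
  r1 = c(f1 = 0.10, f2 = 0.20, f3 = 0.30),
  r2 = c(f1 = 0.10, f2 = 0.20, f3 = 0.30),
  r3 = c(f1 = 0.90, f2 = 0.50, f3 = 0.10)
)

# Within each respondent, rank 1 goes to the feature with the highest likelihood
ranks <- t(apply(-likelihood, 1, rank))

avg_likelihood <- colMeans(likelihood)
avg_rank <- colMeans(ranks)

# The "winner" depends on which summary we look at
best_by_likelihood <- names(which.max(avg_likelihood))  # highest mean likelihood
best_by_rank <- names(which.min(avg_rank))              # best (lowest) mean rank

print(round(avg_likelihood, 2))
print(round(avg_rank, 2))
```

The averages disagree because the ranks discard the size of the gaps between features, and the third respondent's gaps are much larger than those of the first two.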

We, on the other hand, have not taken the bait and have tested the behaviorally-anchored impact of each feature. Consequently, we have a score from 1 to 4 for every respondent on every feature. If there were 200 consumers in the study and 9 features, then we would have a 200 x 9 data matrix filled with 1's, 2's, 3's, and 4's. And now what shall we do?

Analysis of the Behaviorally-Anchored Categories of Feature Impact

First, we recognize that our scale values are ordinal. Our consumers read each feature description and ask themselves which category best describes the feature's impact. The behavioral categories are selected to represent milestones in the purchase process. We encourage consumers to think about the feature in the context of marketplace decisions. Our interest is not the value of a feature in some Platonic ideal world, but the feature's actual impact on real-world behavior that will affect the bottom line. When I ask for an importance rating without specifying a context, respondents are on their own with little guidance or constraint. Behaviorally-anchored response categories constrain the respondent to access only the information that is relevant to the task of deciding which of the scale values most accurately represents their intention.

To help understand this point, one can think of the difference between measuring age in equal yearly increments and measuring age in terms of transitional points in the United States such as 18, 21, and 65. In our case we want a series of categories that both span the range of possible feature impacts and have high imagery so that the measurement is grounded in a realistic context. Moreover, we want to be certain to measure the higher ends of feature impact because we are likely to be asking consumers about features that we believe they want (e.g., "pay more for" or "must have" or "first thing I would look for"). Such breakpoints are at best ordinal and require something like the graded response model from item response theory (an overview is available at this prior post).

Feature prioritization seeks a sequential ordering of the features along a single dimension representing their likely impact on consumers. The consumer acts as a self-informant and reports how they would behave if the feature were added to the product. It is important to note that both the features and consumers vary along the same continuum. Features have greater or less impact, and consumers report varying levels of feature impact. That is, consumers have different levels of involvement with the product so that their wants and needs have different intensity. Greater product involvement leads to higher impact ratings across all the features. For example, cell phone users vary in the intensity with which they use their phone as a camera. As a result, when asked about the likely impact of additional cell phone features associated with the picture quality or camera ease of use, the more intense camera users are likely to give uniformly higher ratings to all the features. The features still receive different scores with the most desirable features getting the highest scores. Consequently, we might expect to observe the same pattern of high and low impact ratings for all the users with elevation of the curve dependent on feature usage intensity.
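To make the idea of a shared continuum concrete, here is a small sketch of the category probabilities under a graded response model in its usual logistic form. It is only an illustration: the function names grm_probs and expected_rating, the discrimination of 1, and the threshold values are all assumptions of mine and are not taken from the simulation in the appendix.

```r
# Category probabilities under a graded response model (logistic form).
# theta: respondent's position on the latent continuum
# a:     item discrimination (hypothetical value used below)
# b:     increasing thresholds between adjacent categories (hypothetical values)
grm_probs <- function(theta, a, b) {
  stopifnot(!is.unsorted(b))
  cum <- plogis(a * (theta - b))    # P(rating >= k) for k = 2, ..., K
  probs <- c(1, cum) - c(cum, 0)    # P(rating == k) for k = 1, ..., K
  names(probs) <- seq_along(probs)
  probs
}

# Expected rating on the 1 to K scale
expected_rating <- function(theta, a, b) {
  p <- grm_probs(theta, a, b)
  sum(seq_along(p) * p)
}

popular   <- c(-2.5, -1.5, -0.5)   # thresholds for an "easy" feature
unpopular <- c( 0.5,  1.5,  2.5)   # thresholds for a "hard" feature
thetas <- c(low = -1, high = 1)    # a less involved and a more involved consumer

# Rows are features, columns are consumers
expected <- sapply(thetas, function(th) {
  c(popular = expected_rating(th, 1, popular),
    unpopular = expected_rating(th, 1, unpopular))
})
print(round(expected, 2))
```

Under these assumptions the more involved consumer gives higher expected ratings to both features, while the popular feature stays ahead of the unpopular one for both consumers. That is the "same pattern, different elevation" described above.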

We need an example to illustrate these points. The R package psych contains a function for simulating graded response data. As shown in the appendix, we have randomly generated 200 respondents who rated 9 features on the 4-point anchored scale previously defined. The heatmap below reveals the underlying structure.
The categories are represented by color with dark blue=4, light blue=3, light red=2 and dark red=1. The respondents have been sorted by their total scores so that the consistently lowest ratings are indicated by rows of dark red toward the top of the heatmap. As one moves down the heatmap, the rows change gradually from predominantly red to blue. In addition, the columns have been sorted from the features with the lowest average rating to those features with the highest rating. As a result, the color changes in the rows follow a pattern with the most wanted features turning colors first as we proceed down the heatmap. V9 represents a feature that most want, but V1 is desired by only the most intense user. Put another way, our user in the bottom row is a "cell phone picture taking machine" who has to have every feature, even those features in which most others have little interest. However, the users near the top of our heatmap are not interested in any of the additional features.

Multiple Correspondence Analysis (MCA)

In the prior post that I have already referenced, the reader can find a worked example of how to code and interpret a graded response model using the R package ltm for latent trait modeling. Instead of repeating that analysis with a new data set, I wanted to supplement the graphic displays from heatmaps with the graphic displays from MCA. Warrens and Heiser have written an overview of this approach, but let me emphasize a couple of points from that article. MCA is a dual scaling technique. Dual scaling refers to the placement of rows and columns in the same space. Similar rows are placed near each other and away from dissimilar rows. The same is true for columns. The relationship between rows and columns, however, is not directly plotted on the map, although in general one will find respondents located in the same region as the columns they tended to select. Even if you have had no prior experience with MCA, you should be able to follow the discussion below as if it were nothing more than a description of some scatterplots.

As mentioned before, the R code for all the analysis can be found in the appendix at the end of this post. Unlike a graded response model, MCA treats all category levels as nominal or factors in R. Thus, the 200 x 9 data matrix must be expanded to repeat the four category levels for each of the nine features. That is, a "3" (indicating that the feature would break a tie between two otherwise equivalent products) is represented in MCA by four category levels taking only zero and one values (in this case 0, 0, 1, 0). This means that the 200 x 9 data matrix becomes a 200 x 36 indicator matrix with four columns for each feature.
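The expansion into an indicator matrix can be sketched in base R. The tiny three-respondent, two-feature data frame below is made up purely for illustration, and the names toy and indicator are mine. The full data set would go through the same steps and end up with 36 columns.

```r
# Hypothetical ratings for 3 respondents on 2 features, stored as factors
toy <- data.frame(
  V1 = factor(c(3, 1, 4), levels = 1:4),
  V2 = factor(c(2, 2, 1), levels = 1:4)
)

# One 0/1 column per category level of each feature
indicator <- do.call(cbind, lapply(names(toy), function(v) {
  m <- outer(as.integer(toy[[v]]), 1:4, "==") * 1
  colnames(m) <- paste(v, 1:4, sep = "_")
  m
}))

print(indicator)
```

The first respondent's "3" on V1 becomes the pattern 0, 0, 1, 0 in the columns V1_1 to V1_4, and every row contains exactly one 1 per feature.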

Looking at our heatmap above, the bottom rows with many 3's and 4's (blue) across all the features will be positioned near each other because their response patterns are similar. We would expect the same for the top rows, although in this case it is because the respondents share many 1's and 2's (dark red). We can ask the same location question about the 36 columns, but now the heatmap shows only the relationships among the 9 features. Obviously, adjacent columns are more similar, but we will need to examine the MCA map to learn about the locations of the category levels for each feature.  I have presented such a map below displaying the positions of the 36 columns on the first two latent dimensions.


The numbers refer to the nine features (V1 to V9). There are four 9's because there are four category levels for V9. I used the same color scheme as the heatmap, so that the 1's are dark red, and I made them slightly larger in order to differentiate them from the 2's that are a lighter shade of red. Similarly, the light blue numbers refer to category level 3 for each feature, and the slightly larger dark blue numbers are the feature's fourth category (pay more for).

You can think of this "arc" as a path along which the rows of the heatmap fall. As you move along this path from the upper left to the upper right, you will trace out the rows of the heatmap. That is, the dark red "9" indicates the lowest score for Feature 9. It is followed by the lowest score for Feature 8 and Feature 7. What comes next? The light red for Feature 9 indicates the second lowest score for Feature 9. Then we see the lowest scores for the remaining features. This is the top of our heatmap. When we get to the end of our path, we see a similar pattern in reverse with respondents at the bottom of the heatmap.

Sometimes it helps to remember that the first dimension represents the respondent's propensity to be impacted by the features. As a result, we see the same rank ordering of the features repeated for each category level. For example, you can see the larger and darker red numbers decrease from 9 to 1 as you move down from the upper left side. Then, that pattern is repeated for the lighter and smaller red numbers, although Features 6 and 1 seem to be a little out of place. Feature 2 is hidden, but the light blue features are ordered as expected. Finally, the last repetition is for the larger dark blue numbers, even if Feature 4 moved out of line. In this example with simulated data, the features and the categories are well-separated. While we will always expect to see the categories for a single feature to be ordered, it is common to see overlap between different features and their respective categories.

It is worth our time to understand the arrangement of feature levels by examining the coordinates for the 36 columns that are plotted in the above graph (shown below). First, the four levels for each feature follow the same pattern on the first dimension with 1 < 2 < 3 < 4. Of course, this is what one would expect given that the scale was constructed to be ordinal. Still, we have passed a test since MCA does not force this ordering. The data entering an MCA are factors or nominal variables that are not ordered. Second, the fact that Feature 9 is preferred over Feature 1 can be seen in the placement of the four levels for each feature. Feature 1 starts at -0.39 (V1_1) and ends at 2.02 (V1_4). Feature 9, on the other hand, begins at -1.69 (V9_1) and finishes at 0.44 (V9_4). V9_1 has the lowest value on the first dimension because only a respondent with no latent interest in any feature would give the lowest score to the best feature. Similarly, the highest value on the first dimension belongs to V1_4 since only a zealot would pay more for the worst feature.


          Dim 1    Dim 2
V1_1      -0.39    -0.06
V1_2       0.17    -0.36
V1_3       1.06     0.91
V1_4       2.02     1.97
V2_1      -0.58     0.13
V2_2       0.27    -0.64
V2_3       0.96     0.55
V2_4       1.46     1.42
V3_1      -0.80     0.13
V3_2       0.16    -0.33
V3_3       0.86     0.03
V3_4       1.14     1.16
V4_1      -0.97     0.18
V4_2      -0.18     0.07
V4_3       0.41    -0.39
V4_4       0.92     0.42
V5_1      -0.88     0.42
V5_2      -0.55    -0.25
V5_3       0.24    -0.54
V5_4       1.03     0.68
V6_1      -1.26     0.68
V6_2      -0.22    -0.36
V6_3       0.08    -0.41
V6_4       0.96     0.54
V7_1      -1.54     1.28
V7_2      -0.72    -0.32
V7_3       0.04    -0.36
V7_4       0.60     0.19
V8_1      -1.52     1.59
V8_2      -0.93    -0.25
V8_3      -0.20    -0.25
V8_4       0.58     0.10
V9_1      -1.69     2.24
V9_2      -1.33     0.81
V9_3      -0.30    -0.34
V9_4       0.44    -0.07
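
The two observations about the first dimension can be checked directly from the tabled coordinates. The sketch below simply retypes the Dim 1 column into a 9 x 4 matrix (the name dim1 is mine) and tests the ordering of the levels within each feature and the location of the two extreme categories.

```r
# Dim 1 coordinates copied from the table: one row per feature, one column per level
dim1 <- matrix(c(
  -0.39,  0.17, 1.06, 2.02,
  -0.58,  0.27, 0.96, 1.46,
  -0.80,  0.16, 0.86, 1.14,
  -0.97, -0.18, 0.41, 0.92,
  -0.88, -0.55, 0.24, 1.03,
  -1.26, -0.22, 0.08, 0.96,
  -1.54, -0.72, 0.04, 0.60,
  -1.52, -0.93, -0.20, 0.58,
  -1.69, -1.33, -0.30, 0.44
), nrow = 9, byrow = TRUE,
dimnames = list(paste0("V", 1:9), paste0("level", 1:4)))

# Are the four levels strictly increasing within every feature?
levels_ordered <- apply(dim1, 1, function(x) all(diff(x) > 0))

# Which feature owns the lowest "1" and which owns the highest "4"?
lowest_level1  <- names(which.min(dim1[, "level1"]))
highest_level4 <- names(which.max(dim1[, "level4"]))

print(levels_ordered)
print(c(lowest_level1 = lowest_level1, highest_level4 = highest_level4))
```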

Finally, let us see how the respondents are positioned on the same MCA map. The red triangles are the 36 columns of the indicator matrix whose coordinates we have already seen in the above table. The blue dots are respondents. Respondents with similar response profiles are placed near each other. Given the feature structure shown in the heatmap, total score becomes a surrogate for respondent similarity. Defining a respondent's total score as the sum of the nine feature scores means that the total scores can range from 9 to 36. The 9'ers can be found at the top of the heatmap, and the 36'ers are at the bottom. It should be obvious that the closer the rows in the heatmap, the more similar the respondents.

Again, we see an arc that can be interpreted as the manifold or principal curve showing the trajectory of the underlying latent trait. The second dimension is a quadratic function of the first dimension (dim 2 = f(dim 1^2) with R-square = 0.86). This effect has been named the "horseshoe" effect. Although that name is descriptive, it encourages us to think of the arc as an artifact rather than a quadratic curve representing a scaling of the latent trait.
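
One way to quantify such an arc is to regress the second dimension on the square of the first and read off the R-squared. The helper below is a sketch under that assumption. The name arc_r2 is hypothetical, I am assuming a regression that is linear in the squared term, and the toy coordinates are simulated only to show the call. With the objects from the appendix, one would pass the first two columns of the respondent coordinates from the MCA result instead.

```r
# R-squared from regressing dimension 2 on the square of dimension 1.
# coord: a matrix or data frame whose first two columns are the map coordinates
arc_r2 <- function(coord) {
  dat <- data.frame(dim1 = coord[, 1], dim2 = coord[, 2])
  fit <- lm(dim2 ~ I(dim1^2), data = dat)
  summary(fit)$r.squared
}

# Toy coordinates lying close to a parabola (illustration only)
set.seed(1)
x <- seq(-1.5, 1.5, length.out = 50)
toy_coord <- cbind(x, 0.8 * x^2 - 0.5 + rnorm(50, sd = 0.05))

toy_r2 <- arc_r2(toy_coord)
print(toy_r2)
```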


Finally, respondents fall along this arc in the same order as their total scores. Respondents at the low end of the arc in the upper left are those giving the lowest scores to all the items. At the end of the arc in the upper right is where we find those respondents giving the highest feature scores.
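
The link between position on the first dimension and total score can be checked without any add-on packages. The sketch below is not the analysis from the appendix. It simulates its own graded response data in base R (the seed, difficulties, threshold spacing, and a discrimination of 1 are my assumptions), builds the indicator matrix, and obtains respondent coordinates from a plain correspondence analysis of that matrix via svd(), which is the computation underlying MCA. The sign of a dimension is arbitrary, so the check uses the absolute rank correlation.

```r
set.seed(2013)
n <- 200
d <- seq(1.5, -1.5, length.out = 9)   # hypothetical feature difficulties
steps <- c(-1, 0, 1)                  # hypothetical spacing of the three thresholds
theta <- rnorm(n)                     # latent involvement of each respondent

# Simulate ratings from 1 to 4 under a graded response model
ratings <- sapply(d, function(dj) {
  cum <- plogis(outer(theta, dj + steps, "-"))  # n x 3: P(rating >= k), k = 2..4
  1 + rowSums(runif(n) < cum)
})
colnames(ratings) <- paste0("V", 1:9)

# Indicator matrix: one 0/1 column per category level of each feature
Z <- do.call(cbind, lapply(colnames(ratings), function(v) {
  m <- outer(ratings[, v], 1:4, "==") * 1
  colnames(m) <- paste(v, 1:4, sep = "_")
  m
}))
Z <- Z[, colSums(Z) > 0, drop = FALSE]  # drop any category nobody used

# Correspondence analysis of the indicator matrix
P <- Z / sum(Z)
r <- rowSums(P)                                # row masses
cm <- colSums(P)                               # column masses
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))   # standardized residuals
sv <- svd(S)
row_coord <- sweep(sv$u[, 1:2], 2, sv$d[1:2], "*") / sqrt(r)

# Rank correlation between total score and position on dimension 1
total <- rowSums(ratings)
rho <- cor(total, row_coord[, 1], method = "spearman")
print(abs(rho))
```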

Caveats and Other Conditions, Warnings and Stipulations

Everything we have done depends on the respondent's ability to know and report accurately on how the additional feature will impact them in the marketplace. Self-report, however, does not have a good track record, though sometimes it is the best we can do when there are many features to be screened. Besides social desirability, the most serious limitation is that respondents tend to be too optimistic because their mental simulations do not anticipate the impediments that will occur when the product with the added feature is actually offered in the marketplace. Prospection is no more accurate than retrospection.

Finally, it is an empirical question whether the graded response model captures the underlying feature prioritization process. To be clear, we are assuming that our behaviorally-anchored ratings are generated on the basis of a single continuous latent variable along which both the features and the respondents can be located. This may not be the case. Instead, we may have a multidimensional feature space, or our respondents may be a mixture of customer segments with different feature prioritization.

If customer heterogeneity is substantial and the product supports varying product configurations, we might find different segments wanting different feature sets. For example, feature bundles can be created to appeal to diverse customer segments as when a cable or direct TV provider offers sports programming or movie channel packages at a discounted price. However, this is not a simple finite mixture or latent class model but a hybrid mixture of customer types and intensities, because some buyers of sports programming are sports fanatics who cannot live without it and other buyers are far less committed. You can read more about hybrid mixtures of categorical and continuous latent variables in a previous post.

In practice, one must always be open to the possibility that your data set is not homogeneous but contains one or more segments seeking different features. Wants and needs are not like perceptions, which seem to be shared even by individuals with very different needs (e.g., you may not have children and never go to McDonald's but you know that it offers fast food for kids). Nonetheless, when the feature set does not contain bundles deliberately chosen to appeal to different segments, the graded response model seems to perform well.

I am aware that our model imposes a good deal of structure on the response generation process. Yet in the end, the graded response model reflects the feature prioritization process in the same way that a conjoint model reflects the choice process. Conjoint models assume that the product is a bundle of attributes and that product choice can be modeled as an additive combination of the value of attribute levels. If the conjoint model can predict choice, it is accepted as an "as if" model even when we do not believe that consumers stored large volumes of attribute values that they retrieve from memory and add together. They just behave as if they did.


Appendix:  All the R code needed to create the data and run the analyses

I used a function from the psych library to generate the 4-point rating scale for the graded response model.  One needs to set the difficulty values using d and realize that the more difficult items are the less popular ones (i.e., a "hard" feature finds it "hard" to have an impact, while an "easy" feature finds it "easy").  The function sim.poly.npl() needs to know the number of variables (nvar), the number of respondents (n), the number of categories (cat), and the mean (mu) plus standard deviation (sd) for the normal distribution describing the latent trait differentiating among our respondents.  The other parameters can be ignored for this example.  The function returns a list with scale scores in bar$items from 0 to 3 (thus the +1 to get a 4-point scale from 1 to 4).

The function heatmap.2() comes from the gplots package.  Since I have sorted the data matrix by row and column marginals, I have suppressed the clustering of row and columns.

The MCA() function from FactoMineR needs factors, so there are three lines showing how to make ratings a data frame and then use lapply to convert the numeric to factors.  You will notice that I needed to flip the first dimension to run from low to high, so there are a number of lines that reverse the sign of the first dimension.


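library(psych)
# difficulty of each feature: higher values mean less popular features
d<-c(1.50,1.25,1.00,.25,0,
     -.25,-1.00,-1.25,-1.50)
set.seed(12413)
# simulate 200 respondents rating 9 features in 4 ordered categories
bar<-sim.poly.npl(nvar = 9, n = 200, 
                  low=-1, high=1, a=NULL, 
                  c=0, z=1, d=d, mu=0, 
                  sd=1, cat=4)
# items are scored 0 to 3, so add 1 to get the 1 to 4 scale
ratings<-bar$items+1
 
library(gplots)
# sort rows by respondent total score and columns by feature mean
feature<-apply(ratings,2,mean)
person<-apply(ratings,1,sum)
ratingsOrd<-ratings[order(person),
                    order(feature)]
# clustering is suppressed because the matrix is already sorted
heatmap.2(as.matrix(ratingsOrd), Rowv=FALSE, 
          Colv=FALSE, dendrogram="none", 
          col=redblue(16), key=FALSE, 
          keysize=1.5, density.info="none", 
          trace="none", labRow=NA)
 
# MCA() needs factors, so convert every rating column
F.ratings<-data.frame(ratings)
F.ratings[]<-lapply(F.ratings, factor)
str(F.ratings)
 
library(FactoMineR)
mca<-MCA(F.ratings)
summary(mca)
 
# category coordinates on the first two dimensions;
# flip the sign of dimension 1 so that it runs from low to high
categories<-mca$var$coord[,1:2]
categories[,1]<- -categories[,1]
categories
 
# plot each category as its feature number, colored by category level
feature_label<-c(rep(1,4),rep(2,4),rep(3,4),
                 rep(4,4),rep(5,4),rep(6,4),
                 rep(7,4),rep(8,4),rep(9,4))
category_color<-rep(c("darkred","red",
                      "blue","darkblue"),9)
category_size<-rep(c(1.1,1,1,1.1),9)
plot(categories, type="n")
text(categories, labels=feature_label, 
     col=category_color, cex=category_size)
 
# same sign flip on a copy of the MCA object for the respondent map
mca2<-mca
mca2$var$coord[,1]<- -mca$var$coord[,1]
mca2$ind$coord[,1]<- -mca$ind$coord[,1]
plot(mca2, choix="ind", label="none")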