Sunday, March 23, 2014

Warning: Clusters May Appear More Separated in Textbooks than in Practice

Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.

I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like "Shotguns and Pickups" and "Upper Crust"? Yet, are "Big City Blues" and "Low-Rise Living" really separate clusters or simply variations on a common set of dwelling constraints?

Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences do not form clumps or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.

Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.

This is the Old Faithful geyser data from the "datasets" R package showing the waiting time in minutes between successive eruptions on the y-axis and the duration of the eruptions along the x-axis. It is worth your time to get familiar with Old Faithful because it is one of those datasets that gets analyzed over and over again using many different programs. There seem to be two concentrations of points: shorter eruptions that follow shorter waiting times and longer eruptions that follow longer waiting periods. If we told the Mclust function from the mclust package that the scatterplot contains observations from G=2 groups, the function would produce a classification plot that looked something like this:

The red and the blue with their respective ellipses are the two normal densities that are getting mixed. It is such a straightforward example of finite mixture or latent class models (as these models are also called by analysts in other fields of research). If we discovered that there were two pools feeding the geyser, we could write a compelling narrative tying all the data together.

The mclust vignette or manual is comprehensive but not overly difficult. If you prefer a lecture, there is no better introduction to finite mixture than the MathematicalMonk YouTube video. The key to understanding finite mixture models is recognizing that the underlying latent variable responsible for the observed data is categorical, a latent class which we do not observe, but which explains the location and shape of the data points. Do you have a cold or the flu? Without medical tests, all we can observe are your symptoms. If we filled a room with coughing, sneezing, achy and feverish people, we would find a mixture of cold and flu with differing numbers of each type.

This appears straightforward, but for the general case, how does one decide how many categories, in what proportions, and with what means and covariance matrices? That is, those two ellipses in the above figure are drawn using a vector with means for the x- and y-axes plus a 2x2 covariance matrix. The means move the ellipse over the space, and the covariance matrix changes the shape and orientation of the ellipse. A good summary of the possible forms is given in Table 1 of the mclust vignette.
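If you want to see the numbers behind those ellipses, the fitted parameters sit inside the Mclust object. A minimal sketch, assuming the faithfulMclust object created in the code at the end of this post:

#inspect the estimated mixture parameters
faithfulMclust$parameters$pro             #mixing proportions
faithfulMclust$parameters$mean            #one mean vector per group (columns)
faithfulMclust$parameters$variance$sigma  #a 2x2 covariance matrix per group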

Unfortunately, the mathematics of the EM algorithm used to solve this problem gets complicated quickly. Fortunately, Chris Bishop provides an intuitive introduction in a 2004 video lecture. Starting at 44:14 of Part 4, you will find a step-by-step description of how the EM algorithm works with the Old Faithful data. Moreover, in Chapter 9 of his book, Bishop cleverly compares the workings of the EM and k-means algorithms, leaving us with a better understanding of both techniques.
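For readers who want to see the two steps in code, here is a minimal, hand-rolled EM for a two-component univariate normal mixture fit to the Old Faithful waiting times. It is only a sketch of the algorithm Bishop walks through (Mclust does all of this, and more, for you), and the starting values are my own:

#minimal EM for a two-component univariate normal mixture
data(faithful)
x <- faithful$waiting
p <- 0.5                 #starting mixing proportion
mu <- c(55, 80)          #starting means
sigma <- c(5, 5)         #starting standard deviations
for (iter in 1:100) {
  #E-step: responsibility of component 1 for each observation
  d1 <- p * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  #M-step: update proportion, means, and standard deviations
  p <- mean(r1)
  mu[1] <- sum(r1 * x) / sum(r1)
  mu[2] <- sum((1 - r1) * x) / sum(1 - r1)
  sigma[1] <- sqrt(sum(r1 * (x - mu[1])^2) / sum(r1))
  sigma[2] <- sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1))
}
round(c(prop1 = p, mu = mu, sigma = sigma), 2)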

If only my data showed such clear discontinuity and could be tied to such a convincing narrative.

Product Categories Are Structured around the Good, the Better, and the Best

Almost every product category offers a range of alternatives that can be described as good, better, and best. Such offerings reflect the trade-offs that customers are willing to make between the quality of the products they want and the amount that they are ready to spend. High-end customers demand the best and will pay for it. On the other end, one finds customers with fewer needs and smaller budgets accepting less and paying less. Clearly, we have heterogeneity, but are those differences continuous or discrete? Can we tell by looking at the data?

Unlike the Old Faithful data with its well-separated groupings of data points, product quality-price trade-offs look more like the following plot of 300 consumers indicating how much they are willing to spend and what they expect to get for their money (i.e., product quality is a composite index combining desired features and services).

There is a strong positive relationship between demand for quality and willingness to pay, so a product manager might well decide that there was opportunity for at least a high-end and a low-end option. However, there are no natural breaks in the scatterplot. Thus, if this data cloud is a mixture of distinct distributions, then these distributions must be overlapping.

Another example might help. As shown by John Cook, the distribution of heights among adults is a mixture of two overlapping normal distributions, one for men and another for women. Yet, as you can observe from Cook's plots, the mixture of men's and women's heights does not appear bimodal because the separation between the two distributions is not large enough. If you follow the links in Cook's post, eventually you will find the paper "Is Human Height Bimodal?", which clearly demonstrates that many mixtures of distributions appear to be homogeneous. We simply cannot tell that they are mixtures by looking just at the shape of the distribution for the combined data. The Old Faithful data with its well-separated bimodal curve provides a nice contrast, especially when we focus only on waiting time as a single dimension (Fitting Mixture Models with the R Package mixtools).
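A quick simulation makes the point. The means and standard deviations below are my own rough values for illustration, not Cook's; the mixture density shows no second mode, while the Old Faithful waiting times clearly do:

#two overlapping normals whose mixture looks unimodal
set.seed(1234)
men <- rnorm(5000, mean = 70, sd = 3)        #heights in inches
women <- rnorm(5000, mean = 64.5, sd = 2.5)
plot(density(c(men, women)), main = "Mixture of two overlapping normals")
#contrast with the clearly bimodal waiting times
data(faithful)
plot(density(faithful$waiting), main = "Old Faithful waiting times")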

Perhaps then, segmentation does not require gaps in the data cloud. As Wedel and Kamakura note, "Segments are not homogeneous groupings of customers naturally occurring in the marketplace, but are determined by the marketing manager's strategic view of the market." That is, one could look at the above scatterplot, see the presence of a strong first principal component from the closeness of the data points to the principal axis of the ellipse and argue that customer heterogeneity is a single continuous dimension running from the low- to the high-end of the product spectrum. Or, one could look at the same scatterplot and see three overlapping segments seeking basic, value and premium products (good, better, best). Let's run mclust and learn if we can find our three segments.


When instructed to look for three clusters, the Mclust function returned the above result. The 300 observed points are represented as a mixture of three distributions falling along the principal axis of the larger ellipse formed by all the observations. However, if I had not specified three clusters and had asked Mclust to use its default BIC criterion to select the number of segments, I would have been told that there was no compelling evidence for more than one homogeneous group. Without any prior specification, Mclust would have returned a single homogeneous distribution, although, as you can see from the R code below, my 300 observations were a mixture of three equal-sized distributions falling along the principal axis and separated by one standard deviation.

Number of Segments = Number Yielding Value to Product Management

Market segmentation lies somewhere between mass marketing and individual customization. When mass marketing fails because customers have different preferences or needs and customization is too costly or difficult, the compromise is segmentation. We do not need "natural" grouping, but just enough coherence for customers to be satisfied by the same offering. Feet come in many shapes and sizes. The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager's perspective.

My warning still holds. Reification can be dangerous. These segments are meaningful only within the context of the marketing problem created by trying to satisfy everyone with products and services that yield maximum profit. Some segmentations may return clusters that are well-separated and represent groups with qualitatively different needs and purchase processes. Many of these are obvious and define different markets. If you don't have a dog, you don't buy dog food. However, when segmentation seeks to identify those who feel that their dog is a member of the family, we will find overlapping clusters that we treat differently not because we have revealed the true underlying typology, but because it is in our interest. Don't be fooled into believing that our creative segment names reveal the true workings of the consumer mind.

Finally, what is true for marketing segmentation is true for all of cluster analysis. "Clustering: Science or Art?" (a 2009 NIPS workshop) raises many of these same issues for cluster analysis in general. Videos of this workshop are available at Videolectures. Unlike supervised learning with its clear criterion for success and failure, clustering depends on users of the findings to tell us if the solution is good or bad, helpful or not. On the one hand, this seems to make everything more difficult. On the other hand, it frees us to be more open to alternative methods for describing heterogeneity as it is now and how it evolves over time.  

We seek to understand the dynamic structure of diversity, which only sometimes takes the form of cohesive clusters separated by gaps. Other times, a model with only continuous latent variables seems to be the best choice (e.g., brand perceptions). And, not unexpectedly, there are situations where heterogeneity cannot be explained without both categorical and continuous latent variables (e.g., two or more segments seeking alternative benefit profiles with varying intensities).

Yet, even these three combinations cannot adequately account for all the forms of diversity we find in consumer data.  Innovation might generate a structure appearing more like long arrays or streams of points seemingly pulled toward the periphery by an archetypal ideal or aspirational goal. And if the coordinate space of k-means and mixture models becomes too limiting, we can replace it with pairwise dissimilarity and graphical clustering techniques, such as affinity propagation or spectral clustering. Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.

R code for all figures and analysis

#attach faithful data set
data(faithful)
plot(faithful, pch="+")
 
#run mclust on faithful data
require(mclust)
faithfulMclust<-Mclust(faithful, G=2)
summary(faithfulMclust, parameters=TRUE)
plot(faithfulMclust)
 
#create 3 segment data set
require(MASS)
sigma <- matrix(c(1.0,.6,.6,1.0),2,2)
mean1<-c(-1,-1)
mean2<-c(0,0)
mean3<-c(1,1)
set.seed(3202014)
mydata1<-mvrnorm(n=100, mean1, sigma)
mydata2<-mvrnorm(n=100, mean2, sigma)
mydata3<-mvrnorm(n=100, mean3, sigma)
mydata<-rbind(mydata1,mydata2,mydata3)
colnames(mydata)<-c("Desired Level of Quality",
                    "Willingness to Pay")
plot(mydata, pch="+")
 
#run Mclust with 3 segments
mydataClust<-Mclust(mydata, G=3)
summary(mydataClust, parameters=TRUE)
plot(mydataClust)
 
#let Mclust decide on number of segments
mydataClust<-Mclust(mydata)
summary(mydataClust, parameters=TRUE)


Tuesday, January 28, 2014

Context Matters When Modeling Human Judgment and Choice

Herbert Simon was succinct when he argued that judgment and choice "is shaped by a scissor whose two blades are the structure of the task environment and the computational capabilities of the actor" (Simon, 1990, p.7). As a marketing researcher, I take Simon seriously and will not write a survey item without specifying the respondent's task and what cognitive processes will be involved in the task resolution.

Thus, when a client asks for an estimate of the proportion of European car rentals made in the United States that will be requests for automatic transmissions, I do not ask "On a scale from 1=poor to 10=excellent, how would you rate your ability to drive a car with a manual transmission?" Estimating one's ability, which involves an implicit comparison with others, does not come close to mimicking the car rental task structure. Nor would I ask for the likelihood of ordering an automatic transmission because probability estimation is not choice. Likelihood will tend to be more sensitive to factors that will never be considered when the task is choice. In addition, I need a context and a price. It probably makes a difference if the rental is for business or personal use, for driving in the mountains or in city traffic, the size of the vehicle, and much more. Lastly, the proportion of drivers capable of using a stick shift increases along with the additional cost for an automatic transmission. Given a large enough incremental price for automatic transmissions, many of us will discover our hidden abilities to shift manually.

The task structure and the cognitive processing necessary to complete the task determine what data need to be collected. In marketing research, the task is often the making of a purchase, that is, the selection of a single option from many available alternatives. Response substitution is not allowed. A ranking or a rating alters the task structure so that we are now measuring something other than what type of transmission will be requested. Different features become relevant when we choose, when we rate, and when we rank the same product configurations. Moreover, the divergence between choice and rating is only increased by repeated measures. The respondent will select the same alternative when minor features are varied, but that respondent will feel compelled to make minor adjustments in their ratings under the same conditions. Task structure elicits different cognitive processing as the respondent solves different problems. Ratings, ranking and choice are three different tasks. Each measures preference as constructed in order to answer the specific question.

Context matters when the goal is generalization, and one cannot generalize from the survey to the marketplace unless the essential task structure has been maintained. For example, I might wish to determine not only what type of car you intend to rent in your next purchase but what you might do over your next ten rentals. Now, we have changed the task structure because car rentals take place over time. We do not reserve our next ten rentals on a single occasion, nor can we anticipate how circumstances will change over time. The "next ten purchases question" seems to be more a measure of intensity than anticipated marketplace behavior.

Nor can one present a subset of available alternatives and ask for the most and least preferred from this reduced list without modifying the task structure and the cognitive processing used to solve the tasks. The alternatives that are available frame the choice task. I prefer A to B until you show me C, and then I decide not to buy anything. Or, adding a more expensive alternative to an existing product configuration increases the purchase of medium priced options by making them seem less expensive. Context matters. When we ignore it, we lose the ability to generalize our research to the marketplace. Finally, self-reports of the importance or contribution of features are not context-free; they simply lack any explicit context so that respondents can supply whatever context comes to mind or they can just chit-chat.

The implications for statistical modeling in R are clear. We begin with a description of the marketplace task. This determines our data collection procedures and places some difficult demands on the statistical model. For example, purchase requires a categorical dependent variable and a considerable amount of data to yield individual estimates. Yet, we cannot simply increase the number of choice sets given to each respondent because repeated measures from the same individual alters that individual's preferences (e.g., price sensitivity tends to increase over repeated exposures to price variation). Bayesian modeling within R allows us to exploit the hierarchical structure within a data set so that we can use data from all the respondents to compensate for our inability to collect much information from any one person. However, borrowing data from others in hierarchical Bayes is not unlike borrowing clothes from others; the sharing works only when the others are exchangeable and come from the same segment with a common distribution of estimates.
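The sketch below is not a full hierarchical Bayes choice model (packages such as bayesm or rstan would be used for that); it is only a toy empirical-Bayes calculation with made-up numbers showing what "borrowing" buys you: each respondent's estimate is pulled toward the group mean in proportion to how little data that respondent supplies.

#toy illustration of borrowing strength across respondents
set.seed(42)
n_resp <- 20; n_obs <- 5
true_theta <- rnorm(n_resp, mean = 0, sd = 1)              #between-respondent spread
y <- matrix(rnorm(n_resp * n_obs, true_theta, 2), n_resp)  #few noisy observations per person
shrink <- 1^2 / (1^2 + 2^2 / n_obs)     #between variance / (between + within of the mean)
eb <- mean(y) + shrink * (rowMeans(y) - mean(y))
#shrunken estimates land between each respondent's own mean and the grand mean
round(cbind(own = rowMeans(y), shrunk = eb, truth = true_theta), 2)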

None of this seems to be traditional preference elicitation, where we assume that preference is established and well-formed, requiring only some means for expression. Preference or value is the latent variable responsible for all observed indicators. Different elicitation methods may introduce some unique measurement effects, but they all tap the same latent construct. Simon, on the other hand, sees judgment and decision making as a form of problem solving. Preferences can still be measured, but preferences are constructed as solutions to specific problems within specific task structures. Although preference elicitation is clearly not dead, we can expect to see increasing movement toward context awareness in both computing and marketing.




Friday, January 17, 2014

Metaphors Matter: Factor Structure vs. Correlation Network Maps

The psych R package includes a data set called "bfi" with self-report ratings on 25 personality items along a 6-point agreement scale. All the details are provided in the documentation accompanying the package. My focus is how to represent the correlations among these ratings: factor analysis or network graphics?

Let's start with the correlation network map produced by the R package qgraph. As always, all the R code can be found at the end of this post.


First, we need to discover the underlying pattern, so we will begin by looking for nodes with the highest correlations and thus interconnected with the thickest lines. Red lines indicate negative correlations (e.g., those who claim that they are "indifferent to others" are unlikely to tell us that they "inquire about others" or "comfort others"). Positive correlations are shown in green (e.g., several nodes toward the bottom of the network suggest that those who report "mood swings" and "panic easily" also said that they are easy to anger and irritate). The node "reflect on things" seems to be misplaced, but it is not. The thin red and green lines suggest that it has uniformly low correlations with all the other items, which explains why it is positioned at the periphery but closest to the other four items with which it is the most correlated.

Using this approach, we can identify several regions that are placed near each other because of their interconnections.  For instance, the personal problems mentioned previously and located toward the bottom of the graph are separated from but linked to the measures of introversion ("difficult approach others" and "don't talk"), which in turn have strong negative correlations with extroversion ("makes friends").  As we continue up the graph on the left side, we find an active openness to others that becomes take charge and conscientious. If we continue back down the right side, respondents note what might be called work-related problems. Now, we have our story, and we can see the two-dimensional structure defining the correlation network: internal vs. external and in-control vs. out-of-control.

Next, we can compare this network representation with the more traditional factor model. Why do we observe correlations among observed variables? Correlations are the results of latent variables. We see this in the factor model diagram created using the same data. For example, individuals possess some degree of neuroticism (labeled RC2); therefore, the five personal problem items are intercorrelated. The path coefficient associated with each arrow indicates the correlation between the factor and the observed variable, and the product of the path coefficients for any two observed variables is our estimate of the correlation between those two observed variables.
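You can check this claim directly. The sketch below, assuming the ratings data frame built in the code at the end of this post, compares the observed correlations among the first five items with the correlations reproduced from the loadings (summing the products across the five components), so the off-diagonal entries should be close:

#observed versus reproduced correlations from the loadings
pc <- principal(ratings, nfactors = 5)
L <- unclass(pc$loadings)
round(cor(ratings, use = "pairwise")[1:5, 1:5], 2)  #observed
round((L %*% t(L))[1:5, 1:5], 2)                    #implied by the factor model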

One should recognize that the two diagrams seek to account for the same correlation matrix. The factor model does so by postulating the presence of unseen forces or latent variables. However, we never observe neuroticism, and we understand that all we have is a pattern of higher correlations among those five self-reports. Without compelling evidence for the independent existence of such a latent variable, we might try to avoid committing the reification fallacy and look for a different explanation.

The network model provides an alternative account. Perhaps the best overview of this approach can be found at the PsychoSystems Project. From a network perspective, correlations are observed because the nodes mutually interact. This is not a directed graph attempting to separate cause and effect. It is not a causal model. Perhaps in the beginning, there was a causal connection with one node occurring first and impacting the other nodes. But over time, these nodes have come to mutually support one another so that the unique effects of the self-report ratings can no longer be untangled.

Which of these two representations is better? If the observed variables are true reflections of an underlying trait that can be independently established, then the factor model offers a convenient hierarchical model. We think that we are observing five different things, but in fact, we are measuring five different manifestations of one underlying construct. On the other hand, a network of mutually supportive observations cannot be represented using a factor model. There are no factors, and asserting so ends the discussion prematurely. What are the relationships among the separate nodes? How can one intervene to break the cycle? Are there multiple leverage points? In previous posts, I showed how much can be gained using a network visualization of a key driver analysis and how much can be lost relying solely on an input-output regression model. Besides, why would you not generate the map when, as shown below, R makes it so easy to do?


R code to create the two plots:

library(psych)
data(bfi)
 
ratings<-bfi[,1:25]
names(ratings)<-c(
  "Indifferent of others",
  "Inquire about others",
  "Comfort others",
  "Love children",
  "Make people at ease",
  "Exacting in my work",
  "Until perfect",
  "Do by plan",
  "Do halfway",
  "Waste time",
  "Don't talk",
  "Difficult approach others",
  "Know how to captivate people",
  "Make friends",
  "Take charge",
  "Angry easily",
  "Irritated easily",
  "Mood swings",
  "Feel blue",
  "Panic easily",
  "Full of ideas",
  "Avoid difficult reading",
  "Carry conversation higher",
  "Reflect on things",
  "Not probe deeply"
)
 
fa.diagram(principal(ratings, nfactors=5), main="")
 
library(qgraph)
qgraph(cor(ratings, use="pairwise"), layout="spring",
       label.cex=0.9, labels=names(ratings), 
       label.scale=FALSE)


Friday, January 10, 2014

Finding the R community a barrier to entry, Python looks elsewhere for lunch

Tal Yarkoni's post on "The homogenization of scientific computing, or why Python is steadily eating other languages' lunch" is an enjoyable read of his transition from R to Python. He makes a good case, and I have no argument with his reasoning or the importance of Python in his work. But my experience has not been the same. I am a methodologist working in marketing. I could have called myself a data analyst in the sense that John Tukey used that term back in his 1962 paper on The Future of Data Analysis. Bill Venables speaks of R in a similar manner and quotes Tukey in his keynote at UseR! 2012, "Statistics work is detective work!" I like that description.

So when I turn to R, I am looking for more than code. "The game is afoot!" I require all the usual tools and perhaps something new or from another field of research. As an example, marketing is concerned with heterogeneity because "one size does not fit all." But every field is concerned with heterogeneity. It's the second moment of a distribution. We refer to it as heterogeneity in marketing, but you might call it variability, variation, dispersion, spread, diversity, or individual differences. There are even more words for the attempt to summarize and explain the second moment: density estimation, finite mixtures, seriation, sorting, clustering, grouping, segmenting, graph cutting, partitioning, and tessellation. R has a package for every term, from many differing points of view, and with more on the way every day.

Detective work borrows whatever assists in the hunt. As a marketing scientist trying to understand customer heterogeneity, R provides everything I need for clustering and finite mixture modeling. Moreover, R contributors provide more than a program, writing some of the best and most insightful papers in the field. However, why restrict myself to traditional approaches to understanding heterogeneity when R includes access to archetypal analysis, item response theory, and latent variable mixture models? These are three very different approaches that I can borrow only because they share a common R language.  It is extremely difficult to learn from fields with a different vocabulary. Even if the underlying math is the same, everything is called by a different name. R imposes constraints on the presentation of the material so that comprehension is still difficult but no longer impossible.

Of course, Python also has a mixture package, and perhaps at some point in the future we will see a Python community that will compete with R. Until then, Python will have to skip lunch.


Monday, January 6, 2014

An Introduction to Statistical Learning with Applications in R

Statistical learning theory offers an opportunity for those of us trained as social science methodologists to look at everything we have learned from a different perspective. For example, missing value imputation can be seen as matrix completion, and recommender systems can be used to fill in empty questionnaire items that were never shown to more than a few respondents by design. It is not difficult to show how to run the R package softImpute that makes all this happen. But it can be overwhelming trying to learn about the underlying mechanism in enough detail that you have some confidence that you know what you are doing. One does not want to spend the time necessary to become a statistician, yet we need to be aware of when and how to use specific models, what can go wrong, and what to do when something goes wrong. At least with R, one can run analyses on data sets and work through concrete examples.
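For the curious, here is a minimal softImpute sketch with a made-up low-rank ratings matrix; the sizes and names are mine for illustration, not from the book:

#complete a low-rank matrix with 30% of its entries missing by design
library(softImpute)
set.seed(123)
u <- matrix(rnorm(100 * 2), 100, 2)
v <- matrix(rnorm(10 * 2), 10, 2)
full <- u %*% t(v)                         #the matrix we never fully observe
x <- full
x[sample(length(x), 300)] <- NA            #items never shown to these respondents
fit <- softImpute(x, rank.max = 2, lambda = 1, type = "als")
filled <- complete(x, fit)                 #fill in the empty cells
cor(filled[is.na(x)], full[is.na(x)])      #recovery of the held-out entries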

The publication of An Introduction to Statistical Learning with Applications in R (download the book pdf) provides a gentle introduction with lots of R code. The book achieves a nice balance and is well worth looking at, both for the beginner and for the more experienced analyst who needs to explain these methods to others with less training. As a bonus, Stanford's OpenEdX has scheduled a MOOC by Hastie and Tibshirani beginning January 21 using this textbook.

Thursday, December 19, 2013

Latent Variable Mixture Models (LVMM): Decomposing Heterogeneity into Type and Intensity

Adding features to a product can be costly, so brands have an incentive to include only those features most likely to increase demand. In the last two posts (first link and second link), I have recommended what could be called a "features stress test" that included both a data collection procedure and some suggestions for how to analyze that data.

Although the proposed analysis will work with any rating scale, one should consider replacing the traditional importance measure with behaviorally-anchored categories. That is, we discontinue the importance ratings with their hard-to-know-what-is-meant-by endpoints of very important and very unimportant, and we substitute a sequence of increasingly demanding actions that require consumers to decide how much they are willing to do or sacrifice in order to learn about or obtain a product with the additional feature. For example, as outlined in the first link, respondents choose 1=not interested, 2=nice to have, 3=tie-breaker, or 4=pay more for (an ordinal scale suitable for scaling with item response theory). The modifier "stress" suggests that the possible actions can be made more and more extreme until all but the most desired features fail to pass the test (e.g., "would pay considerably more for" instead of "pay more for"). The resulting data enable us to compare the impact of different features across consumers, given that the same feature prioritization holds for everyone.

To be clear, our focus is on the features, and consumers are simply the measuring instrument. What is the impact of Feature A on purchase interest? We ask Consumer #1, and then Consumer #2 and so on. Since every consumer rates every feature, we can allow our consumers to hold varying standards of comparison, as long as all the features rise or fall together. Thus, it does not concern us if some of our consumers are more involved in the product category and report uniformly greater interest in all the features. Interestingly, although our focus was the feature, we learn something about consumer heterogeneity, which could be useful in future targeting.

Mixture of Customer Types with Different Feature Priorities

Our problem, of course, is that feature impact depends on consumer needs and desires. We cannot simply assume that there is one common feature prioritization that is shared by all. We may not wish to go so far as to suggest a unique feature customization for every customer, but certainly we are likely to find a few segments wanting different feature configurations. As a result, my sample of respondents is not uniform but consists of a mixture or composite of two or more customer types with different feature priorities.

The last two posts with links given above have provided some detail outlining why I believe that rating scales are ordinal and why polytomous item response theory (IRT) provides useful models of the response generation process. I have tried in those posts to provide a gentler introduction to finite mixtures of IRT models, encouraging a more exploratory two-step process of k-means on person-centered scores followed by graded response modeling for each cluster uncovered.

The claim made by a mixture model is that every respondent belongs to a latent class with its own feature prioritization. Yet, we observe only the feature ratings. However, as I showed in the last post, those ratings contain enough information to identify the respondent's latent class and the profile of feature impact for each of the latent classes. Now for the caveat: the ratings contain sufficient information only if we assume that our data are generated as a finite mixture of some small number of unidimensional polytomous IRT models. Fortunately, we know quite a bit about consumer judgment and decision making so that we have some justification for our assumptions other than that the models seem to fit.

The R package mixRasch Does It Simultaneously

Yes, R can recover the latent class as well as provide person and item estimates with one function called mixRasch (the same name is used for both the function and the package). If my ratings were a binary yes/no or agree/disagree, I would have many more R packages available for the analysis (see Section 2.4 for an overview of mixture IRT in R).

The mixRasch() function is straightforward. You tell it the data, the maximum number of iterations, the number of steps or thresholds, the IRT model, and the number of latent classes:

mixRasch(ratings, max.iter=500, steps=3, model="PCM", n.c=2)

The R code to generate the data mixing two groups of respondents with different feature priorities can be found in the last post. The appendix at the end of this post lists the additional R code needed to run mixRasch. The number of steps or thresholds is one less than the number of categories. We will be using the partial credit model (PCM), which behaves in practice much like the graded response model, although the category contrasts are not the same and there is that constant slope common to Rasch models. Of course, there is a lot more to be concerned about when using mixRasch and joint maximum likelihood estimation, and perhaps there will be time in a later post to discuss all that can go wrong.  For now, we will look at the output to discover if we have learned anything about different types of feature prioritization and varying levels of intensity with which consumers want those features.

My example uses nine undefined features divided into three sets of three features each.  The first three features have little appeal to anyone in the sample. Consumer heterogeneity is confined to the last six features. The 200 respondents belong to one of two groups:  the first 100 whose preferences follow the ranking of the features from one to nine and the second 100 who preferred most the middle three features. The details can be found in that much-referenced previous post.

Although I deliberately wanted to keep the example abstract, you can personalize it with any product category of your choice. For example, banking customers can be split into online and in-branch types. The in-branch customer wants nearby branches with lots of helpful personnel. The online customer wants free bill-paying and mobile apps. Both types vary in their usage intensity and their product involvement, so that we expect to see differences within each type reflected in the size of the average rating across all the features. If you don't like the banking example, you can substitute gaming or dining or sports or kitchen appliances or just about any product category.

The output from the mixRasch function is a long list with elements needing to be extracted.  First, we want to know the latent class membership for our 200 respondents. It is not all-or-none but a probability of class membership summing to one. For example, the first respondent has a 0.80 likelihood of belonging to the first latent class and a 0.20 probability of being from the second latent class. This information can be found in the list element $class for every respondent except those giving all the features either the highest or lowest scores (e.g., our respondent in row 155 rated every feature with a four and row 159 contained all ones). If we use the maximum probability to classify respondents into mutually exclusive latent classes, the mixRasch function correctly identifies 84% of the respondents (we only know this because we randomly simulated the data). I should mention that the classification from the mixture Rasch model is not identical to the row-centered k-means from the last post, but there is 92% agreement for this particular example.

Finally, were we successful at recovering the feature prioritizations used to simulate the ratings? In the table below, the D1 column contains the difficulty parameters for the 100 respondents in the first segment. The adjacent column LC1 shows the recovered parameter estimates for the 94 respondents in the first latent class. Similar results are shown for the second segment and latent class in the D2 and LC2 columns. As you may recall, two respondents giving all ones or all fours could not be classified by the mixRasch function.

       D1     LC1      D2     LC2
n     100      94     100     104
V1   1.50    1.71    1.50    1.76
V2   1.25    2.06    1.25    1.59
V3   1.00    0.99    1.00    1.18
V4   0.25    0.62   -1.50   -1.79
V5   0.00    0.24   -1.25   -1.49
V6  -0.25   -0.42   -1.00   -0.88
V7  -1.00   -1.35    0.25    0.30
V8  -1.25   -1.69    0.00   -0.35
V9  -1.50   -2.16   -0.25   -0.33


What have we learned from this and the last two posts?

A single screening question will tell me if I should include you in my survey of the wine market. Determining if you are a wine enthusiast will require many more questions, and it is likely that you will need to match a pattern of responses before classification is final. Yet, typing alone will not be adequate since systematic variation remains after your classification as a wine enthusiast. It's a matter of degree as one moves from novice to expert, from refrigerator to cellar, and from tasting to wine club host. Our clusters are no longer spherical or elliptical clumps or even regions but elongated networks of ever increasing commitment. As noted in one of my first posts on Archetypal Analysis, the admonition that "one size does not fit all" can be applied to both the need for segmentation and the segmentation process itself. Customer heterogeneity may be more complex than can be represented by either a latent class or a latent trait alone.

The post was titled "Latent Variable Mixture Models" in an attempt to accurately describe the approach being advanced. The book Advances in Latent Variable Mixture Models was published in 2007, so clearly my title is not original. In addition, a paper with the same name from nursing research provides a readable introduction (e.g., depression is identified by a symptom pattern but differs in intensity from mild to severe). Much of this work uses Mplus instead of R. However, we relied on the R package mixRasch in this post, and R has flexmix, psychomix, mixtools and more that all run some form of mixture modeling. Pursuing this topic would take us some time. So, I am including these references more as a postscript because I wanted to place this post in a broader context without having to explain that broader context.


Appendix with R Code


In order to create the data in ratings, you will need to return to the last post and run portions of the R code listed at the end of that post.

library(mixRasch)
 
# need to set the seed only if
# we want the same result each
# time we run mixRasch
set.seed(20131218)
mR2<-mixRasch(ratings, max.iter=500,
              steps=3, model="PCM", n.c=2)
mR2
 
# shows the list structure 
# containing the output
str(mR2)
 
# latent cluster membership
# probability and max classification
round(mR2$class,2)
cluster<-max.col(mR2$class)
 
# comparison with simulated data
table(c(rep(1,100),rep(2,100)),cluster)
 
# comparison with row-centered
# kmeans from last post
table(cluster,kcl_rc$cluster)


Sunday, December 15, 2013

The Complexities of Customer Segmentation: Removing Response Intensity to Reveal Response Pattern

At the end of the last post, the reader was left assuming respondent homogeneity without any means for discovering if all of our customers adopted the same feature prioritization. To review, nine features were presented one at a time, and each time respondents reported the likely impact of adding the feature to the current product. Respondents indicated feature impact using a set of ordered behaviorally-anchored categories in order to ground the measurement in a realistic market context. This grounding is essential because feature preference is not retrieved from a table with thousands of decontextualized feature value entries stored in memory. Instead, feature preference is constructed as needed during the purchase process using whatever information is available and whatever memories come to mind at the time. Thus, we want to mimic that purchase process when a potential customer learns about a new feature for the first time. Of course, consumers will still respond if asked about features in the abstract without a purchase context. We are all capable of chitchat, the seemingly endless lists of likes and dislikes that are detached from the real world and intended more for socializing than accurate description.

Moreover, the behaviorally-anchored categories ensure both feature and respondent differentiation since a more extreme behavior can always be added if too many respondents select the highest category. That is, if a high proportion of the sample tell us that they would pay for all the features, we simply increase the severity of the highest category (e.g., would pay considerably more for) or we could add it as an additional category. One can think of this as a feature stress test. We keep increasing the stress until the feature fails to perform. In the end, we are able to achieve the desired differentiation among the features, for only the very best performing features will make it into the highest category. At the same time, we are enhancing our ability to differentiate among respondents because only those with the highest demand levels will be selecting the top-box categories.

Customers Wanting Different Features

Now, what if we have customer segments wanting different features? While we are not likely to see complete reversals of feature impact, we often find customers focusing on different features. As a result, many of the features will be rated similarly, but a few features that some customers report as having the greatest influence will be seen as having a lesser impact by others. For instance, some customers attend more to performance, while other customers place some value on performance but are even more responsive to discounts or rewards. However, everyone agrees that the little "extras" have limited worth.

Specifically, in the last post the nine features were arranged in ascending sets of three with difficulty values of {1.50, 1.25, 1.00}, {0.25, 0, -0.25} and {-1.00, -1.25, -1.50}. You might recall that difficulty refers to how hard it is for the feature to have an impact. Higher difficulty is associated with the feature failing the stress test. Therefore, the first feature with a difficulty of 1.50 finds it challenging to boost interest in the product. On the other hand, the difficulty scores for impactful features, such as the last feature, will be negative and large. One interprets the difficulty scale as if it were a z-score because respondents are located on the same scale and their distribution tends toward normal.

Measurement follows from a model of the underlying processes that generate the item responses, which is why it is called item response theory. Features possess "impactability" that is measured on a difficulty scale. Customers vary in "persuadability" that we measure on the same scale. When the average respondent (with latent trait theta = 0) confronts an average feature (with difficulty d=0), the result is an average observed score for that feature.

I do not know feature impact or customer interest before I collect the rating data. But afterwards, assuming that the features have the same difficulty for everyone, I can use the item scores aggregated across all the respondents to give me an estimate of each feature's impact. Those features with the largest observed impact are the least difficult (large negative values), and those with little effect are the most difficult (large positive values). Then, I can use those difficulty estimates to determine where each respondent is located. Respondents who are impacted by only the most popular features have less interest than respondents responding to the least popular features. How do I know where you fall along the latent persuadability dimension? The features are ordered by their difficulty, as if they were mileposts along the same latent scale that differentiates among respondents. Therefore, your responses to the features tell me your location.
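A crude version of this logic can be run in a couple of lines. Assuming the ratings data frame simulated in the appendix of this post, column means stand in for feature impact and row means for respondent persuadability (a graded response or Rasch model would give the proper estimates):

#quick-and-dirty impact and persuadability estimates from the raw ratings
feature_impact <- colMeans(ratings)   #higher mean = less difficult = more impact
persuadability <- rowMeans(ratings)   #higher mean = more persuadable respondent
sort(round(feature_impact, 2))
hist(persuadability, main = "Respondent persuadability (person means)")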

A Simulation to Make the Discussion Concrete

Estimation becomes more complicated when different subgroups want different features. In the item response theory literature, one refers to such an interaction as differential item functioning (DIF). We will avoid this terminology and concentrate on customer segmentation using feature impact as our basis. A simulation will clarify these points by making the discussion more concrete.

A simple example that combines the responses from two segments with the following feature impact weights will be sufficient for our purposes.

             F1    F2    F3     F4     F5     F6     F7     F8     F9
Segment 1  1.50  1.25  1.00   0.25   0.00  -0.25  -1.00  -1.25  -1.50
Segment 2  1.50  1.25  1.00  -1.50  -1.25  -1.00   0.25   0.00  -0.25

Segment 1 shows the same pattern as in the last post with the features in ascending impact and gaps separating the nine into three sets of three features as described above. Segment 2 looks similar to Segment 1, except that the middle three features are the most impactful. It is a fairly common finding that low-scoring features, such as Features 1 through 3, tend to lack impact or importance across the entire sample. On the other hand, when there are feature sets with substantial impact on some segment, those features tend to have some value to everyone. Thus, price is somewhat important to everyone, and really important to a specific segment (which we call the price sensitive). At least, this was my rationale for defining these two segments. Whether you accept my argument or not, I have formed two overlapping segments with some degree of similarity because they share the same weights for the first three features.

Cluster Analyses With and Without Response Intensity

As I show in the R code in the appendix, when you run k-means, you do not recover these two segments. Instead, as shown below with the average ratings across the nine features for each cluster profile, k-means separates the respondents into a "want it all" cluster with higher means across all the features and a "naysayer" cluster with lower means across all the features. What has happened?

Cluster means from a kmeans of the 200 respondents

             F1    F2    F3    F4    F5    F6    F7    F8    F9    N
Cluster 1  2.27  2.24  2.59  3.33  3.39  3.47  3.18  3.53  3.60  121
Cluster 2  1.16  1.41  1.29  2.01  1.94  1.86  1.96  2.03  2.05   79

We have failed to separate the response pattern from the response intensity. We have forgotten that our ratings are a combination of feature impact and respondent persuadability. The difficulties or feature impact scores specify only the response pattern. Although this is not a deterministic model, in general, we expect to see F1<F2<F3<F4<F5<F6<F7<F8<F9 for Segment #1 and F1<F2<F3<F7<F8<F9<F6<F5<F4 for Segment #2. In addition, we have response intensity. Respondents vary in their persuadability. Some have high product involvement and tend to respond more positively to all the features. Others are more casual users with lower scores over all the features. In this particular case, response intensity dominates the cluster analysis, and we see little evidence of the response patterns that generated the data.

A quick solution is to remove the intensity score, which is the latent variable theta.  We can use the person mean score across all the features as an estimate of theta and cluster using deviation from the person mean.  In the R code I have named this transformed data matrix "ipsative" in order to emphasize that the nine features scores have lost a degree of freedom in the calculation of the row mean and that we have added some negative correlation among the features because they now must sum to zero.  This time, when we run the k-means on the row centered ratings data, we recover our segment response patterns with 84% of the respondents placed in the correct segment. Obviously, the numbering has been reversed so that Cluster #1 is Segment #2 and Cluster #2 is Segment #1.

Cluster means from a kmeans of the person-centered data matrix

              F1     F2     F3     F4     F5     F6     F7     F8     F9    N
Cluster 1  -0.78  -0.77  -0.54   0.89   0.74   0.44  -0.24   0.15   0.12   86
Cluster 2  -0.65  -0.53  -0.42  -0.21  -0.08   0.18   0.45   0.58   0.68  114

Clustering using person-centered data may look similar to other methods that correct for halo effects. But that is not what we are doing here.  We are not removing measurement bias or controlling for response styles. Although all ratings contain error, it is more likely that the consistent and substantial correlations observed among such ratings are the manifestation of an underlying trait. When the average ratings co-vary with other measures of product involvement, our conclusion is that we are tapping a latent trait that can be used for targeting and not a response style that needs to be controlled. It would be more accurate to say that we are trying to separate two sources of latent variation, the response pattern (a categorical latent variable) and response intensity (a continuous latent variable).

Why do I claim that response pattern is categorical and not continuous? Even with only a few features, there are many possible combinations and thus more than enough feature prioritization configurations to form a continuum. Yet, customers seem to narrow their selection down to only a few feature configurations that are marketed by providers and disseminated in the media. This is why response pattern can be treated as a categorical variable. The consumer adopts one of the feature prioritization patterns and commits to it with some level of intensity. Of course, they are free to construct their own personalized feature prioritization, but consumers tend not to do so. Given the commonalities in usage and situational constraints, along with limited variation in product offerings, consumers can simplify and restrict their feature prioritization to only a few possible configurations. In the above example, one selects an orientation, Features 4, 5 and 6 or Features 7, 8, and 9. How much one wants either set is a matter of degree.

Maps from Multiple Correspondence Analysis (MCA)

It is instructive to examine the multiple correspondence map that we would have produced had we not run the person-centered cluster analysis, that is, had we mistakenly treated the 200 respondents as belonging to the same population. In the MCA plot for the 36 feature categories displayed below, we continue to see separation between the categories so that every feature shows the same movement from left to right along the arc as one moves from the lowest dark red to the highest dark blue (i.e., from 1 to 4 on the 4-point ordinal scale). However, the features are no longer arrayed in order from Feature 9 through Feature 1, as one would expect given that only 100 of the 200 respondents preferred the features in the original rank ordering from 1 through 9. The first three features remain in order because there is agreement across the two segments, but there is greater overlap among the top six features than that which we observed in the prior post where everyone belonged to Segment 1. Lastly, we see the arc from the previous post that begins in the upper left corner and extends with a bend in the middle until it reaches the upper right corner. We obtain a quadratic curve because the second dimension is a quadratic function of the first dimension.


Finally, I wanted to use the MCA map to display the locations of the respondents and show the differences between the two cluster analyses with and without row centering. First, without row centering the two clusters are separated along the first dimension representing response intensity or the average rating across all nine features. Our first cluster analysis treated the original ratings as numeric and identified two groupings of respondents, the naysayer (red stars) and those wanting it all (black stars). This plot confirms that response intensity dominates when the ratings are not transformed.


Next, we centered our rows by calculating each respondent's deviation score about their mean rating and reran the cluster analysis with these transformed ratings. We were able to recover the two response generation processes that were used to simulate the data. However, we do not see those segments as clusters in the MCA map. In the map below, the red and black stars from the person-centered cluster analysis form overlapping arcs of respondents spread out along the first dimension. MCA does not remove intensity. In fact, response intensity is the latent trait responsible for the arc.


The final plot presents the results from the principal component analysis for the person-centered data. The dimensions are the first two principal components and the arrows represent the projections of the features onto the map. Respondents in the second cluster with higher scores on the last three features are shown in red, and the black stars indicate respondents from the first cluster with preferences for the middle three features. This plot from the principal component analysis on the person-centered ratings demonstrates how deviation scores remove intensity and reveal segment differences in response pattern.


Appendix with R code needed to run the above analyses.
For a description of the code, please see my previous post.

library(psych)
d1<-c(1.50,1.25,1.00,.25,0,-.25,
      -1.00,-1.25,-1.50)
d2<-c(1.50,1.25,1.00,-1.50,-1.25,-1.00,
      .25,0,-.25)
 
set.seed(12413)
bar1<-sim.poly.npl(nvar = 9, n = 100, 
                   low=-1, high=1, a=NULL, 
                   c=0, z=1, d=d1, 
                   mu=0, sd=1.5, cat=4)
bar2<-sim.poly.npl(nvar = 9, n = 100, 
                   low=-1, high=1, a=NULL, 
                   c=0, z=1, d=d2, 
                   mu=0, sd=1.5, cat=4)
rating1<-data.frame(bar1$items+1)
rating2<-data.frame(bar2$items+1)
apply(rating1,2,table)
apply(rating2,2,table)
ratings<-rbind(rating1,rating2)
 
kcl<-kmeans(ratings, 2, nstart=25)
kcl
 
rowmean<-apply(ratings, 1, mean)
ipsative<-sweep(ratings, 1, rowmean, "-")
round(apply(ipsative,1,sum),8)
kcl_rc<-kmeans(ipsative, 2, nstart=25)
kcl_rc
table(c(rep(1,100),rep(2,100)),
      kcl_rc$cluster)
 
F.ratings<-data.frame(ratings)
F.ratings[]<-lapply(F.ratings, factor)
 
library(FactoMineR)
mca<-MCA(F.ratings)
 
categories<-mca$var$coord[,1:2]
categories[,1]<--categories[,1]
categories
 
feature_label<-c(rep(1,4),rep(2,4),rep(3,4),
                 rep(4,4),rep(5,4),rep(6,4),
                 rep(7,4),rep(8,4),rep(9,4))
category_color<-rep(c("darkred","red",
                      "blue","darkblue"),9)
category_size<-rep(c(1.1,1,1,1.1),9)
plot(categories, type="n")
text(categories, labels=feature_label, 
     col=category_color, cex=category_size)
 
mca2<-mca
mca2$var$coord[,1]<--mca$var$coord[,1]
mca2$ind$coord[,1]<--mca$ind$coord[,1]
plot(mca2$ind$coord[,1:2], col=kcl$cluster, pch="*")
plot(mca2$ind$coord[,1:2], col=kcl_rc$cluster, pch="*")
 
pca<-PCA(ipsative)
plot(pca$ind$coord[,1:2], col=kcl_rc$cluster, pch="*")
arrows(0, 0, 3.2*pca$var$coord[,1], 
       3.2*pca$var$coord[,2], col = "chocolate", 
       angle = 15, length = 0.1)
text(3.2*pca$var$coord[,1], 3.2*pca$var$coord[,2],
     labels=1:9)