Sunday, March 23, 2014

Warning: Clusters May Appear More Separated in Textbooks than in Practice

Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.

I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like "Shotguns and Pickups" and "Upper Crust"? Yet, are "Big City Blues" and "Low-Rise Living" really separate clusters or simply variations on a common set of dwelling constraints?

Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences are not clumping or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.

Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.

This is the Old Faithful geyser data from the "datasets" R package showing the waiting time in minutes between successive eruptions on the y-axis and the duration of the eruptions along the x-axis. It is worth your time to get familiar with Old Faithful because it is one of those datasets that gets analyzed over and over again using many different programs. There seem to be two concentrations of points: shorter eruptions that follow shorter waiting times and longer eruptions that follow longer waiting times. If we told the Mclust function from the mclust package that the scatterplot contains observations from G=2 groups, the function would produce a classification plot that looked something like this:

The red and the blue with their respective ellipses are the two normal densities that are getting mixed. It is such a straightforward example of finite mixture or latent class models (as these models are also called by analysts in other fields of research). If we discovered that there were two pools feeding the geyser, we could write a compelling narrative tying all the data together.

The mclust vignette or manual is comprehensive but not overly difficult. If you prefer a lecture, there is no better introduction to finite mixture than the MathematicalMonk YouTube video. The key to understanding finite mixture models is recognizing that the underlying latent variable responsible for the observed data is categorical, a latent class which we do not observe, but which explains the location and shape of the data points. Do you have a cold or the flu? Without medical tests, all we can observe are your symptoms. If we filled a room with coughing, sneezing, achy and feverish people, we would find a mixture of cold and flu with differing numbers of each type.
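The cold-or-flu generative story can be sketched in a few lines of R: a hidden categorical variable is drawn first, and the observed data follow from it. The class proportions and symptom means below are invented purely for illustration.

```r
# sketch of a finite mixture's generative process (all numbers invented)
set.seed(42)
n <- 1000
# latent class: 60% cold, 40% flu -- the analyst never observes this label
class <- sample(c("cold", "flu"), n, replace=TRUE, prob=c(.6, .4))
# observed symptom severity depends on the hidden class
severity <- ifelse(class == "cold",
                   rnorm(n, mean=3, sd=1),   # colds: milder symptoms
                   rnorm(n, mean=7, sd=1))   # flu: more severe symptoms
# all we get to see is the combined distribution
hist(severity, breaks=30, main="Observed mixture of two latent classes")
```

Estimation runs this story in reverse: from the combined severity scores alone, recover the proportions and the component parameters.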

This appears straightforward, but for the general case, how does one decide how many categories, in what proportions, and with what means and covariance matrices? That is, those two ellipses in the above figure are drawn using a vector with means for the x- and y-axes plus a 2x2 covariance matrix. The means move the ellipse over the space, and the covariance matrix changes the shape and orientation of the ellipse. A good summary of the possible forms is given in Table 1 of the mclust vignette.
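One can see the mean vector and covariance matrix at work by tracing a constant-density contour directly. The parameter values below are arbitrary, chosen only to resemble the Old Faithful scale:

```r
# trace a constant-density ellipse of a bivariate normal (arbitrary parameters)
mu <- c(2, 70)                               # mean vector: centers the ellipse
sigma <- matrix(c(0.3, 1.5, 1.5, 12), 2, 2)  # covariance: shape and orientation
theta <- seq(0, 2*pi, length.out=100)
circle <- rbind(cos(theta), sin(theta))      # points on a unit circle
# map the circle through the matrix "square root" of sigma, then shift by the mean
ell <- t(chol(sigma)) %*% circle + mu
plot(t(ell), type="l", xlab="duration", ylab="waiting")
```

Changing mu slides the ellipse around the plane; changing the off-diagonal of sigma tilts it; changing the diagonal stretches it along its axes.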

Unfortunately, the mathematics of the EM algorithm used to solve this problem gets complicated quickly. Fortunately, Chris Bishop provides an intuitive introduction in a 2004 video lecture. Starting at 44:14 of Part 4, you will find a step-by-step description of how the EM algorithm works with the Old Faithful data. Moreover, in Chapter 9 of his book, Bishop cleverly compares the workings of the EM and k-means algorithms, leaving us with a better understanding of both techniques.
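For the algorithmically curious, the E and M steps for a two-component univariate mixture fit in a dozen lines of R. This is a bare-bones sketch of the standard algorithm applied to the waiting times, not the full generality of mclust; the starting values are rough guesses.

```r
# minimal EM for a two-component normal mixture on Old Faithful waiting times
x <- faithful$waiting
p <- 0.5; mu <- c(60, 90); s <- c(10, 10)   # crude starting values
for (iter in 1:100) {
  # E step: responsibility of component 1 for each observation
  d1 <- p * dnorm(x, mu[1], s[1])
  d2 <- (1 - p) * dnorm(x, mu[2], s[2])
  r <- d1 / (d1 + d2)
  # M step: update mixing proportion, means, and standard deviations
  p <- mean(r)
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  s <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
         sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}
round(c(p, mu, s), 2)  # mixing weight, component means, component sds
```

Each pass softly assigns points to components (E) and then re-estimates each component from its weighted points (M), exactly the loop Bishop walks through with this data.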

If only my data showed such clear discontinuity and could be tied to such a convincing narrative.

Product Categories Are Structured around the Good, the Better, and the Best

Almost every product category offers a range of alternatives that can be described as good, better, and best. Such offerings reflect the trade-offs that customers are willing to make between the quality of the products they want and the amount that they are ready to spend. High-end customers demand the best and will pay for it. On the other end, one finds customers with fewer needs and smaller budgets accepting less and paying less. Clearly, we have heterogeneity, but are those differences continuous or discrete? Can we tell by looking at the data?

Unlike the Old Faithful data with its well-separated groupings of data points, product quality-price trade-offs look more like the following plot of 300 consumers indicating how much they are willing to spend and what they expect to get for their money (i.e., product quality is a composite index combining desired features and services).

There is a strong positive relationship between demand for quality and willingness to pay, so a product manager might well decide that there was opportunity for at least a high-end and a low-end option. However, there are no natural breaks in the scatterplot. Thus, if this data cloud is a mixture of distinct distributions, then these distributions must be overlapping.

Another example might help. As shown by John Cook, the distribution of heights among adults is a mixture of two overlapping normal distributions, one for men and another for women. Yet, as you can observe from Cook's plots, the mixture of men's and women's heights does not appear bimodal because the separation between the two distributions is not large enough. If you follow the links in Cook's post, eventually you will find the paper "Is Human Height Bimodal?", which clearly demonstrates that many mixtures of distributions appear to be homogeneous. We simply cannot tell that they are mixtures by looking only at the shape of the distribution for the combined data. The Old Faithful data with its well-separated bimodal curve provides a nice contrast, especially when we focus only on waiting time as a single dimension (Fitting Mixture Models with the R Package mixtools).
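The point is easy to verify numerically. Using round-number values in the vicinity of those usually quoted for heights in inches (these are illustrative figures, not Cook's exact ones), a 50/50 mixture of components less than two standard deviations apart has a single peak:

```r
# mixtures of two normals need not look bimodal (illustrative parameters)
h <- seq(50, 85, length.out=1000)
# components roughly two sds apart: the combined density is unimodal
mixture <- 0.5 * dnorm(h, mean=64, sd=3.5) + 0.5 * dnorm(h, mean=70, sd=3.5)
peaks <- sum(diff(sign(diff(mixture))) == -2)  # count interior local maxima
peaks  # one peak: the mixture looks homogeneous
# contrast with well-separated components, which produce two peaks
mixture2 <- 0.5 * dnorm(h, mean=60, sd=2) + 0.5 * dnorm(h, mean=74, sd=2)
sum(diff(sign(diff(mixture2))) == -2)  # two peaks
```

For equal variances, a 50/50 mixture of two normals is bimodal only when the means are more than two standard deviations apart, which is why the combined height distribution hides its two components.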

Perhaps then, segmentation does not require gaps in the data cloud. As Wedel and Kamakura note, "Segments are not homogeneous groupings of customers naturally occurring in the marketplace, but are determined by the marketing manager's strategic view of the market." That is, one could look at the above scatterplot, see the presence of a strong first principal component from the closeness of the data points to the principal axis of the ellipse, and argue that customer heterogeneity is a single continuous dimension running from the low- to the high-end of the product spectrum. Or, one could look at the same scatterplot and see three overlapping segments seeking basic, value, and premium products (good, better, best). Let's run mclust and learn if we can find our three segments.

When instructed to look for three clusters, the Mclust function returned the above result. The 300 observed points are represented as a mixture of three distributions falling along the principal axis of the larger ellipse formed by all the observations. However, if I had not specified three clusters and had asked Mclust to use its default BIC criterion to select the number of segments, I would have been told that there was no compelling evidence for more than one homogeneous group. Without any prior specification, Mclust would have returned a single homogeneous distribution, although as you can see from the R code below, my 300 observations were a mixture of three equal-sized distributions falling along the principal axis and separated by one standard deviation.

Number of Segments = Number Yielding Value to Product Management

Market segmentation lies somewhere between mass marketing and individual customization. When mass marketing fails because customers have different preferences or needs and customization is too costly or difficult, the compromise is segmentation. We do not need "natural" grouping, but just enough coherence for customers to be satisfied by the same offering. Feet come in many shapes and sizes. The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager's perspective.

My warning still holds. Reification can be dangerous. These segments are meaningful only within the context of the marketing problem created by trying to satisfy everyone with products and services that yield maximum profit. Some segmentations may return clusters that are well-separated and represent groups with qualitatively different needs and purchase processes. Many of these are obvious and define different markets. If you don't have a dog, you don't buy dog food. However, when segmentation seeks to identify those who feel that their dog is a member of the family, we will find overlapping clusters that we treat differently not because we have revealed the true underlying typology, but because it is in our interest. Don't be fooled into believing that our creative segment names reveal the true workings of the consumer mind.

Finally, what is true for marketing segmentation is true for all of cluster analysis. "Clustering: Science or Art?" (a 2009 NIPS workshop) raises many of these same issues for cluster analysis in general. Videos of this workshop are available at Videolectures. Unlike supervised learning with its clear criterion for success and failure, clustering depends on users of the findings to tell us if the solution is good or bad, helpful or not. On the one hand, this seems to make everything more difficult. On the other hand, it frees us to be more open to alternative methods for describing heterogeneity as it is now and how it evolves over time.  

We seek to understand the dynamic structure of diversity, which only sometimes takes the form of cohesive clusters separated by gaps. Other times, a model with only continuous latent variables seems to be the best choice (e.g., brand perceptions). And, not unexpectedly, there are situations where heterogeneity cannot be explained without both categorical and continuous latent variables (e.g., two or more segments seeking alternative benefit profiles with varying intensities).

Yet, even these three combinations cannot adequately account for all the forms of diversity we find in consumer data.  Innovation might generate a structure appearing more like long arrays or streams of points seemingly pulled toward the periphery by an archetypal ideal or aspirational goal. And if the coordinate space of k-means and mixture models becomes too limiting, we can replace it with pairwise dissimilarity and graphical clustering techniques, such as affinity propagation or spectral clustering. Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.

R code for all figures and analysis

#load required packages
library(mclust)   # Mclust for finite mixture modeling
library(MASS)     # mvrnorm for simulating multivariate normals

#plot the faithful data set (from the datasets package, available by default)
plot(faithful, pch="+")

#run mclust on faithful data with G=2 clusters
faithfulMclust<-Mclust(faithful, G=2)
summary(faithfulMclust, parameters=TRUE)

#create 3 segment data set: three equal-sized bivariate normals with
#means spaced along the principal axis (mean values assumed here)
set.seed(1)
sigma <- matrix(c(1.0,.6,.6,1.0),2,2)
mean1 <- c(-1,-1)
mean2 <- c(0,0)
mean3 <- c(1,1)
mydata1<-mvrnorm(n=100, mean1, sigma)
mydata2<-mvrnorm(n=100, mean2, sigma)
mydata3<-mvrnorm(n=100, mean3, sigma)
mydata<-rbind(mydata1, mydata2, mydata3)
colnames(mydata)<-c("Desired Level of Quality",
                    "Willingness to Pay")
plot(mydata, pch="+")

#run Mclust with 3 segments
mydataClust<-Mclust(mydata, G=3)
summary(mydataClust, parameters=TRUE)

#let Mclust decide on number of segments using its default BIC criterion
mydataBIC<-Mclust(mydata)
summary(mydataBIC, parameters=TRUE)



  1. I am running a conjoint study where I have individual model scores for each person. So each person has a constant and then 25 coefficients. I would like to cluster them by the data. I was thinking of doing a factor analysis of the coefficients and then k-means clustering using Pearson distance on the factors. Is this a good approach? What would you recommend?

    1. It depends. If the coefficients were estimated using hierarchical Bayes, you already have a model of individual heterogeneity. You probably estimated the coefficients assuming that individual coefficients were normally distributed about mean group-level values. It is important not to forget that hierarchical Bayes is a model of individual heterogeneity. One does not assume a single homogeneous population of coefficients in order to derive estimates and then run a cluster analysis looking for heterogeneity in that homogeneous population.

      If the coefficients were estimated separately for each respondent using only that respondent’s data, you may still have complications if your coefficients represent the effects of categorical variables. That is, unless you used some form of orthogonal coding of categorical effects, your coefficients are correlated by design. The raw data for a conjoint is the rating of the individual profiles. The coefficients are derived from those profile ratings or choices using the factorial structure of those profiles. Consequently, the coefficient space is likely to be constrained, raising issues for factor analysis, which depends on an independence assumption.

      In this case, where we have independent estimates for each individual and not pooled estimates as in hierarchical Bayes, I would recommend using a dissimilarity matrix and a graphical clustering procedure. For example, affinity propagation ought to yield results similar to k-means and the Orange County R User group has a Webinar introducing the R package “apcluster” on YouTube. You can start with the coefficients and calculate dissimilarity as the distance between coefficient profiles. If this seems too much, why not use the dist() function to calculate distances and run a hierarchical clustering with those distances. One can learn a lot from a dendrogram. Alternatively, you can do it all in one step using clustered heatmaps in R (search for “drawing heatmaps in R” in R Bloggers). I hope this helps.
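      The dist-plus-hclust suggestion takes only a few lines. Here is a sketch on simulated coefficients; the matrix is made up, so substitute your own respondent-by-coefficient matrix:

```r
# hierarchical clustering of respondent coefficient profiles (simulated data)
set.seed(123)
coefs <- matrix(rnorm(50 * 25), nrow=50, ncol=25)  # 50 respondents, 25 coefficients
rownames(coefs) <- paste0("resp", 1:50)
d <- dist(coefs)                 # Euclidean distance between coefficient vectors
hc <- hclust(d, method="ward.D2")
plot(hc, cex=0.6)                # inspect the dendrogram for natural groupings
groups <- cutree(hc, k=3)        # cut into 3 clusters if the tree suggests it
table(groups)
```

      The dendrogram is the payoff: long branches before a merge suggest real separation, while a smooth cascade of short branches warns that any cut point is arbitrary.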

  2. Thank you for the reply. I am not a statistician so please pardon anything that may be obvious or redundant. The 36 coefficients for each respondent were generated from an orthogonal design using dummy variable regression. The variables were split up into 6 categories with 6 tested statements in each. The design is such that at most one statement from each category can be present. The design also allows for a 0 condition, meaning not all categories are always present for each concept rated by a respondent. So there is little correlation between the 6 categories tested since not all of them are present all of the time. Each respondent had their own experimental design of 48 concepts which were dummy coded. So each person is made up of 48 rows, with 36 columns (each column is a dummy variable of 1 or 0). And then each row has a rating based on a 9 point scale. Does this change anything?

  3. If I understand your design, you have 6 factors with 7 levels each (no statement or 1 of 6 statements). You dummy coded the 7 levels using 6 dummies and one of the levels is the base with all 0s. This yields 6x6 or 36 estimates. Problem: the dummy coding for any one factor is a comparison with the level that is all zeros. You do not want to factor analyze such a variable set. You seem to be familiar with k-means, so why not run k-means on the 48 profile ratings on your 9 point scale? That is, if I am correct, everyone rated the same 48 profiles. Why not use the ratings as your basis for the segmentation? You can profile the resulting segments using your individual-level coefficients.

  4. Thank you again. I have a 7th level 0 so my constant value is based on no statements. If I don't have a 0 condition, then my constant will be based on 1 level from each factor. I would like the constant to be based on none of the levels. Regarding the clustering, each person has a different design. This was done so there is no design bias and so that many different statement combinations get tested by the total sample (as opposed to being limited to the same 48 concepts). I don't think I can cluster based on the 48 ratings by each respondent in this case.

  5. OK, I think I understand your study now. We cluster using factor scores when we believe that the observed variables are error-prone indicators of latent variables that are the actual drivers of the data. For example, my questionnaire contains 10 ratings of quality (because it is so easy to write these questions) but only two questions about price (because it is more difficult to ask about price). But we believe that price and quality ought to receive equal weight when calculating similarity among respondents. Factor scores are one way to accomplish this. Otherwise, quality gets counted 5 times as often as price when calculating the distance between two respondents (assuming same scale with equal variability).

    This is not the case in a conjoint study. The factor levels are the determinants of the ratings, and you have a 36-dimensional coefficient vector that can be input into k-means (assuming that you used a linear regression to calculate the coefficients). Each respondent is described by a 36-dimensional coefficient vector. The closer the two vectors, the more similar the respondents. Because individuals rated different profiles, you cannot use profiles. Instead you use the profile ratings to calculate the coefficients that you believe are the drivers of the ratings. At least that is the argument that is usually made to justify this type of cluster analysis.
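    In R, the cluster analysis described above is a single kmeans call on the respondent-by-coefficient matrix. The data below are simulated stand-ins; replace them with your own 36 columns of individual-level coefficients:

```r
# k-means on individual-level conjoint coefficients (simulated stand-in data)
set.seed(7)
coef_matrix <- matrix(rnorm(200 * 36), nrow=200, ncol=36)  # 200 respondents
km <- kmeans(coef_matrix, centers=3, nstart=25)  # nstart guards against local minima
table(km$cluster)             # segment sizes
round(km$centers[, 1:5], 2)   # first few coefficients of each segment center
```

    The segment centers are themselves 36-dimensional coefficient vectors, so they can be read the same way as an individual respondent's utilities.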

    1. Yes. So I was thinking to do factors first because certain variables follow the same pattern. My thinking is that the factors would group them together, so let's say if 15 of the levels are correlated, then they would become 1 factor and would not overweight the clustering.

    2. I consistently run into the same problems when I try to cluster utilities from message testing conjoint designs. First, there is always a sizeable group of flat-liners with little variation in the ratings they give the profiles. Second, the most important factor in the aggregate tends to be the most important factor for everyone. Often, this is true for the second and third most important factors. As a result, my clusters tend to have the same pattern of utilities except that the size of the utilities increases for the different clusters. I find the same pattern for k-means with and without factor scores. I have tried archetypal analysis and affinity propagation. I get the same result. Clearly, this is not what one would expect. Is it the mind-numbing effects of repeated presentations of similar profiles? Hopefully, you will have better luck. Let me know.

    3. Agreed :). One thing I have tried is that when I do a k-means cluster using Systat, it allows me to select a Pearson measure (as opposed to Euclidean or whatever). From what I understand, this will group all people who display the same coefficient patterns, so when clusters are created, it does not end up with clusters that are similar to each other with one being more or less extreme than the other; rather, the coefficients for each cluster end up being different from the others. I am not sure if that is always the case though.
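    For R users without Systat, the same pattern-matching idea can be approximated in two ways: row-standardize each respondent's coefficient vector so that level differences are removed before Euclidean k-means, or cluster on a correlation-based dissimilarity directly. A sketch with made-up data:

```r
# correlation-based clustering of coefficient patterns (simulated data)
set.seed(99)
coefs <- matrix(rnorm(100 * 36), nrow=100)   # 100 respondents, 36 coefficients
# option 1: standardize each row so k-means responds to pattern, not level
z <- t(scale(t(coefs)))
km <- kmeans(z, centers=3, nstart=25)
# option 2: hierarchical clustering on 1 - Pearson correlation between respondents
d <- as.dist(1 - cor(t(coefs)))
hc <- hclust(d, method="average")
groups <- cutree(hc, k=3)
table(km$cluster, groups)    # compare the two pattern-based partitions
```

    After row standardization, squared Euclidean distance is a monotone function of one minus the correlation between two respondents, so both options separate respondents by the shape of their utility profiles rather than their overall elevation.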