Tuesday, July 29, 2014

Variable Selection in Market Segmentation: Clustering or Biclustering?

Will you have that segmentation with one or two modes?

The data matrix for market segmentation comes to us with two modes, the rows are consumers and the columns are variables. Clustering uses all the columns to transform the two-mode data matrix (row and columns are different) into a one-mode distance matrix (rows and columns are the same) either directly as in hierarchical clustering or indirectly as in k-means. The burden falls on the analyst to be judicious in variable selection since too few will miss distinctions that ought to be made and too many will blur those distinctions that need to be made.

This point is worth repeating. Most data matrices have two modes with the rows and columns referring to different types. In contrast, correlation and distance matrices have only one mode with both the rows and columns referring to the same entities. Although a market segmentation places its data into a two-way matrix with the rows as consumers and the columns as the variables, the intent is to cluster the rows and ignore the columns once they have been used to define the distances between the rows. All the columns enter that distance calculation, which is why variable selection becomes so important in market segmentation.

Such an all-or-none variable selection may seem too restrictive given the diversity of products in many categories. We would not be surprised to discover that what differentiates one end of the product category is not what separates consumers at the other end. Examples are easy to find. Credit card usage can be divided into those who pay their bill in full each month and those who pay interest on outstanding balances. The two groups seek different credit card features, although there is some overlap. Business travelers are willing to pay for benefits that would not interest those on vacation, yet again one finds at least some commonality. Additional examples can be generated without difficulty for many product categories contain substantial heterogeneity in both their users and their offerings.

Biclustering offers a two-mode alternative that allows different clusters to be defined by different variables. The "bi" refers to the joint clustering of both rows and columns at the same time. All the variables remain available for selection by any cluster, but each cluster exercises its own option to incorporate or ignore. So why not refer to biclustering as two-mode clustering or simultaneous clustering or co-clustering? They do. Names proliferate when the same technique is rediscovered by diverse disciplines or when different models and algorithms are used. As you might expect, R offers many options (including biclust for distance-based biclustering). However, I will focus on NMF for factorization-based biclustering, a package that has proven useful with a number of my marketing research projects.

Revisiting K-means as Matrix Factorization

For many of us k-means clustering is closely associated with clouds of points in two-dimensional space. I ask that you forget that scatterplot of points and replace it with the following picture of matrix multiplication from Wikipedia:
What does this have to do with k-means? The unlabeled matrix in the lower right is the data matrix with rows as individuals and columns as variables (4 people with 3 measures). The green circle is the score for the third person on the third variable. It is calculated as a(1,1)*b(1,3)+a(3,2)*b(2,3). In a k-means A is the membership matrix with K columns as clusters indicators. Each row has a single entry with a one indicating its cluster membership and K-1 zeros for the other clusters. If Person #3 has been classified as cluster #1, then his or her third variable reduces to b(1,3) since a(1,1)=1 and a(3,2)=0. In fact, the entire row for this individual is simply a copy of the first row for B, which contains the centroid for the first cluster.

With two clusters k-means gives you one of two score profiles. You give me the cluster number by telling me which column of A is a one, and I will read out the scores you ought to get from the cluster profile in the correct row of B. The same process is repeated when there are more clusters, and the reproduced data matrix contains only K unique patterns. Cluster membership means losing your identity and adopting the cluster centroid as your data profile.

With a hard all-or-none cluster membership, everyone in the same cluster ought to have the same pattern of scores except for random variation. This would not be the case with soft cluster membership, that is, the rows of the membership matrix A would still sum to one but the entries would be probabilities of cluster membership varying from zero to one. Similarly, B does not need to be the cluster centroid. The rows of B could represent an archetype, an extreme or unusual pattern of scores. Archetypal analysis adapts such an approach and so does nonnegative matrix factorization (NMF), although the rows of B have different constraints. Both techniques are summarized in Section 14.6 of the online book Elements of Statistical Learning.

Given the title for this post, you might wish to know what any of this has to do with variable selection. The nonnegative in NMF restricts all the values of all three matrices to be either zero or a positive number: the data matrix contains counts or quantities, cluster membership is often transformed to look like a probability varying between 0 and 1, and the clusters are defined by always adding variables or excluding them entirely with a zero coefficient. As a result, we find in real applications that many of the coefficients in B are zero indicating that the associated variable has been excluded. The same is true for the A matrix suggesting a simultaneous co-clustering of the columns and rows forming high-density sub-matrices by rearranging these rows and columns in order to appear as homogeneous blocks. You can see this in the heatmaps from the NMF R package with lots of low-density regions and only a few high-density blocks.

Expanding the Definition of a Cluster

Biclustering shifts our attention away from the scatterplot and concentrates it directly on the data matrix, specifically, how it might be decomposed into components and then recomposed one building block at a time. Clusters are no longer patterns discovered in cloud of points plotted in some high-dimensional space and observed piecemeal 2 or 3 dimensions at a time. Nor are they sorts into groups or partitions into mixtures of different distributions. Clusters have become components that are combined together additively to reproduce the original data matrix.

In the above figure, the rows of B define the clusters in terms of the observed variables in the columns of B, and A shows how much each cluster contributes to each row of the data matrix. For example, a consumer tells us what information they seek when selecting a hotel. Biclustering sees that row of the data matrix as a linear combination of a small set of information search strategies. Consumers can hold partial membership in more than one cluster, and what we mean by belonging to a cluster is adopting a particular information search strategy. A purist relies on only one strategy so that their row in the cluster membership matrix will have one value close to one. Other consumers will adopt a mixture of strategies with membership spread across two or more row entities.

previous post provides more details, and there will be more to come.


  1. Very insightful! Although we do Segmentation exercises I've never actually thought of using a Bi-Clust. What would be your opinion in terms of translating a bi-clust model to make real world sense to clients. They already have a hard time understanding a regular clustering exercise! But am looking forward to some example that you would showcase !!

    1. I just posted an example using a dataset from a R package that you can access and run your own analyses. The post is called "Customer Segmentation Using Purchase History." I hope it helps.

  2. Nice write-up. From a substantial perspective, is their a difference to the traditional market segmentation approach, i.e. apply first a factor analysis and then the cluster algorithm?
    As highlighted by the previous comment, an example illustrating meaningful differences between the two approaches in a real-world scenario would be a great follow-up.

    1. I just posted a follow-up that ought to help. Biclustering calculates similarity between respondents using only some of the variables. Traditional clustering calculates distances using all the variables. Obviously, it is more complicated than this one point. But if forced to pick one aspect that separates biclustering from clustering, I would focus on biclustering’s attempt to differentiate among individuals using different variables. Put simply, biclustering ignores all the many variables with small discrepancies that get summed together to mislead and concentrates on the few variables with the largest differences. The language is a bit figurative, though it makes the point.

  3. Excellent! The point you made amplifies our possibilities in tying to capture the complexity of real-world data...thank you