Saturday, September 13, 2014

The Ecology of Data Matrices: A Metaphor for Simultaneous Clustering

"...a metaphor is an affair between a predicate with a past and an object that yields while protesting." 

It is, as if, data matrices were alive. The rows are species, and the columns are habitats. At least that seems to be the case with recommender systems. Viewers seek out and watch only those movies that they expect to like while remaining unaware of most of what is released or viewed by others. The same applies to music. There are simply too many songs to listen to them all. In fact, most of us have very limited knowledge of what is available. Music genre may have become so defining of who we are and who we associate with that all of us have become specialists with little knowledge of what is on the market outside of our small circle.

Attention operates in a similar manner. How many ads are there on your web page or shown during your television program? Brand recognition is not random for we are drawn to the ones we know and prefer. We call it "affordance" when the columns of our data matrices are objects or activities: what to eat on a diet or what to do on a vacation. However, each of us can name only a small subset of all that can be eaten when dieting or all that can be done on a vacation. Preference is at work even before thinking and is what gets stuff noticed.

Such data create problems for statistical modeling that focuses solely on the rows or on the columns and treats the other mode as fixed. For example, cluster analysis takes the columns as given and calculates pairwise distances between rows (hierarchical clustering) or distances of rows from cluster centroids (kmeans). This has become a serious concern for the clustering of high dimensional data as we have seen with the proliferation of names for the simultaneous clustering of rows and columns: biclustering, co-clustering, subspace clustering, bidimensional clustering, block clustering, two-mode or two-way clustering and many more. The culprit is that the rows and columns are confounded sufficiently that it makes less and less sense to treat them as independent entities. High dimensionality only makes the confounding more pronounced.

As a marketing researcher, I work with consumers and companies who have intentions and act on them. Thus, I find the ecology metaphor compelling. It is not mandatory, and you are free to think of the data matrix as a complex system within which rows and columns are coupled. Moreover, the ecology metaphor does yield additional benefits since numerical ecology has a long history of trying to quantify dissimilarity given the double-zero problem. The Environmetrics Task View list the R packages dealing with this problem under the subheading Dissimilarity Coefficients. Are two R programmers more or less similar if neither of them has any experience with numerical ecology in R? This is the double-zero problem. The presence of two species in a habitat does not mean the same as the absence of two species. One can see similar concerns raised in statistics and machine learning under the heading "the curse of dimensionality" (see Section 3 of this older but well-written explanation).

Simultaneous Clustering in R

In order to provide a lay of the land, I will name at least five different approaches to simultaneous clustering. Accompanying each approach is an illustrative R package. The heading, simultaneous clustering, is meant to convey that the rows and columns are linked in ways that ought to impact the analysis. However, the diversity of the proposed solutions makes any single heading unsatisfying.
  1. Matrix factorization (NMF),
  2. Biclustering (biclust),
  3. Variable selection  (clustvarsel),
  4. Regularization (sparcl) , and
  5. Subspace clustering (HDclassif).
Clearly, there is more than one R package in each category, but I was more interested in an example than a catalog. I am not legislating; you are free to sort these R packages into your own categories and provide more or different R packages. I have made some distinctions that I believe are important and selected the packages that illustrate my point. I intend to be brief.

(1) Nonnegative matrix factorization (NMF) is an algorithm from linear algebra that decomposes the data matrix into the product of a row and column representation. If one were to separate clustering approaches into generative models and summarizing techniques, all matrix factorizations would fall toward the summary side of the separation. My blog is full of recent posts illustrating how well NMF works with marketing data.

(2) Biclustering has the feel of a Rubik's Cube with row and columns being relocated. Though the data matrix is not a cube, the analogy works because one gets the dynamics of moving entire rows and columns all at the same time. In spite of the fact that the following figure is actually from the BlockCluster R package, it illustrates the concept. Panel a contains the data matrix for 10 rows labeled A through J and 7 numbered columns. Panel b reorders the rows (note for instance that row B is move down and row H is moved up). Panel c continues the process by reordering some columns so that they follow the pattern summarized schematically in Panel d. To see how this plays out in practice, I have added this link to a market segmentation study analyzed with the biclust R package.

Now, we return to generative models. (3) Variable selection is a variation on the finite mixture model from the model-based clustering R package mclust. As the name implies, it selects those variables needed to separate the clusters. The goal is to improve performance by removing irrelevant variables. More is better only when the more does not add more noise, in which case, more blurs the distinctions among the clusters. (4) Following the same line of reasoning, sparse clustering relies on a lasso-type penalty (regularization) to select features by assigning zero or near zero weights to the less differentiating variables. The term "sparse" refers to the variable weights and not the data matrix. Both variable selection and spare clustering deal with high dimensionality by reducing the number of variables contributing to the cluster solution.

(5) This brings us to my last category of subspace clustering, which I will introduce using the pop-up toaster as my example. Yes, I am speaking of the electric kitchen appliance with a lever and slots for slices of bread and other toastables ("let go of my eggo"). If you have a number of small children in the family, you might care about safety features, number of slots and speed of heating. On the other hand, an upscale empty nester might be concerned about the brand or how it looks on the counter when they entertain. The two segments reside in different subspaces, each with low dimensionality, but defined by different dimensions. The caring parent must know if the unit automatically shuts off when the toaster falls off the counter. The empty nester never inquires and has no idea.

None of this would be much of an issue if it did not conceal the underlying cluster solution. All measurement adds noise, and noise makes the irrelevant appear to have some minor impact. The higher the data dimensionality, the greater the distortion. Consumers will respond to every question even when asked about the unattended, the inconsequential or the unknown. Random noise is bad; systematic bias is even worst (e.g., social desirability, acquiescence and all the other measurement biases). Sparse clustering pushes the minor effects toward zero. Subspace clustering allows the clusters to have their own factor structures with only a very few intrinsic dimensions (as HDclassif calls it). For one segment the toaster is an appliance that looks good and prepares food to taste. For the other segment it is a way to feed quickly and avoid complaints while not getting anyone injured in the process. These worlds are as incommensurable as ever imagined by Thomas Kuhn.

No comments:

Post a Comment