Each row of our data matrix contains the measurements for a different object, represented by the vector x in the above equation. If all the rows came from a single normal distribution, then we would not need the group subscript k. However, we have a mixture of populations, so each set of measurements comes from one of the K groups with probability given by the Greek letter pi. If we knew which group k had produced an observation, then we would know the mean mu and covariance matrix sigma of the Gaussian distribution that generated it.
The above graphical model attempts to illustrate the entire process using plate notation. That is, the K and the N in the lower right corner of the two boxes indicate that we have chosen not to show all of the K or N different boxes, one for each group and one for each observation, respectively. The arrows represent directed effects, so the group parameters in the box with [K] sit outside the measurement process. Once the group k is known, the corresponding mean and covariance matrix act as inputs to generate one of the i = 1,...,N observations.
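The two-step process in the graphical model (pick a group, then draw measurements from that group's Gaussian) can be sketched in a few lines of Python. This is only an illustration: the function name sample_mixture and every number below are assumptions made up for the example, not estimates from any dataset.

```python
import math
import random

def sample_mixture(n, pi, mus, sigmas, seed=0):
    """Draw n points from a bivariate Gaussian mixture.

    pi     : K mixing probabilities that sum to one
    mus    : K mean vectors [m1, m2]
    sigmas : K covariance matrices [[s11, s12], [s12, s22]]
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Step 1: choose group k with probability pi[k].
        k = rng.choices(range(len(pi)), weights=pi)[0]
        # Step 2: Cholesky factor of the 2x2 covariance matrix.
        s11, s12 = sigmas[k][0]
        s22 = sigmas[k][1][1]
        l11 = math.sqrt(s11)
        l21 = s12 / l11
        l22 = math.sqrt(s22 - l21 * l21)
        # Step 3: turn two independent standard normals into one draw
        # from the Gaussian with mean mus[k] and covariance sigmas[k].
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (mus[k][0] + l11 * z1, mus[k][1] + l21 * z1 + l22 * z2)
        draws.append((k, x))
    return draws

# Hypothetical parameters for K = 2 groups (illustrative values only).
pi = [0.35, 0.65]
mus = [[2.0, 55.0], [4.3, 80.0]]
sigmas = [[[0.07, 0.4], [0.4, 34.0]], [[0.17, 0.9], [0.9, 36.0]]]

draws = sample_mixture(1000, pi, mus, sigmas)
share_second_group = sum(k for k, _ in draws) / len(draws)
print(round(share_second_group, 2))
```

The share of draws from the second group should land near pi[1], which is the sense in which pi gives the probability of group membership.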
This graphical model describes a production process that may be responsible for our data matrix. We must decide on a value for K (the number of clusters) and learn the probabilities for each of the K groups (pi is a vector of K probabilities that sum to one). But we are not done estimating parameters. Each of the K groups has a mean vector and a variance-covariance matrix that must be estimated, and both depend on the number of columns (p) in the data matrix: (1) Kp means and (2) Kp(p+1)/2 variances and covariances. Perhaps we should be concerned that the number of parameters grows quadratically with the number of variables p.
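To see how quickly the count grows, here is a small sketch that adds up the parameters of an unrestricted mixture. The function name is hypothetical, and the K - 1 term for the mixing probabilities is my addition to the two counts listed above: since pi sums to one, only K - 1 of its entries are free.

```python
def gmm_parameter_count(K, p):
    """Free parameters in a K-group Gaussian mixture with full covariances."""
    means = K * p                        # one mean per variable per group
    covariances = K * p * (p + 1) // 2   # symmetric p x p matrix per group
    mixing = K - 1                       # pi sums to one
    return means + covariances + mixing

for p in (2, 10, 50, 100):
    print(p, gmm_parameter_count(K=2, p=p))
```

With K = 2, moving from p = 2 to p = 100 takes the total from 11 to 10,301, almost all of it in the covariance matrices.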
A commonly used example will help us understand the equation and the graphical model. The Old Faithful dataset (faithful in base R, a standard example for the R package mclust) illustrates that eruptions from the geyser can come from one of two sources: the brief eruptions in red with shorter waiting times and the extended eruptions in blue with longer waiting periods. There are two possible sources (K=2), and each source generates a bivariate normal distribution of eruption duration and waiting times (N=number of combined red squares and blue dots). Finally, our value of pi can be estimated from the proportions of red and blue points in the figure.
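If the group labels were known, estimating pi and the group means would reduce to counting and averaging. The sketch below shows that arithmetic on a handful of invented (duration, waiting) pairs; the points, the labels, and the function name are all assumptions for illustration, not values from the Old Faithful data. In practice the labels are unknown, and model-based clustering estimates them together with the parameters.

```python
def estimate_from_labels(points, labels, K):
    """Estimate mixing proportions and group means from labeled points.

    Assumes every one of the K groups has at least one point.
    """
    n = len(points)
    p = len(points[0])
    counts = [0] * K
    sums = [[0.0] * p for _ in range(K)]
    for x, k in zip(points, labels):
        counts[k] += 1
        for j in range(p):
            sums[k][j] += x[j]
    pi_hat = [c / n for c in counts]  # share of points in each group
    mu_hat = [[s / c for s in group_sums]
              for group_sums, c in zip(sums, counts)]
    return pi_hat, mu_hat

# Made-up (duration, waiting) pairs: label 0 = short, label 1 = long.
points = [(1.8, 54), (2.0, 51), (4.5, 82), (4.1, 79), (4.7, 88)]
labels = [0, 0, 1, 1, 1]

pi_hat, mu_hat = estimate_from_labels(points, labels, K=2)
print(pi_hat)
print(mu_hat)
```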
Scalability Issues in High Dimensions
The red and the blue eruptions reside in the same two-dimensional space because both clusters are described by the same two variables, duration and waiting time. This would not be the case with topic modeling, for example, where each topic might be defined by a specific set of anchor words that would separate each topic from the rest. Similarly, if we were to cluster by music preference, we would discover segments with very specific awareness and knowledge of various artists. Again, the music preference groupings would be localized within different subspaces anchored by the more popular artists within each genre. Market baskets appear much the same, with each filled with the staples and then those few items that differentiate among segments (e.g., who buys adult diapers?). In each of these cases, as with feature usage and product familiarity, we are forced to collect information across a wide range of measures because each cluster requires its own set of variables to distinguish itself from the others.
These clusters have been created by powerful forces that are stable over time: major events (e.g., moving out on your own, getting married, buying a house, having a child or retiring) and not-so-major needs (e.g., clothes for work, devices to connect to the internet, or what to buy for dinner). Situational needs and social constraints focus one's attention so that any single individual can become familiar with only a small subset of all that we need to measure in order to construct a complete partition. Your fellow cluster members are others who find themselves in similar circumstances and resolve their conflicts in much the same way.
As a result, the data matrix becomes high dimensional with many columns, but the rows are sparse with only a few columns of any intensity for any particular individual. We can try to extend the mixture model so that we can maintain model-based clustering with high-dimensional data (e.g., subspace clustering using the R package HDclassif). The key is to concentrate on the smaller intrinsic dimensionality responsible for specific cluster differences.
Yet, I would argue that nonnegative matrix factorization (NMF) might offer a more productive approach. This blog is filled with posts demonstrating how well NMF works with marketing data, which is reassuring. More importantly, the NMF decomposition corresponds closely with how products are represented in human cognition and memory and how product information is shared through social interactions and marketing communications.
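For readers who have not seen NMF at work, here is a minimal pure-Python sketch of the textbook multiplicative-update algorithm for approximating a nonnegative matrix V by the product W H. It is a generic illustration rather than the implementation in any particular R package, and the toy matrix, in which two groups of rows use disjoint sets of columns, is invented for the example.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Approximate V (n x m, nonnegative) by W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Start from strictly positive random factors.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W'V) / (W'WH), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (VH') / (WHH'), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy data: rows 0-2 use only columns 0-1, rows 3-4 only columns 2-3.
V = [[2, 4, 0, 0],
     [1, 2, 0, 0],
     [3, 6, 0, 0],
     [0, 0, 3, 1],
     [0, 0, 6, 2]]

W, H = nmf(V, rank=2)
WH = matmul(W, H)
err = sum((v - r) ** 2 for vr, rr in zip(V, WH) for v, r in zip(vr, rr))
print(round(err, 4))
```

Each row of H is a building block defined on a subset of the columns, and each row of W says how much of each block an individual uses, which is the kind of subspace structure described in the preceding paragraphs.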
Human decision making adapts to fit the demands of the problem task. In particular, what works with side-by-side comparisons across a handful of attributes for two or three alternatives in a consideration set will not fill our market baskets or help us select a meal from a long list of menu items. This was Herbert Simon's insight. Consumer segments are formed as individuals come to share a common understanding of what is available and what should be preferred. In order to make a choice, we are required to focus our attention on a subspace of all that is available. NMF mimics this simplification process, yielding interpretable building blocks as we attempt to learn the why of consumption.