Thursday, August 6, 2015

Matrix Factorization Comes in Many Flavors: Components, Clusters, Building Blocks and Ideals

Unsupervised learning is covered in Chapter 14 of The Elements of Statistical Learning. Here we learn about several data reduction techniques including principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA). Although on the surface they seem so different, each is a data approximation technique using matrix factorization with different constraints. We can learn a great deal if we compare and contrast these four major forms of matrix factorization.

Robert Tibshirani outlines some of these interconnections in a group of slides from one of his lectures. If there are still questions, Christian Thurau's YouTube video should provide the answers. His talk is titled "Low-Rank Matrix Approximations in Python," yet the only Python you will see is a couple of function calls that look very familiar. R, of course, has many ways of doing K-means and principal component analysis. In addition, I have posts showing how to run nonnegative matrix factorization and archetypal analysis in R.

As a reminder, supervised learning also attempts to approximate the data, in this case the Ys given the Xs. In multivariate multiple regression, we have many dependent variables so that both Y and B are matrices instead of vectors. The usual equation remains Y = XB + E, except that Y, B and E are all matrices with as many rows as the number of observations and as many columns as the number of outcome variables. The error is made as small as possible as we try to reproduce our set of dependent variables as closely as possible from the observed Xs.

K-means and PCA

Without predictors we lose our supervision and are left to search for redundancies or patterns in our Ys without any Xs. We are free to test alternative data generation processes. For example, can variation be explained by the presence of clusters? As shown in the YouTube video and the accompanying slides from the presentation, the data matrix (V) can be reproduced by the product of a cluster membership matrix (W) and a matrix of cluster centroids (H). Each row of W contains all zeros except for a single one that stamps out that cluster profile. With K-means, for instance, cluster membership is all-or-none with each cluster represented by a complete profile of averages calculated across every object in the cluster. The error is the extent that the observations in each grouping differs from their cluster profile.

Principal component analysis works in a similar fashion, but now the rows of W are principal component scores and H holds the principal component loadings. In both PCA and K-means, V = WH but with different constraints on W and H. W is no longer all zeros except for a single one, and H is not a collection of cluster profiles. Instead, H contains the coefficients defining an orthogonal basis for the data cloud with each successive dimension accounting for a decreasing proportion of the total variation, and W tells us how much each dimension contributes to the observed data for every observation.

An early application to intelligence testing serves as a good illustration. Test scores tend to be correlated positively so that all the coefficients in H for the first principal component will be positive. If the tests include more highly intercorrelated verbal or reading scores along with more highly intercorrelated quantitative or math scores, then the second principal component will be bipolar with positive coefficients for verbal variables and negative coefficients for quantitative variables. You should note that the signs can be reversed for any row of H for such reversal only changes direction. Finally, W tells us the impact of each principal component on the observed test scores in data matrix V.

Smart test takers have higher first principal components that uniformly increase all the scores. Those with higher verbal than quantitative skills will also have higher positive values for their second principal component. Given its bipolar coefficients, this will raise the scores on the verbal test and lower the scores on the quantitative tests. And that is how PCA reproduces the observed data matrix.

We can use the R package FactoMineR to plot the features (columns) and objects (rows) in the same space. The same analysis can be performed using the biplot function in R, but FactoMineR offers much more and supports it all with documentation. I have borrowed these two plot from an earlier post, Using Biplots to Map Cluster Solutions.

FactoMineR separates the variables and the individuals in order not to overcrowd the maps. As you can see from the percent contributions of the two dimensions, this is the same space so that you can overlay the two plots (e.g., the red data points are those with the highest projection onto the Floral and Sweetness vectors). One should remember that vector spaces are shown with arrows, and scores on those variables are reproduced as orthogonal projections onto each vector.

The prior post attempted to show the relationship between a cluster and a principal component solution. PCA relies on a "new" dimensional space obtained through linear combinations of the original variables. On the other hand, clusters are a discrete representation. The red points in the above individual factor map are similar because they are of the same type with any differences among these red dots due to error. For example, sweet and sour (medicinal on the plot) are taste types with their own taste buds. However, sweet and sour are perceived as opposites so that the two clusters can be connected using a line with sweet-and-sour tastes located between the extremes. Dimensions always can be reframed as convex combinations of discrete categories, rendering the qualitative-quantitative distinction somewhat less meaningful.

NMF and AA

It may come as no surprise to learn that nonnegative matrix factorization, given it is nonnegative, has the same form with all the elements of V, W, and H constrained to be zero or positive. The result is that W becomes a composition matrix with nonzero values in a row picking the elements of H as parts of the whole being composed. Unlike PCA where H may represent contrasts of positive and negative variable weights, H can only be zero or positive in NMF. As a result, H bundles together variables to form weighted composites.

The columns of W and the rows of H represent the latent feature bundles that are believed to be responsible for the observed data in V. The building blocks are not individual features but weighted bundles of features that serve a common purpose. One might think of the latent bundles using a "tools in the toolbox" metaphor. You can find a detailed description showing each step in the process in a previous post and many examples with the needed R code throughout this blog.

Archetypal analysis is another variation on the matrix factorization theme with the observed data formed as convex combinations of extremes on the hull that surrounds the point cloud of observations. Therefore, the profiles of these extremes or ideals are the rows of H and can be interpreted as representing opposites at the edge of the data cloud. Interpretation seems to come naturally since we tend to think in terms of contrasting ideals (e.g., sweet-sour and liberal-conservative).

This is the picture used in my original archetypal analysis post to illustrate the point cloud, the variables projected as vectors onto the same space, and the locations of the 3 archetypes (A1, A2, A3) compared with the placement of the 3 K-means centroids (K1, K2, K3). The archetypes are positioned as vertices of a triangle spanning the two-dimensional space with every point lying within this simplex. In contrast, the K-means centroids are pulled more toward the center and away from the periphery.
Why So Many Flavors of Matrix Factorization?

We try to make sense of our data by understanding the underlying process that generated that data. Matrix factorization serves us well as a general framework. If every variable was mutually independent of all the rest, we would not require a matrix H to extract latent variables. Moreover, if every latent variable had the same impact for every observation, we would not require a matrix W holding differential contributions. The equation V = WH represents that the observed data arises from two sources: W that can be interpreted as if it were a matrix of latent scores and H that serves as a matrix of latent loadings. H defines the relationship between observed and latent variables. W represents the contributions of the latent variables for every observation. We call this process matrix factorization or matrix decomposition for obvious reasons.

Each of the four matrix factorizations adds some type of constraint in order to obtain a W and H. Each constraint provides a different view of the data matrix. PCA is a variance-maximizers yielding a set of components accounting for the most variation independent of all preceding components. K-means gives us boxes with minimum variation within each box. We get building blocks and individualized rules of assembly from NMF. Finally, AA frames observations as compromises among ideals or archetypes. The data analyst must decide which story best fits their data.

1 comment:

  1. Nice way to show the relationships. The original algorithm for PCA I believe was NIPALS, which is a matrix factorization using alternating least squares, and there are more recently published methods for doing weighted-PCA (or maximum likelihood or total least squares) and weighted non-negative matrix factorization using the same underlying alternating least squares algorithm, with the difference being whether or not orthogonality is imposed on the components extracted (see Wentzell et al, 1997, and Wentzell et al 2006