Thursday, August 6, 2015

Matrix Factorization Comes in Many Flavors: Components, Clusters, Building Blocks and Ideals

Unsupervised learning is covered in Chapter 14 of The Elements of Statistical Learning. Here we learn about several data reduction techniques including principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA). Although on the surface they seem so different, each is a data approximation technique using matrix factorization with different constraints. We can learn a great deal if we compare and contrast these four major forms of matrix factorization.

Robert Tibshirani outlines some of these interconnections in a group of slides from one of his lectures. If there are still questions, Christian Thurau's YouTube video should provide the answers. His talk is titled "Low-Rank Matrix Approximations in Python," yet the only Python you will see is a couple of function calls that look very familiar. R, of course, has many ways of doing K-means and principal component analysis. In addition, I have posts showing how to run nonnegative matrix factorization and archetypal analysis in R.

As a reminder, supervised learning also attempts to approximate the data, in this case the Ys given the Xs. In multivariate multiple regression we have many dependent variables, so both Y and B are matrices rather than vectors. The usual equation remains Y = XB + E, except that Y and E now have as many rows as the number of observations and as many columns as the number of outcome variables, while B has a row for each predictor and a column for each outcome. The error E is made as small as possible as we try to reproduce our set of dependent variables as closely as possible from the observed Xs.
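To make the matrix shapes concrete, here is a minimal numpy sketch of multivariate multiple regression (the dimensions and variable names are illustrative only, not from any dataset in the post):

```python
import numpy as np

# Multivariate multiple regression: Y = XB + E, where Y is n x q,
# X is n x p, and B is p x q. Toy dimensions for illustration.
rng = np.random.default_rng(0)
n, p, q = 100, 3, 2
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, q))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

# Least squares regresses every column of Y on X at once,
# making the error E = Y - XB as small as possible.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
E = Y - X @ B_hat
print(B_hat.shape)  # (3, 2): one column of coefficients per outcome
```

Fitting each outcome separately would give the same coefficients; stacking them simply writes the q regressions as one matrix equation.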

K-means and PCA

Without predictors we lose our supervision and are left to search for redundancies or patterns in our Ys without any Xs. We are free to test alternative data generation processes. For example, can variation be explained by the presence of clusters? As shown in the YouTube video and the accompanying slides from the presentation, the data matrix (V) can be reproduced by the product of a cluster membership matrix (W) and a matrix of cluster centroids (H). Each row of W contains all zeros except for a single one that stamps out that cluster's profile. With K-means, for instance, cluster membership is all-or-none, with each cluster represented by a complete profile of averages calculated across every object in the cluster. The error is the extent to which the observations in each grouping differ from their cluster profile.
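The factorization view of K-means can be sketched in a few lines of numpy. This is a bare-bones Lloyd's algorithm on made-up two-cluster data, written only to show that V is approximated by a one-hot membership matrix W times a centroid matrix H:

```python
import numpy as np

# Two well-separated blobs; V is the n x p data matrix of the post.
rng = np.random.default_rng(1)
V = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])
k = 2
H = V[[0, 20]].copy()                   # one initial centroid per blob
for _ in range(10):
    d = ((V[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)           # all-or-none assignment
    H = np.vstack([V[labels == j].mean(axis=0) for j in range(k)])

# Each row of W is all zeros except a single one, stamping out
# that cluster's centroid profile from H.
W = np.eye(k)[labels]
error = np.linalg.norm(V - W @ H) ** 2  # within-cluster variation
```

The squared error here is exactly the within-cluster sum of squares that K-means minimizes.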

Principal component analysis works in a similar fashion, but now the rows of W are principal component scores and H holds the principal component loadings. In both PCA and K-means, V = WH but with different constraints on W and H. W is no longer all zeros except for a single one, and H is not a collection of cluster profiles. Instead, H contains the coefficients defining an orthogonal basis for the data cloud with each successive dimension accounting for a decreasing proportion of the total variation, and W tells us how much each dimension contributes to the observed data for every observation.
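The same V = WH form for PCA falls out of the singular value decomposition. A short numpy sketch on made-up data (the variable names follow the post, not any package):

```python
import numpy as np

# Centered data V (n x p) factors as scores W times loadings H.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
V = X - X.mean(axis=0)

U, s, H = np.linalg.svd(V, full_matrices=False)
W = U * s                    # principal component scores
# Rows of H form an orthogonal basis; V is reconstructed exactly by WH,
# and the singular values s come out in decreasing order of variation.
print(np.allclose(V, W @ H))            # True
print(np.allclose(H @ H.T, np.eye(4)))  # True

# Keeping only the first two components gives the best rank-2 fit.
V2 = W[:, :2] @ H[:2]
```

Truncating W and H is what turns the exact factorization into a data reduction: the discarded rows of H account for the least variation.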

An early application to intelligence testing serves as a good illustration. Test scores tend to be positively correlated, so all the coefficients in H for the first principal component will be positive. If the tests include more highly intercorrelated verbal or reading scores along with more highly intercorrelated quantitative or math scores, then the second principal component will be bipolar, with positive coefficients for the verbal variables and negative coefficients for the quantitative variables. You should note that the signs in any row of H can be reversed, since such a reversal only changes the direction of the dimension. Finally, W tells us the impact of each principal component on the observed test scores in the data matrix V.

Smart test takers have higher scores on the first principal component, which uniformly increases all the test scores. Those with higher verbal than quantitative skills will also have higher positive values for their second principal component. Given its bipolar coefficients, this raises the scores on the verbal tests and lowers the scores on the quantitative tests. And that is how PCA reproduces the observed data matrix.

We can use the R package FactoMineR to plot the features (columns) and objects (rows) in the same space. The same analysis can be performed using the biplot function in R, but FactoMineR offers much more and supports it all with documentation. I have borrowed these two plots from an earlier post, Using Biplots to Map Cluster Solutions.

FactoMineR separates the variables and the individuals in order not to overcrowd the maps. As you can see from the percent contributions of the two dimensions, both plots show the same space, so you can overlay them (e.g., the red data points are those with the highest projections onto the Floral and Sweetness vectors). One should remember that variables are shown as arrows, and scores on those variables are recovered as orthogonal projections onto each vector.

The prior post attempted to show the relationship between a cluster and a principal component solution. PCA relies on a "new" dimensional space obtained through linear combinations of the original variables. Clusters, on the other hand, are a discrete representation. The red points in the above individual factor map are similar because they are of the same type, with any differences among these red dots due to error. For example, sweet and sour (medicinal on the plot) are taste types with their own taste buds. However, sweet and sour are perceived as opposites, so the two clusters can be connected by a line with sweet-and-sour tastes located between the extremes. Dimensions can always be reframed as convex combinations of discrete categories, rendering the qualitative-quantitative distinction somewhat less meaningful.

NMF and AA

It may come as no surprise to learn that nonnegative matrix factorization, as its name suggests, has the same form, with all the elements of V, W, and H constrained to be zero or positive. The result is that W becomes a composition matrix, with the nonzero values in a row picking out the elements of H as parts of the whole being composed. Unlike PCA, where H may represent contrasts of positive and negative variable weights, H can only be zero or positive in NMF. As a result, H bundles together variables to form weighted composites.

The columns of W and the rows of H represent the latent feature bundles that are believed to be responsible for the observed data in V. The building blocks are not individual features but weighted bundles of features that serve a common purpose. One might think of the latent bundles using a "tools in the toolbox" metaphor. You can find a detailed description showing each step in the process in a previous post and many examples with the needed R code throughout this blog.
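The R packages discussed in those posts handle NMF properly; as a sketch of the idea only, here are the Lee-Seung multiplicative updates in numpy on random made-up data. The updates multiply by ratios of nonnegative matrices, which is what keeps W and H nonnegative throughout:

```python
import numpy as np

# Bare-bones NMF: nonnegative V is approximated by nonnegative W times
# nonnegative H via multiplicative updates. Illustrative sketch only.
rng = np.random.default_rng(3)
V = rng.random(size=(30, 8))
k = 3
W = rng.random(size=(30, k))
H = rng.random(size=(k, 8))
eps = 1e-9                                 # guards against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update keeps H nonnegative
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update keeps W nonnegative

# Rows of W pick and weight the feature bundles stored in the rows of H.
assert (W >= 0).all() and (H >= 0).all()
```

Because no subtraction ever occurs, parts can only be added together, which is why NMF yields additive building blocks rather than bipolar contrasts.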

Archetypal analysis is another variation on the matrix factorization theme, with the observed data formed as convex combinations of extremes on the convex hull that surrounds the point cloud of observations. Therefore, the profiles of these extremes or ideals are the rows of H and can be interpreted as representing opposites at the edge of the data cloud. Interpretation seems to come naturally since we tend to think in terms of contrasting ideals (e.g., sweet-sour and liberal-conservative).
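The convex-combination constraint is the whole story: each row of W is nonnegative and sums to one. A tiny numpy sketch makes the geometry visible (the three archetypes here are made up for illustration; real archetypal analysis estimates them from the data, as the R code in my earlier post does):

```python
import numpy as np

# In archetypal analysis, V = WH with each row of W a set of convex
# weights: observations are mixtures of the archetypes in H.
H = np.array([[0.0, 0.0],
              [4.0, 0.0],
              [2.0, 3.0]])               # rows: archetypes A1, A2, A3
rng = np.random.default_rng(4)
W = rng.dirichlet(np.ones(3), size=10)   # nonnegative rows summing to 1
V = W @ H                                # points fall inside the triangle

# Every reconstructed point stays within the simplex spanned by H.
assert np.allclose(W.sum(axis=1), 1) and (W >= 0).all()
```

Compare this with K-means, where each row of W puts all of its weight on a single archetype-like centroid; relaxing that all-or-none rule to convex weights is what moves the anchors from the center of each cluster out to the vertices.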

This is the picture used in my original archetypal analysis post to illustrate the point cloud, the variables projected as vectors onto the same space, and the locations of the 3 archetypes (A1, A2, A3) compared with the placement of the 3 K-means centroids (K1, K2, K3). The archetypes are positioned as vertices of a triangle spanning the two-dimensional space with every point lying within this simplex. In contrast, the K-means centroids are pulled more toward the center and away from the periphery.

Why So Many Flavors of Matrix Factorization?

We try to make sense of our data by understanding the underlying process that generated that data. Matrix factorization serves us well as a general framework. If every variable were mutually independent of all the rest, we would not require a matrix H to extract latent variables. Moreover, if every latent variable had the same impact for every observation, we would not require a matrix W holding differential contributions. The equation V = WH expresses that the observed data arise from two sources: W, which can be interpreted as if it were a matrix of latent scores, and H, which serves as a matrix of latent loadings. H defines the relationship between the observed and latent variables; W represents the contributions of the latent variables for every observation. We call this process matrix factorization or matrix decomposition for obvious reasons.

Each of the four matrix factorizations adds some type of constraint in order to obtain a W and H. Each constraint provides a different view of the data matrix. PCA is a variance maximizer yielding a set of components, each accounting for the most variation while remaining independent of all preceding components. K-means gives us boxes with minimum variation within each box. We get building blocks and individualized rules of assembly from NMF. Finally, AA frames observations as compromises among ideals or archetypes. The data analyst must decide which story best fits their data.

Monday, August 3, 2015

Sensemaking in R: A Plenitude of Models Makes for Good Storytelling

"Sensemaking is a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively."
Making Sense of Sensemaking 1 (2006)

Story #1: A Tale of Causal Links

A causal model can serve as a sensemaking tool. I have reproduced below a path diagram from an earlier post organizing a set of customer ratings based on their hypothesized causes and effects. As shown on the right side of the graph, satisfaction comes first and loyalty follows with input from image and complaints. Value and Quality perceptions are positioned as drivers of satisfaction. Image seems to be separated from product experience and causally prior. Of course, you are free to disagree with the proposed causal structure. All I ask is that you "see" how such a path diagram can be imposed on observed data in order to connect the components and predict the impact of marketing interventions.

Actually, the nodes are latent variables, and I have not drawn in the measurement model. The typical customer satisfaction questionnaire has many items tapping each construct. In my previous post referenced above, I borrowed the mobile phone dataset from the R package semPLS, where loyalty was assessed with three ratings: continued usage, switching to a lower-priced competitor, and likelihood to recommend. These items are seen as indicators of commitment and attachment, and their intercorrelations are due to their common cause, which we have labeled Loyalty.

Where Do Causal Models Come From? The data were collected at one point in time, but it is difficult not to impose a learning sequence on the ratings. That is, the analyst overlays the formation process onto the data as if the measurements were made as learning occurred. Brand image is believed to be acquired first, and expectations are thought to be formed before the purchase is made. Product experience is understood to come next in the sequence, followed by an evaluation and finally the loyalty decisions to continue using and to recommend to others.

As I argued in the prior post, causation is not in the data because the ratings were not gathered over time. By the time the questionnaire is seen, dissonance has already worked its way backward, creating consistencies in the ratings. For instance, when switching is a chore, satisfaction and product perceptions are all higher than they would have been had changing providers been an easier task. In a similar manner, recommending reluctantly and only when pressed for your opinion may reverse the direction of the arrows and at least temporarily raise all the ratings. We shall see in the next section how ratings are interconnected by a network of consumer inferences reflecting not observed covariation but belief and semantics.

Story #2: Living on a One-Dimensional Love-Hate Manifold (Halo Effects)

Our first sensemaking tool, structural equation modeling, was shaped by an intricate plot with many characters playing fixed causal roles. Few believe that this is the only way to make sense of the connections among the different ratings. For some, including myself, the causal model seems a bit too rational. What happened to affect? Halo effects are thought of as a cognitive bias, but all summaries introduce bias measured by the variation about the centroid. In the case of customer satisfaction and loyalty, a pointer on a single evaluative dimension can reproduce all the ratings. You tell me that you are very satisfied with your mobile phone provider, and I can predict that you are not dropping a lot of calls.

The halo effect functions as a form of data comprehension. We learn what constitutes a "good" product or service before we buy. These are the well-formed performance expectations that serve as the tests for deciding satisfaction. We are upset when the basic functions that are must-haves are not delivered (e.g., failure of our mobile phone to pair with the car's Bluetooth), and we are delighted when extras are included that we did not expect (e.g., responsive customer support). Most of these expectations lie just below awareness until experienced (e.g., breakage and repair costs when the phone is dropped a short distance or onto a relatively soft surface).

This representation orders features and services as milestones along a single dimension so that one can read overall satisfaction from one's position along this path. You may be familiar with such sensemaking tools from achievement testing, where spelling ability is assessed by the difficulty of the words one can spell, or from political ideology, where a legislator's position along the liberal-conservative continuum depends on the bills voted for and against. In the same way, I evaluate brands and their products by the features and services they are able to provide. We simply reanalyze the same customer satisfaction rating data. The graded response model from the ltm R package will order both customers and the rating items along the same latent satisfaction dimension, as shown in my post Item Response Modeling of Customer Satisfaction.

Perhaps you noticed that we have changed our perspective or shifted to a new paradigm. Feature ratings are no longer drivers of satisfaction; instead, they have become indicators of satisfaction. In Story #1, a Tale of Causal Links, the arrows go from the features to satisfaction and loyalty. Driver analysis accumulates satisfaction feature by feature, with each adding a component to the overall reservoir of goodwill. However, in Story #2 all the ratings (features, satisfaction, and loyalty) fall along the same evaluative continuum from rage to praise. We can still display the interrelationships with a diagram, though we need to drop the arrows because everything is interconnected in this network.

The manifold from Story #2 makes sense of the data by ranking features based on performance expectations. Some features and services are basic, and every product scores well on them. The premium features and services, on the other hand, are those not provided by every product. Customers decide what they want and are willing to pay, and then they assess the ability of the purchased product to deliver. This is not a driver analysis because the assessment of each component is not independent of the other components.

Those of us willing to live with the imperfections of our current product tend to rate the product higher in a backward adjustment from loyalty to feature performance. You do something similar when you determine that switching is useless because all the competitors are the same. Can I alter your perceptions by tempting you with a $100 bonus or a free month of service to recommend a friend? It's a network of jointly determined nodes with a directionality represented by the love-hate manifold. The ability to generate satisfaction or engender loyalty is but another node, different from product quality perceptions, yet still part of the network.

How else can you explain how randomly attaching a higher price to a bottle of wine yields higher ratings for taste? Price changes consumer perceptions of quality because consumers make inferences about uncertain features based on what they know about more familiar ones. When asked about customer support, you can answer even if you have never contacted or used it. You simply fill in a rating with an inference from other features with which you are more familiar, or you assume it must be good or bad because you are happy or unhappy overall. Such a network analysis can be done with R, as can the driver analysis from our first story.