Thursday, July 10, 2014

How Much Can We Learn from Top Rankings using Nonnegative Matrix Factorization?

Purchases are choices from available alternatives. Post-purchase, we know what is the most preferred, but all the other options score the same. Regardless of differences in appeal, all the remaining items received the same score of not chosen. A second choice tells us more, as would the alternative selected as third most preferred. As we add top rankings from first to second to the kth choice, we seem to gain more and more information about preferences. Yet, what if we concentrated only on the top performers, what might be called the "pick of the litter" or the "top of the heap" (e.g., top k from J alternatives)? How much can be learn from such partial rankings?

Jan de Leeuw shows us what can be done with a complete ranking. What if we were to take de Leeuw breakfast food dataset and keep only the top-3 rankings so that all we know is what each respondent selected as their first, second and third choices? Everything that you would need to know is contained in the Journal of Statistical Software article by de Leeuw and Mair (see section 6.2). The data come in a matrix with 42 individuals and 15 breakfast foods. I have reproduce his plot below to make the discussion easier to follow. Please note that all the R code can be found at the end of this post.

The numbers running from 1 to 42 represent the location of each individual ranking the 15 different breakfast foods. That is, rows are individuals, columns are foods, and the cells are rankings from 1 to 15 for each row. What would you like for breakfast? Here are 15 breakfast foods, please order them in terms of your preference with "1" being your most preferred food and "15" indicating your least preferred.

The unfolding model locates each respondent's ideal and measures preference as distance from that ideal point. Thus, both rows (individuals) and columns (foods) are points that are positioned in the same space such that the distances between any given row number and the columns have the same ordering as the original data for that row. As a result, you can reproduce (approximately) an individual's preference ordering from the position of their ideal point relative to the foods. Who likes muffins? If you answered, #23 or #34 or #33 or anyone else nearby, then you understand the unfolding map.

Now, suppose that only the top-3 rankings were provided by each respondent. We will keep the rankings for first, second and third and recode everything else to zero. Now, what values should be assigned to the first, second and third picks? Although ranks are not counts, it is customary to simply reverse the ranks so that the weight for first is 3, second is 2, and third is 1. As a result, the rows are no longer unique values of 1 to 15, but instead contain one each of 1, 2 and 3 plus 12 zeroes. We have wiped out 80% of our data. Although there are other approaches for working with partial rankings, I will turn to nonnegative matrix factorization because I want a technique that works well with sparsity, for example, top 3 of 50 foods or top 5 of 100 foods. Specifically, we are seeking a general approach for dealing with any partial ranking that generates sparse data matrices. Nonnegative matrix factorization seems to be up for the task, as demonstrated in a large food consumption study.

We are now ready for the nmf R package as soon as we specify the number of latent variables. I will try to keep it simple. The data matrix is 42 x 15 with each row having 12 zeroes and three entries that are 1, 2 and 3 with 3 as the best (ranking reversed). Everything would be simpler if the observed breakfast food rankings resulted from a few latent consumption types (e.g., sweet-lovers tempted by pastries, the donuts-for-breakfast crowd, the muffin-eaters and the bread-slicers). Then, observed rankings could be accounted for by some combination of these latent types. "Pure Breads" select only toast or hard roll. "Pure Muffins" pick only the three varieties of muffin, though corn muffin may not be considered a real muffin by everyone. Coffee cakes may be its own latent type, and I have idea how nmf will deal with cinnamon toast (remember that the data is at least 40 years old). From these musings one might reasonably try three or four latent variables.

The nonnegative matrix factorization (nmf) was run with four latent variables. The first function argument is the data matrix, followed by the rank or number of latent variables, with the method next, and a number indicating the number of times you want the analysis rerun with different starting points. This last nrun argument works in the same manner as the nstart argument in kmeans. Local minimum can be a problem, so why not restart the nmf function with several different initialization and pick the best solution? The number 10 seemed to work with this data matrix, by which I mean that I obtained similar results each time I reran the function with nrun=10. You will note that I did not set the seed, so that you can try it yourself and see if you get a similar solution.

The coefficient matrix is shown below. The entries have been rescaled to fall along a scale from 0 to 100 for no other reason than it is relative value that is important and marketing research often uses such a thermometer scale. Because I will be interpreting these coefficients as if they were factor loadings, I borrowed the fa.sort() function from the psych R package. Hopefully, this sorting make it easier to see the underlying pattern.

Obviously, these coefficients are not factor loadings, which are correlations between the observed and latent variables. You might want to think of them as if they were coefficients from a principal component analysis. What are these coefficients? You might wish to recall that we are factoring our data matrix into two parts: this coefficient matrix and what is called a basis matrix. The coefficient matrix enables us to name the latent variables by seeing the association between the observed and latent variables. The basis matrix includes a row for every respondent indicating the contribution of each latent variable to their top 3 rankings. I promise that all this will become clearer as we work through this example.

Coffee Cake Muffin Pastry Bread
cofcake 70 1 0 0
cornmuff 2 0 0 2
engmuff 0 38 0 4
bluemuff 2 36 5 0
cintoast 0 7 0 3
danpastry 1 0 100 0
jdonut 0 0 25 0
gdonut 8 0 20 0
cinbun 0 6 20 0
toastmarm 0 0 12 10
toast 0 0 2 0
butoast 0 3 0 51
hrolls 0 0 2 22
toastmarg 0 1 0 14
butoastj 2 0 7 10

These coefficients indicate the relative contribution of each food. The columns are named as one would name a factor or a principal component or any other latent variable. That is, we know what a danish is and a glazed or jelly donut, but we know nothing about the third column except that interest in these three breakfast foods seem to covary together. Pastry seemed like a good, although not especially creative, name. These column names seem to correspond to the different regions in the joint configuration plot derived from the complete rankings. In fact, I borrowed de Leeuw's cluster names from the top of his page 20.

And what about the 42 rows in the basis matrix? The nmf package relies on a heatmap to display the relationship between the individuals and the latent variables.

Interpretation is made easier by the clustering of the respondents along the left side of the heatmap. We are looking for blocks of solid color in each column, for example, the last 11 rows or the 4 rows just above the last 11 rows. The largest block falls toward the middle of the third column associated with pastries, and the first several rows tend to have their largest values in the first column. although most have membership in more than one column. The legend tells us that lighter yellows indicate the lowest association with the column and the darkest reds or browns identify the strongest connection. The dendrogram divides the 42 individuals into the same groupings if you cut the tree at 4 clusters.

The dendrogram also illustrates that some of the rows are combinations of more than one type. The whole, meaning the 42 individuals, can be separated into four "pure" types. A pure type is an individual whose basis vector contains one value very near one and the remaining values very near zero. Everyone is a combination of the pure types or latent variables. Some are all pure types, and some are mixtures of different types. The last 4 rows are a good example of a mixture of muffins and breads (columns 4 and 2).

Finally, I have not compared the location of the respondents on the configuration plot with their color spectrum in the heatmap. There is a correspondence, for example, #37 is near the breads on the plot and in the bread column on the heatmap. And we could continue with #19 into pastries and #33 eating muffins, but we will not since one does not expect complete agreement when the heatmap has collapsed the lower 80% of the rankings. We have our answer to the initial question raised in the title. We can learn a great deal about attraction using only the top rankings. However, we have lost any avoidance information contained in the complete rankings.

So, What Is Nonnegative Matrix Factorization?

I answered this question at the end of a previous post, and it might be helpful for you to review another example. I show in some detail the equation and how the coefficient matrix and the basis matrix combine to yield approximations of the observed data.

What do you want for breakfast? Is it something light and quick, or are you hungry and want something filling? We communicate in food types. A hotel might advertise that their price includes a continental breakfast. Continental breakfast is a food type. Bacon and eggs are not included. This is the structure shaping human behavior that nonnegative matrix factorization attempts to uncover. There were enough respondents who wanted only the foods from each of the four columns that we were able to extract four breakfast food types. These latent variables are additive so that a respondent can select according to their own individual proportions how much they want the foods from each column.

Nonnegative matrix factorization will succeed to the extent that preferences are organized as additive groupings of observed choices. I would argue that a good deal of consumption is structured by goals and that these latent variables reflect goal-derived categories. We observe the selections made by individuals and infer their motivation. Those inferences are the columns of our coefficient matrix, and the rows of the heatmap tell us how much each respondent relies on those inferred latent constructs when making their selections.

R code needed to recreate all the tables and plots:

res <- smacofRect(breakfast)
plot(res, plot.type = "confplot")
apply(breakfast, 2, table)
apply(partial_rank, 2, table)
fit<-nmf(partial_rank, 4, "lee", nrun=10)

Created by Pretty R at


  1. Excellent post! Can you explain the last paragraph in a bit more depth? I didn't understand the concept of additive grouping of observed choices! Also can nmf then be used for lets say problems like rating movies/services where the options can be varied but getting it narrowed to Top 5 would help understand the preferences of the consumer?

    1. Thanks for the positive feedback and the good question. In my previous post, Identifying Pathways in the Consumer Decision Journey: Nonnegative Matrix Factorization, I provided a link to the Seung and Lee 1999 article "Learning the Parts of Objects by NMF". Their example of how images of faces can be decomposed into parts ought to help explain the point I was trying to make. I plan to publish another post soon that will elaborate on how benefits are the building blocks for the construction of feature preference (e.g., muffin = filling + tasty + not too sweet).

      In response to your second question, NMF can be used with any nonnegative data matrix, especially counts but also any intensity measure with a zero indicating absence. Because it handles sparse data well, it is useful with data set that have lots of zeroes (e.g., touchpoints or movie ratings or top of the heap rankings). One can be creative. Start with a consideration set and allocate ten points across the objects in the set. However, one wants to be careful to embed the data collection within realistic marketing contexts so that we are measuring what occurs rather than playing some game that cannot be generalized.

  2. great post! can you attach the breakfast dataset to recreate de code?

    1. The breakfast data comes with the smacof package. When you install smacof in R, the breakfast data will be available. I should have mentioned in the post that smacof must be installed. Thanks for pointing this out.

  3. Dear Joel,
    great post.
    In your post you selected four factors. Is there a common way to decide between solutions with different numbers of factors (like e.g. scree plot in standard FA)?

    1. One is searching for a meaningful decomposition that "explains" the data. These four make sense to me. Scree plots can be useful, but my goal is not to reproduce this particular data set or any other data set collected using some other methodology. That is, as one adds latent variables (e.g., factors, dimensions in MDS, or clusters), the residual variation is reduced and this observed data gathered in a particular way is reproduced more accurately.

      Instead, we are looking for components that will generalize over data collection procedure. Bread, Pastry and Muffin work for me as a means for organizing breakfast preferences (with coffee cake by itself). However, I would have preferred a data set with more alternatives that might more accurately reflect all the options available for breakfast.