Friday, June 13, 2014

Identifying Pathways in the Consumer Decision Journey: Nonnegative Matrix Factorization

The Internet has freed us from the shackles of the yellow page directory, the trip to the nearby store to learn what is available, and the forced choice among a limited set of alternatives. The consumer is in control of their purchase journey and can take any path they wish. But do they? It's a lot of work for our machete-wielding consumer cutting their way through the product jungle. The consumer decision journey is not an itinerary, but neither is it aimless meandering. Perhaps we do not wish to follow the well-worn trail laid out by some marketing department. The consumer is free to improvise, not by going where no one has gone before, but by creating personal variation using a common set of journey components shared with others.

Even with all the different ways to learn about products and services, we find constraints on the purchase process with some touchpoint combinations more likely than others. For example, one could generate a long list of all the possible touchpoints that might trigger interest, provide information, make recommendations, and enable purchase. Yet, we would expect any individual consumer to encounter only a small proportion of this long list. A common journey might be no more than seeing an ad followed by a trip to a store. For frequently purchased products, the entire discovery-learning-comparing-purchase process could collapse into a single point-of-sale (PoS) touchpoint, such as product packaging on a retail shelf.

The figure below comes from a touchpoint management article discussing the new challenges of online marketing. This example was selected because it illustrates how easy it is to generate touchpoints as we think of all the ways that a consumer interacts with or learns about a product. Moreover, we could have been much more detailed because episodic memory allows us to relive the product experience (e.g., the specific ads seen, the packaging information attended to, the pages of the website visited). The touchpoint list quickly gets lengthy, and the data matrix becomes sparser because an individual consumer is not likely to engage intensively with many products. The resulting checklist dataset is a high-dimensional consumer-by-touchpoint matrix with lots of columns and cells containing some ones but mostly zeroes.

It seems natural to subdivide the columns into separate modes of interaction as shown by the coloring in the above figure (e.g., POS, One-to-One, Indirect, and Mass). It seems natural because different consumers rely on different modes to learn and interact with product categories. Do you buy by going to the store and selecting the best available product, or do you search and order online without any physical contact with people or product? Like a Rubik's cube, we might be able to sort rows and columns simultaneously so that the reordered matrix would appear to be block diagonal with most of the ones within the blocks and most of the zeroes outside. You can find an illustration in a previous post on the reorderable data matrix. As we shall see later, nonnegative matrix factorization "reorders" indirectly by excluding negative entries in the data matrix and its factors. A more direct approach to reordering would use the R packages for biclustering or seriation. Both of these links offer different perspectives on how to cluster or order rows and columns simultaneously.

Nonnegative Matrix Factorization (NMF) with Simulated Data

I intend to rely on the R package NMF and a simulated data set based on the above figure. I will keep it simple and assume only two pathways: an online journey through the 10 touchpoints marked with an "@" in the above figure and an offline journey through the remaining 20 touchpoints. Clearly, consumers are more likely to encounter some touchpoints more often than others, so I have made some reasonable but arbitrary choices. The R code at the end of this post reveals the choices that were made and how I generated the data using the sim.rasch function from the psych R package. Actually, all you need to know is that the dataset contains 400 consumers, 200 interacting more frequently online and 200 with greater offline contact. I have sorted the 30 touchpoints from the above figure so that the first 10 are online (e.g., search engine, website, user forum) and the last 20 are offline (e.g., packaging information, ad in magazine, information display). Although the patterns within each set of online and offline touchpoints are similar, the result is two clearly different pathways as shown by the following plot.

It should be noted that the 400 x 30 data matrix contained mostly zeroes with only 11.2% of the 12,000 cells indicating any contact. Seven of the respondents did not indicate any interaction at all and were removed from the analysis. The mode was 3 touchpoints per consumer, and no one reported more than 11 interactions (although the verb "reported" might not be appropriate to describe simulated data).
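For readers who want to see what such a sparse checklist looks like, here is a minimal sketch in base R. It does not use sim.rasch, and the touchpoint probabilities are my own illustrative assumptions rather than the values behind the numbers quoted above, so the summaries will be similar in spirit but not identical.

```r
# Hypothetical stand-in for the simulated data: the probabilities below are
# illustrative assumptions, not the values used in the original analysis.
set.seed(1)
n_per_seg <- 200
p_offline <- c(rep(0.03, 10), rep(0.15, 20))  # offline segment favors the last 20
p_online  <- c(rep(0.30, 10), rep(0.03, 20))  # online segment favors the first 10

sim_segment <- function(n, p) {
  # one row per consumer, one Bernoulli draw per touchpoint (column j uses p[j])
  matrix(rbinom(n * length(p), size = 1, prob = rep(p, each = n)),
         nrow = n, ncol = length(p))
}

tp <- rbind(sim_segment(n_per_seg, p_offline),
            sim_segment(n_per_seg, p_online))

density  <- mean(tp)            # share of cells equal to one
contacts <- rowSums(tp)         # number of touchpoints per consumer
n_empty  <- sum(contacts == 0)  # consumers reporting no interaction at all
tp_used  <- tp[contacts > 0, , drop = FALSE]  # rows kept for the analysis

round(100 * density, 1)  # percent of cells with a contact
table(contacts)          # distribution of touchpoints per consumer
dim(tp_used)             # respondents remaining after dropping empty rows
```

Dropping the all-zero rows matters later because a respondent with no contacts carries no information about either pathway.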

If all I had was the 400 respondents, how would I identify the two pathways? Actually, k-means often does quite well, but not in this case with so many infrequent binary variables. Dolnicar and her colleagues, although they use the earlier mentioned biclustering approach in R, will help us understand the problems encountered when conducting market segmentation with high-dimensional data. When asked to separate the 400 into two groups, k-means clustering was able to identify correctly only 55.5% of the respondents. Before we overgeneralize, let me note that k-means performed much better when the proportions were higher (e.g., raise both lines so that they peak above 0.5 instead of below 0.4), although that is not much help with high-dimensional sparse data.
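The k-means comparison can be sketched as follows. This is a self-contained illustration in base R with made-up touchpoint probabilities (my assumption, not the post's sim.rasch settings), so the accuracy it prints will differ from the 55.5% reported above. The point is the scoring step: cluster numbers are arbitrary, so both labelings have to be checked.

```r
# Illustrative data only: these touchpoint probabilities are assumptions,
# not the values behind the figures in this post.
set.seed(2)
sim_segment <- function(n, p) {
  matrix(rbinom(n * length(p), 1, rep(p, each = n)), nrow = n)
}
tp <- rbind(sim_segment(200, c(rep(0.03, 10), rep(0.15, 20))),  # offline
            sim_segment(200, c(rep(0.30, 10), rep(0.03, 20))))  # online
truth <- rep(1:2, each = 200)   # 1 = offline, 2 = online
keep  <- rowSums(tp) > 0        # drop respondents with no contacts

km <- kmeans(tp[keep, ], centers = 2, nstart = 25)

# Cluster numbers are arbitrary, so score both possible labelings
tab <- table(segment = truth[keep], cluster = km$cluster)
accuracy <- max(sum(diag(tab)), sum(tab) - sum(diag(tab))) / sum(tab)

tab
round(100 * accuracy, 1)
```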

And, what about NMF? I will start with the results so that you will be motivated to remain for the explanation in the next section. Overall, NMF placed correctly 81.4% of the respondents, 85.9% of the offline segment and 76.9% of the online segment. In addition, NMF extracted two latent variables that separated the 30 touchpoints into the two sets of 10 online and 20 offline interactions.

So, what is nonnegative matrix factorization?

Have you run or interpreted a factor analysis? Factor analysis is matrix factorization where the correlation matrix R is factored into factor loadings: R = FF'. Structural equation modeling is another example of matrix factorization, where we add direct and indirect paths between the latent variables to the factor model connecting observed and latent variables. However, unlike the two previous models that factor the correlation or variance-covariance matrix among the observed variables, NMF attempts to decompose the actual data matrix.

Wikipedia uses the following diagram to show this decomposition or factorization:

The matrix V is our data matrix with 400 respondents by 30 touchpoints. A factorization simplifies V by reducing the number of columns from the 30 observed touchpoints to some smaller number of latent or hidden variables (e.g., two in our case since we have two pathways). We need to rotate the H matrix by 90 degrees so that it is easier to read, that is, 2x30 to 30x2. We do this by taking the transpose, which in R code is t(H).
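To make the shapes concrete, here is a minimal sketch of the factorization V ≈ WH in base R. It uses the standard multiplicative updates for squared error rather than the NMF package called in the code for this post, and the function name nmf_mu and the simulated probabilities are hypothetical choices of mine. Treat it as an illustration of the idea, not a replacement for the package.

```r
# Toy data: illustrative probabilities, not the values used in this post.
set.seed(3)
sim_segment <- function(n, p) {
  matrix(rbinom(n * length(p), 1, rep(p, each = n)), nrow = n)
}
V <- rbind(sim_segment(200, c(rep(0.03, 10), rep(0.15, 20))),
           sim_segment(200, c(rep(0.30, 10), rep(0.03, 20))))
V <- V[rowSums(V) > 0, ]   # remove respondents with no contacts

# Hypothetical helper: multiplicative updates that keep W and H nonnegative
nmf_mu <- function(V, rank, n_iter = 200, eps = 1e-9) {
  W <- matrix(runif(nrow(V) * rank), nrow(V), rank)
  H <- matrix(runif(rank * ncol(V)), rank, ncol(V))
  for (i in seq_len(n_iter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

fit <- nmf_mu(V, rank = 2)
dim(fit$W)                  # respondents x 2 latent variables
dim(fit$H)                  # 2 latent variables x 30 touchpoints
round(t(fit$H), 3)          # transpose: 30 x 2, read like factor loadings
sqrt(sum((V - fit$W %*% fit$H)^2))  # Frobenius reconstruction error
```

Because every update multiplies by a ratio of nonnegative quantities, W and H never become negative as long as they start out nonnegative.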

Search engine
Price comparison
Hint from Expert
User forum
Banner or Pop-up
E-mail request
Packaging information
PoS promotion
Recommendation friends
Show window
Information at counter
Advertising entrance
Editorial newspaper
Consumer magazine
Ad in magazine
Personal advice
Information screen
Information display
Customer magazine
Catalog loyalty program
Offer loyalty card
Service hotline

As shown above, I have labeled the columns to reflect their largest coefficients in the same way that one would name a factor in terms of its largest loadings. To continue with the analogy to factor analysis, the touchpoints in V are observed, but the columns of W and the rows of H are latent and named using their relationship to the touchpoints. Can we call these latent variables "parts," as Lee and Seung did in their 1999 article "Learning the Parts of Objects by Non-negative Matrix Factorization"? The answer depends on how much overlap between the columns you are willing to accept. When each row of the transposed H contains only one large positive value and the remaining columns for that row are zero (e.g., Website in the third row), we can speak of latent parts in the sense that adding columns does not change the impact of previous columns but simply adds something new to the V approximation.

So in what sense is online or offline behavior a component or a part? There are 30 touchpoints. Why are there not 30 components? In this context, a component is a collection of touchpoints that vary together as a unit. We simulated the data using two different likelihood profiles. The argument called d in the sim.rasch function (see the R code at the end of this post) contains 30 values controlling the likelihood that the 30 touchpoints will be assigned a one. Smaller values of d result in higher probabilities that the touchpoint interaction will occur. The coefficients in each latent variable of H reflect those d values and constitute a component because the touchpoints vary together for 200 individuals. Put another way, the whole with 400 respondents contains two parts of 200 respondents each and each with its own response generation process.
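The link between d and the chance of a one can be written out directly. The sketch below assumes the standard Rasch form, P(x = 1) = 1 / (1 + exp(d - theta)), with every respondent at theta = -0.5 because the simulation sets mu = -0.5 and sd = 0. The d values shown are hypothetical examples, not the ones used to generate the data.

```r
theta <- -0.5            # assumed common latent score (mu = -0.5, sd = 0)
d <- c(0, 1, 2, 3, 4)    # hypothetical difficulty values, not the post's

# Rasch response probability: smaller d means a more likely touchpoint
p <- plogis(theta - d)   # same as 1 / (1 + exp(d - theta))

round(data.frame(d = d, probability = p), 3)
```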

The one remaining matrix, W, must be of size 400x2 (# respondents times # latent variables). So, we have 800 entries in W and 60 cells in H compared to the 12,000 observed values in V. W has one row for each respondent. Here are the rows of W for the 200th and 201st respondents, which is the dividing line between the two segments:
200 0.00015 0.00546
201 0.01218 0.00038
The numbers are small because we are factoring a data matrix of zeroes and ones. But the ratios of these two numbers are sizeable. The 200th respondent has an offline latent score (0.00546) more than 36 times its online latent score (0.00015), and the ratio for the 201st respondent is more than 32 in the other direction with online dominating. Finally, in order to visualize the entire W matrix for all respondents, the NMF package will produce heatmaps like the following with the R code basismap(fit, Rowv=NA).
As before, the first column represents online and the second points to offline. The first 200 rows are offline respondents, our original Segment 1 (labeled basis 2), and the last 200, our original Segment 2, were generated using the online response pattern (labeled basis 1). This type of relabeling or renumbering occurs over and over again in cluster analysis, so we must learn to live with it. To avoid confusion, I will repeat myself and be explicit.

Basis 2 is our original Segment 1 (Offliners).
Basis 1 is our original Segment 2 (Onliners).

As mentioned earlier, Segment 1 offline respondents had a higher classification accuracy (85.9% vs. 76.9%). This is shown by the more solid and darker red lines for the first 200 offline respondents in the second basis 2 column.
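The ratios quoted earlier for respondents 200 and 201, and the hard segment assignment they imply, can be checked with a few lines of R. The two rows of W are copied from above; assigning each respondent to the column with the larger score is one simple rule, and I am assuming it here for illustration.

```r
# Rows of W for respondents 200 and 201, as reported in the post
W_rows <- rbind("200" = c(0.00015, 0.00546),
                "201" = c(0.01218, 0.00038))
colnames(W_rows) <- c("online", "offline")

ratio_200 <- W_rows["200", "offline"] / W_rows["200", "online"]   # about 36.4
ratio_201 <- W_rows["201", "online"]  / W_rows["201", "offline"]  # about 32.1

# Assumed rule: assign each respondent to the latent variable with the larger score
segment <- colnames(W_rows)[apply(W_rows, 1, which.max)]
data.frame(respondent = rownames(W_rows), segment = segment)
```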

Consumer Improvisation Might Be Somewhat More Complicated

Introducing only two segments with predominantly online or offline product interactions was a simplification necessary to guide the reader through an illustrative example. Obviously, the consumer has many more components that they can piece together on their journey. However, the building blocks are not individual touchpoints but sets of touchpoints that are linked together and operate as a unit. For example, visiting a brand website creates opportunities for many different micro-journeys over many possible links on each page. Recurring website micro-journeys experienced by several consumers would be identified as latent components in our NMF analysis. At least, this is what I have found using NMF with touchpoint checklists from marketing research questionnaires.

R Code to Reproduce All the Analysis in this Post
offline<-sim.rasch(nvar=30, n=200, mu=-0.5, sd=0,
online<-sim.rasch(nvar=30, n=200,  mu=-0.5, sd=0,
names(tp)<-c("Search engine",
             "Price comparison",
             "Hint from Expert",
             "User forum",
             "Banner or Pop-up",
             "E-mail request",
             "Packaging information",
             "PoS promotion",
             "Recommendation friends",
             "Show window",
             "Information at counter",
             "Advertising entrance",
             "Editorial newspaper",
             "Consumer magazine",
             "Ad in magazine",
             "Personal advice",
             "Information screen",
             "Information display",
             "Customer magazine",
             "Catalog loyalty program",
             "Offer loyalty card",
             "Service hotline")
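segment<-c(rep(1,200), rep(2,200))   # assumed: 1 = offline (first 200 rows), 2 = online
seg_profile<-t(aggregate(tp, by=list(segment), FUN=mean))
# row 1 of seg_profile is the grouping column; rows 2:31 hold the 30 touchpoint means
plot(c(1,30), c(min(seg_profile[2:31,]),
    max(seg_profile[2:31,])), type="n",
    xlab="Touchpoints (First 10 Online/Last 20 Offline)",
    ylab="Proportion Experiencing Touchpoint")
lines(seg_profile[2:31,1], col="blue", lwd=2.5)
lines(seg_profile[2:31,2], col="red", lwd=2.5)
legend("topright",                   # position assumed; the opening of this call was lost
       c("Offline","Online"), lty=c(1,1),
       lwd=c(2.5,2.5), col=c("blue","red"))
rows<-rowSums(tp)                    # assumed: touchpoints per respondent; zero rows are dropped
tp_cluster<-kmeans(tp[rows>0,], 2, nstart=25)
library(NMF)
fit<-nmf(tp[rows>0,], 2, "frobenius")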


  1. Hey Joel - I am an editor who is studying to become a data scientist and I noticed a typo in the R program. The fourth to last item in the c() command should be "Voucher," not "Vocher." Very interesting article, by the way. Regards.

  2. Hi, Great post as always. You're one of the few people who describe marketing analysis so well using R. Please keep this up! Cheers!

  3. Hi, would appreciate any comments on:
    1) How to decide how many latent components 'best' describe the data?
    2) Any situations in which NMF may be inappropriate? I'm a little leery of applying techniques I don't have a deep mathematical understanding of to new situations?
    3) The (intuitive) difference between applying NMF to a data matrix and, say, PCA or classical MDS?


    1. Let me answer your last two questions first. NMF has become an established technique, as you can see from all the references on the Wikipedia link. The online textbook "The Elements of Statistical Learning" provides a good introduction. Li and Ding have a recent book chapter called "Non-negative Matrix Factorization for Clustering: A Survey," and Nicolas Gillis's "The Why and How of Nonnegative Matrix Factorization" might help. Ankur Moitra has a YouTube video called "New Algorithms for Nonnegative Matrix Factorization and Beyond" with slides at his website.

      Now, deciding how many latent components is best depends on your definition of best. It is decided in the same manner as the number of clusters or number of factors question in cluster and factor analysis. Are additional latent components meaningful and interpretable or are they noise? Let's run several different ranks and take a look. Section 2.6 of the NMF introductory vignette discusses these issues under the heading "Estimating the Factorization Rank" where rank is the number of components.
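      As a rough illustration of "run several different ranks and take a look," here is a sketch in base R. It uses hand-rolled multiplicative updates (the hypothetical helper nmf_mu) and made-up touchpoint probabilities, not the NMF package's own rank-estimation tooling described in the vignette. Reconstruction error will generally keep falling as rank increases, so look for where the improvement levels off and whether the added components are interpretable.

```r
# Illustrative two-segment data; probabilities are assumptions, not the post's.
set.seed(4)
sim_segment <- function(n, p) {
  matrix(rbinom(n * length(p), 1, rep(p, each = n)), nrow = n)
}
V <- rbind(sim_segment(200, c(rep(0.03, 10), rep(0.15, 20))),
           sim_segment(200, c(rep(0.30, 10), rep(0.03, 20))))
V <- V[rowSums(V) > 0, ]

# Hypothetical helper: multiplicative updates for squared-error NMF
nmf_mu <- function(V, rank, n_iter = 300, eps = 1e-9) {
  W <- matrix(runif(nrow(V) * rank), nrow(V), rank)
  H <- matrix(runif(rank * ncol(V)), rank, ncol(V))
  for (i in seq_len(n_iter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

# Fit ranks 1 through 4 and record the Frobenius reconstruction error
ranks <- 1:4
err <- sapply(ranks, function(k) {
  fit <- nmf_mu(V, k)
  sqrt(sum((V - fit$W %*% fit$H)^2))
})
names(err) <- paste0("rank", ranks)
round(err, 2)
```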

      I hope this helps. Good luck.

  4. Thanks this is super helpful!

  5. To build on your example, say I want to look at factors associated with purchase likelihood, for example, using a logistic regression. I could have used the individual episodes (original columns of your data matrix) as independent variables, but this isn't a particularly parsimonious representation and so much detail may obscure the main story (e.g., online shoppers have higher purchase likelihood than offline shoppers):

    1) Can I use the columns of the basis matrix as independent variables in a regression?
    2) When running NMF, can you choose to enforce orthogonality to avoid multicollinearity due to correlations between basis columns?

    Thanks in advance!

    1. Yes, NMF can be used as a pre-processing step for future clustering or regression analysis. And yes, one can impose orthogonality constraints on the factors. You can find lots of examples using the basis matrix in the same way as principal component or factor scores. However, orthogonal NMF has been introduced primarily as a means for dealing with the non-uniqueness of the NMF solution and not because the columns of the basis matrix are collinear.

  6. Hey Joel- Great stuff! I am very confused about which vectors (rows or columns of which matrix, W or H) to use for the latent factors, depending on whether the input matrix is (for example) documents x terms or terms x documents. I saw your other post, but are there any resources you have found that describe this?

    1. You can find a discussion in Chapter 8.1 from a book by David Skillicorn called Understanding Complex Datasets (2007). You may not see that section in Google Books, but there is a book pdf that you can find if you search for it.

      Try to remember: W for the rows and H for the columns. The latent features are the columns of W and the rows of H. When the data matrix is person x measures of the person, the transpose of H yields something that looks like a matrix of factor loadings with simple structure. W looks like a membership matrix from finite mixture models. Of course, you may need to rescale the values of W and H since W*H reproduces the raw data matrix, which can be counts or any arbitrary non-negative intensity scale.

    2. I have that book actually! How does this sound:

      When input is [person x attributes] then W is [person x latent dimensions] and H is [latent dimensions x attributes]. Thus, to get the weightings of each attribute in each latent dimension, look at the columns of H (each attribute is a column and the rows are the latent dimension strength). This is what I think you have done in your post(?)

      When input is [terms x documents] then W is [terms x latent dimensions (topics)] and H is [latent dimensions x documents]. Thus the columns of W describe the strength of each term for a given latent dimension (each column is a latent dimension). The columns of H show the strength or mix of each latent dimension for a given document (each is a column).

    3. Well said! I have had problems trying to explain this point, and your comment will help future readers. Thanks.

    4. Cool, glad I have it down finally. One thing that bugged me about reading Skillicorn (and I asked him about it but will confess I did not understand the answer) was I expected that if I ran NMF on a matrix and then ran it on the transpose of that same matrix, I would find the matrices switched and transposed (as I thought he showed on page 174) but in fact that is not the case and the matrices are completely different.