We begin with a data matrix, a set of numbers arrayed so that each row contains information from a different consumer. Marketing research focuses on the consumer, but the columns are permitted more freedom, although they ought to tell us something about consumer perception, preference or consumption. Whatever data have been collected, we organize them into a data matrix with well-defined rows and columns.
Now, let us search for patterns in those numbers. In particular, do we need all those columns? Is there no redundancy among the multiple measures that we have collected? And what about consumers? Is each one unique, or can they all be described as variations on a few common types?
We might be in luck. Products are designed to appeal to large enough consumer communities in order to be profitable, and consumers in similar situations with the same needs and desires share common perceptions. In addition, consumers talk to each other, and they listen to the "experts" and the marketing communications from providers of goods and services. Product markets are socially constructed knowledge structures. As such, consumer heterogeneity cannot be confined to seeing the same things differently (e.g., calculating distances between rows based on all the columns in the data matrix). Heterogeneity is simultaneously in both the rows and the columns with each consumer community focusing on its own columns.
R provides a good illustration. What other products would you include with R in the same product category? Asked differently, if R were not available, what would you use instead (e.g., SAS, SPSS, Stata, MATLAB, Python)? Your answer would certainly depend on your personal experience as shaped by what others do and say. We tell each other stories, and those stories focus our attention on specific products and features. Free and open is the tale told by R, while SAS speaks of fast solutions. At times it seems that the two user groups live in different worlds or at least in separate gated communities. As a result, the rows and the columns are not independently formed but shaped together by feedback loops and co-creation. This will be the case whenever the rows contain distinct user groups and the columns span all the separating aspects with individuals focusing only on their own relevant subset.
Grocery shopping at a supermarket ought to remind you of how different we are and how differentiated products have become in order to compete for our business. Those of us without very young children know little of the baby food aisle. In fact, knowing where they keep the stuff you want is the prime task of grocery shopping. Products are organized and so are the paths that consumers take through the store. If you want to upset your customers, start moving products from one aisle to another. Grocery store remodeling is disruptive because it breaks the shopping habits that get us through a complex task with as little effort as possible. Our focus is sufficient that we do not even notice the products that we do not buy. Shoppers with children see a different store than the empty nester.
Exploiting Heterogeneity to Simplify the Data Matrix
Can we exploit the heterogeneity shared by products and consumers in order to extract a preference structure that is more concise than the original data matrix? Of course, we need to define "concise": a reduced-dimensional representation is more concise than a higher-dimensional one, and a co-clustering of rows and columns is more concise than the larger data matrix it summarizes. The claim is that matrix factorization yields something more concise than the original data.
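To make "concise" concrete, we can simply count the numbers that must be stored. A minimal sketch (in Python rather than R, and with an illustrative choice of five components) using the dimensions of the Scotch whiskey matrix discussed later in this post:

```python
# Storage comparison: a full data matrix versus a rank-k factorization.
# Dimensions match the whiskey example below; k = 5 is an illustrative choice.
n_rows, n_cols, k = 2218, 21, 5

full_matrix_entries = n_rows * n_cols       # every cell of the data matrix
factorized_entries = k * (n_rows + n_cols)  # row scores plus column scores

print(full_matrix_entries)  # 46578
print(factorized_entries)   # 11195
```

The factorization stores roughly a quarter as many numbers, and the savings grow as the matrix gets larger relative to the number of components.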
Why should a computational algorithm from linear algebra provide insights into consumer thinking and behavior? In general, it does not. Even when we transform the data matrix into correlations and perform a factor analysis, we still must rotate the solution to arrive at some simple structure that can be interpreted. Factor analysis is a factorization of some version of the correlation matrix (e.g., the correlation with communalities in the diagonal or just the unstandardized variance-covariance matrix as in confirmatory factor analysis). The equation R = FF' illustrates the factorization of the correlation matrix R into the matrix product of the factor loadings F. If this notation is confusing or if you simply want a reasonably paced introduction to factor analysis using matrix notation, Ben Lambert's YouTube playlist called Factor Analysis and SEM might be a good place to start.
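The equation R = FF' is easy to verify numerically. Below is a minimal NumPy sketch (Python rather than R, with made-up loadings) showing that the reproduced correlation matrix FF' has the communalities, the row sums of squared loadings, on its diagonal:

```python
import numpy as np

# Hypothetical loadings for four items on two factors (made-up numbers).
F = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

# Reproduced correlation matrix from the factor model: R = FF'.
R_reproduced = F @ F.T

# The diagonal of FF' equals the communalities, i.e., the variance
# each item shares with the common factors.
communalities = (F ** 2).sum(axis=1)
print(np.allclose(np.diag(R_reproduced), communalities))  # True
```

The off-diagonal entries of FF' are the model's predicted correlations; fitting a factor model amounts to choosing F so that these come as close as possible to the observed correlations.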
However, for many of us, it is the data matrix itself that is of interest. We can divide the task into distinct steps. A factor analysis calculates correlations across individuals and provides a structure for the columns. A cluster analysis computes distances between pairs of individuals using the columns and outputs a hard or soft index of membership (e.g., all-or-none assignment or mixing coefficients indicating probability of membership). But if we want both simultaneously, we will need to look elsewhere.
The Biplot as Matrix Factorization
One way to introduce matrix factorization is through the biplot. The biplot displays both rows as points and columns as vectors in the same principal component space. This can be achieved in R in many different ways including FactoMineR and BiplotGUI. Since the goal is a visual plot, the biplot works best when the data reside in a two- or three-dimensional space. More dimensions can be extracted, and this would be helpful if the additional dimensions could be named and interpreted.
As Greenacre demonstrates in a series of examples from his online book, it is the visualization that makes biplots useful in practice, and that visualization depends on seeing rows and columns in a low-dimensional display. That is, the dimensions provide the scaffolding but never need to be interpreted. Instead, column vectors establish the meaning of different directions and row points define regions and local continuities in the column space. This is the approach I took in the above FactoMineR link using biplots to map cluster solutions.
Biplots are constructed with a matrix factorization, relying on the singular value decomposition (SVD) to derive the underlying latent variables. Principal component analysis is how many of us first learned about the SVD. In general, more abstract measures (e.g., reasonable price, high quality and good service) tend to fall into a low-dimensional space, often a single dimension representing the intensity of product usage or an increasing demand for higher-end products and features. By keeping the questions abstract, we fool ourselves into believing that everyone is answering the same questions. However, what constitutes a reasonable price or high quality is not the same for different consumer segments. The devil is in the details, so we keep our questions vague and hope for the best.
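The mechanics are compact enough to sketch. The following NumPy fragment (Python rather than R, random data standing in for ratings, and one common scaling convention among several) computes the SVD of a centered data matrix and forms the row points and column vectors that a biplot would display:

```python
import numpy as np

# Stand-in data: 20 respondents rating 5 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)            # column-center, as in PCA

# Singular value decomposition: Xc = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# One biplot convention: rows as principal component scores (points),
# columns as loadings (vectors), keeping the first two dimensions.
row_points = U[:, :2] * s[:2]      # row markers
col_vectors = Vt[:2].T             # column markers (arrows)

# The two-dimensional biplot approximates the centered data matrix.
X_rank2 = row_points @ col_vectors.T
```

The inner product of a row point with a column vector approximates that respondent's centered score on that attribute, which is why rows projecting far along a column's arrow can be read as scoring high on it.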
Nonnegative matrix factorization (NMF) takes a different approach from SVD by dropping the requirement that the dimensions be orthogonal and imposing the additional constraint that all the entries be nonnegative. Although NMF is not always successful, the goal is to construct the whole by identifying the parts (see Lee and Seung). For example, one has many brand choices when it comes to the purchase of Scotch whiskey. In a previous post I showed how this market can be decomposed into a few brand-specific loyalists who buy only one or possibly two brands plus a more varied group who spread their purchases over several different single malt whiskeys.
The data matrix from the prior post contained a binary buy/no buy over the last year with 2218 customers in the rows and 21 brands in the columns. What are the factors that generate such a data matrix? NMF builds the total market one customer type at a time. There is a segment that buys only the market leader (Chivas Regal), and another that buys only the second most popular brand (Dewar's White Label). These two loyalty clusters are large enough to form their own components. No doubt there are some who buy both brands in varying proportions (e.g., one for special occasions and the other for everyday). Such mixtures do not require a separate component because NMF allows soft component mixtures (e.g., 80% one component and 20% the other). This is what we mean when we say that latent components are the building blocks.
As we continue our decomposition of the data matrix, we uncover two groups who each buy two similar brands and finally the single malt cluster. These latent components or features simultaneously define both the brands and the customer types. Brands are differentiated because they are purchased by different customers, and customers are clustered based on the different brands they buy. The substantive argument is that the columns are manifestations of this hidden structure defined by the brands. Of course, we could have seen purchases guided by variety seeking or price sensitivity (e.g., discovering new brands with different selling propositions or buying whatever is on sale). Yet, brand loyalty is a well-known practice among consumers, and we should not be surprised to find it here with something as important as drinking Scotch whiskey.
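The decomposition itself can be sketched in a few lines. Below is a toy version in NumPy (Python rather than the R NMF package used for the actual analysis) applying Lee and Seung's multiplicative updates to a made-up buy/no-buy matrix with two loyalty groups:

```python
import numpy as np

# Toy buy/no-buy matrix: 6 customers by 4 brands (made-up pattern with
# two loyalty groups, echoing the whiskey example in the post).
V = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(1)
k = 2                              # number of latent components
W = rng.random((V.shape[0], k))    # customer-by-component mixing weights
H = rng.random((k, V.shape[1]))    # component-by-brand purchase profiles

eps = 1e-9                         # guard against division by zero
for _ in range(1000):
    # Lee and Seung multiplicative updates for squared-error NMF;
    # both factors stay nonnegative throughout.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.abs(V - W @ H).max())     # reconstruction error, near zero here
```

Each row of H is a component (a brand-purchase profile), and each row of W says how much of each component a customer mixes, which is exactly the "soft component mixture" described above.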
When will NMF yield consumer insights? NMF builds the whole by identifying the component parts needed to reconstruct the data matrix and then expressing each separate individual as some mixture of those components. Can the trip to the grocery store be seen as such a decomposition? There are household tasks that demand purchases: coffee in the morning, cleaning supplies, food for snacks and meals, replacement light bulbs or batteries, and so on. There are many tasks to be done, each requiring its own set of purchases. The whole, all that is purchased during the shopping trip, can be seen as varying mixtures of these separate requirements. NMF's success or failure results from the correspondence between the matrix factorization and the purchase processes consumers undertake in order to deal with the problems and tasks they confront to varying degrees.