Wednesday, July 29, 2015

But I Don't Want to Be a Statistician!

"For a long time I have thought I was a statistician.... But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.... All in all, I have come to feel that my central interest is in data analysis...."

Opening paragraph from John Tukey "The Future of Data Analysis" (1962)

To begin, we must acknowledge that these labels are largely administrative, determined by who signs your paycheck. Still, I prefer the name "data analysis" with its active connotation. I understand the desire to rebrand data analysis as "data science" given the availability of so much digital information. As data has become big, it has become the star and the center of attention.

We can borrow from Breiman's two cultures of statistical modeling to clarify the changing focus. If our data collection is directed by a generative model, we are members of an established data modeling community and might call ourselves statisticians. On the other hand, the algorithmic modeler (although originally considered a deviant but now rich and sexy) took whatever data was available and made black box predictions. If you need a guide to applied predictive modeling in R, Max Kuhn might be a good place to start.

Nevertheless, causation keeps sneaking in through the back door in the form of causal networks. As an example, choice modeling can be justified as "as if" predictive modeling, but then it cannot be used for product design or pricing. As Judea Pearl notes, most data analysis is "not associational but causal in nature."

Does an inductive bias or schema predispose us to see the world as divided into causes and effects, with features creating preference and preference impacting choice? Technically, the hierarchical Bayes choice model does not require the experimental manipulation of feature levels, for example, reporting the likelihood of bus ridership for individuals with differing demographics. Even here, it is difficult not to see causation at work, with demographics becoming stereotypes. We want to be able to turn the dial, or at least select different individuals, and watch choices change. Are such cognitive tendencies part of statistics?

Moreover, data visualization has always been an integral component in the R statistical programming language. Is data visualization statistics? And what of presentations like Hans Rosling's Let My Dataset Change Your Mindset? Does statistics include argumentation and persuasion?

Hadley Wickham and the Cognitive Interpretation of Data Analysis

You have seen all of his data manipulation packages in R, but you may have missed the theoretical foundations in the paper "A Cognitive Interpretation of Data Analysis" by Grolemund and Wickham. Sensemaking is offered as an organizing force with data analysis as an external tool to aid understanding. We can make sensemaking less vague with an illustration.

Perceptual maps are graphical displays of a data matrix, such as the one below from an earlier post showing the association between 14 European car models and 27 attributes. Our familiarity with Euclidean spaces aids in the interpretation of the 14 x 27 association table. It summarizes the data using a picture and enables us to speak of repositioning car models. The joint plot can be seen as the competitive landscape, and soon the language of marketing warfare brings this simple 14 x 27 table to life. Where is the high ground or an opening for a new entry? How can we guard against an attack from below? This is sensemaking, but is it statistics?
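Joint plots of this kind are typically produced with correspondence analysis. As a hedged sketch, here is how MASS::corresp would map a made-up 3 x 3 brand-by-attribute table (not the 14 x 27 car data, which lives in the earlier post):

```r
# Correspondence analysis of a hypothetical brand-by-attribute table
library(MASS)

tab <- matrix(c(20,  5,  3,
                 4, 18,  6,
                 2,  7, 21),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("BrandA", "BrandB", "BrandC"),
                              c("Sporty", "Economy", "Luxury")))

ca <- corresp(tab, nf = 2)   # two dimensions for the joint map
ca$rscore                    # brand coordinates in the shared space
ca$cscore                    # attribute coordinates in the same space
```

biplot(ca) then draws both sets of points in a single display, which is the perceptual map itself.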

I consider myself to be a marketing researcher, though with a PhD, I get more work calling myself a marketing scientist. I am a data analyst and not a statistician, yet in casual conversation I might say that I am a statistician in the hope that the label provides some information. It seldom does.

I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing. My audience has no training in research or mathematics. Statistics plays a role and R helps, but I never wanted to be a statistician. Not that there is anything wrong with that.

Monday, July 27, 2015

Statistical Models of Judgment and Choice: Deciding What Matters Guided by Attention and Intention

Preference begins with attention, a form of intention-guided perception. You enter the store thirsty on a hot summer day, and all you can see is the beverage cooler at the far end of the aisle with your focus drawn toward the cold beverages that you immediately recognize and desire. Focal attention is such a common experience that we seldom appreciate the important role that it plays in almost every activity. For instance, how are you able to read this post? Automatically and without awareness, you see words and phrases by blurring everything else in your perceptual field.

Similarly, when comparing products and deciding what to buy, you construct a simplified model of the options available and ignore all but the most important features. Selective attention simultaneously moves some aspects into the foreground and pushes everything else into the background, such is the nature of perception and cognition.

[Note: see "A Sparsity-Based Model of Bounded Rationality" for an economic perspective.]

Given that seeing and thinking are sparse by design, why not extend that sparsity to the statistical models used to describe human judgment and decision making? That cooler mentioned in the introductory paragraph is filled with beverages that fall into the goal-derived category of "things to drink on a hot summer day" and each has its own list of distinguishing features. The statistical modeling task begins with many options and even more distinguishing features so that the number of potential predictors is large. However, any particular individual selectively attends to only a small subset of products and features. This is what we mean by sparse predictive models - many variables in the equation but only a few with nonzero coefficients.

[Note: In order not to get lost in two different terminologies, one needs to be careful not to confuse sparse models, in which most parameters equal zero, with sparse matrices, which deal with storing and manipulating large data sets containing many cells equal to zero.]

Statistical Learning with Sparsity

A direct approach might "bet on sparsity" and argue that only a few coefficients can be nonzero given the limitations of human attention and cognition. The R package glmnet will impose a budget on the total costs incurred from paying attention to many features when making a judgment or choice. Thus, with a limited span of attention, we would expect to be able to predict individual responses with only the most important features in the model. The modeler varies a tuning parameter controlling the limits of one's attention and watches predictors enter and leave the equation.
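A minimal sketch of that tuning process with glmnet, assuming simulated data in place of the beverage study; only a few of the fifty candidate features carry nonzero weights, and tightening the penalty (a stricter attention budget) removes predictors from the equation:

```r
library(glmnet)

set.seed(123)
n <- 200; p <- 50                       # 200 respondents, 50 candidate features
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))    # only the first three features matter
y <- drop(X %*% beta + rnorm(n))

fit <- glmnet(X, y, alpha = 1)          # alpha = 1 gives the lasso penalty
sum(coef(fit, s = 0.5) != 0)            # tight budget: few nonzero coefficients
sum(coef(fit, s = 0.01) != 0)           # loose budget: predictors flood back in
```

When we would rather not turn the dial by hand, cv.glmnet() picks the tuning parameter by cross-validation.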

If everyone adopted the same purchase strategy, we could observe the purchase behavior of a group of customers and estimate a single set of parameters using glmnet. Instead of uniformity, however, we are more likely to find considerable heterogeneity with a mixture of different segments and substantial variation within each segment. All that is necessary to violate homogeneity is for the product category to have a high and low end, which is certainly the case with cold beverages. Now, the luxury consumer and the price sensitive will attend to different portions of the retail shelf and require that we be open to the possibility that our data are a mixture of different preference equations. Willingness to spend, of course, is but one of many possible ways of dividing up the product category. We could easily continue our differentiation of the cold beverage market by adding dimensions partitioning the cooler on the basis of calories, carbonation, coffees and teas, designer waters, alcohol content, and more.

Fragmented markets create problems for all statistical models assuming homogeneity and not just glmnet. Attention, the product of goal-directed intention, generates separated communities of consumers with awareness and knowledge of different brands and features within the same product category. The high-dimensional feature space resulting from the coevolving network of customer wants and product offerings forces us to identify a homogeneous consumer segment before fitting glmnet or any other predictive model. What matters in judgment and choice depends on where we focus our attention, which follows from our intentions, and intentions vary between individuals and across contexts (see Context Matters When Modeling Human Judgment and Choice).

Preference is constructed by the individual within a specific context as an intention to achieve some desired end state. Yet, the preference construction process tends to produce a somewhat limited result. A security camera placed by the rear beverage cooler would record a quick scan, followed by more activity as the search narrowed with a possible reset and another search begun or terminated without purchase. The beverage company has spent considerable money "teaching" you about their brand and the product category. You know what to look for before you enter the store because, as a knowledgeable consumer, you have learned a brand and product category representation and you have decided on an ideal positioning for yourself within this representation. For example, you know that beverages can be purchased in different size glass or plastic bottles, and you prefer the 12-ounce plastic bottle with a twist top.

Container preferences are but one of the building blocks acquired in order to complete the beverage purchase process. We can identify all the building blocks using nonnegative matrix factorization (NMF) and use that information to cluster consumers and features simultaneously. This is how we discover which consumers quickly find the regular colas and decide among the brands, types, flavors, sizes, and containers available within this subcategory. Finally, we have our relatively homogeneous dataset of regular cola drinkers for the glmnet R package to analyze. More accurately, we will have separate datasets for each joint customer-feature block and will need to derive different equations with different variables for different consumer segments.
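A sketch of that two-step pipeline under simulated data with two planted customer-feature blocks; the NMF package recovers the blocks, and the dominant latent factor assigns each consumer and each feature to its segment (all intensities here are hypothetical):

```r
library(NMF)

set.seed(42)
# 120 consumers x 12 features with two planted consumer-feature blocks
V <- rbind(cbind(matrix(rpois(60 * 6, 5), 60, 6),   # block 1: high intensity
                 matrix(rpois(60 * 6, 1), 60, 6)),
           cbind(matrix(rpois(60 * 6, 1), 60, 6),
                 matrix(rpois(60 * 6, 5), 60, 6)))  # block 2: high intensity

res <- nmf(V, rank = 2)
consumer_block <- apply(basis(res), 1, which.max)   # dominant factor per consumer
feature_block  <- apply(coef(res), 2, which.max)    # dominant factor per feature
table(consumer_block)                               # segment sizes
```

Each consumer block, paired with its own feature block, then becomes the relatively homogeneous dataset handed to glmnet.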

Tuesday, July 21, 2015

"Models, Models Everywhere!" Brought to You by R

Statistical software packages sell solutions. If you go to the home page for SAS, they will tell you upfront that they sell products and solutions. They link both together under the first tab just below "The Power to Know" mantra. SPSS separates product and solution into separate tabs, but places both next to each other on its home page as the first and second clicks. Obviously, both companies are in the solutions business; you have a problem, they have a solution. It's a good positioning to attract customers who are overworked and over their heads. To be clear, no one is questioning the analytics. SPSS and SAS are not selling snake oil, but they are selling something that is designed to appeal to potential customers with more money than time to spend.

R, on the other hand, appeals to the analyst looking outside the traditional box filled with a limited set of statistical models that keep us collecting the same data year after year and running the same analysis each time. My example comes from marketing research where we are repeatedly asked to do "something multivariate" with ratings of idealized features (e.g., cost without price points, quality lacking any specifications, and customer service stripped of context). Before you propose to replace the rating with some ranking task (e.g., MaxDiff), let me remind you that the problem is not the rating but the abstract feature without referent.

The solution is to get concrete, if only our analytic tools did not lag behind our data collection capabilities. With decontextualized features we could pretend that we were all on the same page and speaking of the same thing. The details, however, reveal the heterogeneity of product usage and experience. The global space defined by price, quality and service becomes parallel worlds with concentrations of customers paying different amounts for product versions of varying quality with diverging expectations and needs for service. I have many more variables and even more missing data. More importantly, I have non-overlapping customer-feature blocks accompanying each community held together by common usage occasions.

This characterization of the data as local places within a global space came, not from marketing research, but from matrix factorization techniques for recommender systems. Modeling preferences for movies and songs has altered the way we look at all consumption. Everything has become more complex. The traditional clustering models started with feature selection and one set of variables for everyone. Similarly, although factorial invariance across distinct populations might require some preliminary examination, we believed that ultimately we would be able to identify a common group of respondents with which we could perform all dimension reduction. After Netflix and Spotify, all we can see are niche-genre pairings of customers and product features.

Of course, all of this is brought to you by R. SAS and SPSS need a business model before they incorporate the latest procedures. R, on the other hand, provides a platform for innovation by others, academics and entrepreneurs, willing to share and promote their best work. The result is a continuous stream of new ways of seeing and thinking embedded in a diverse collection of models and algorithms, which we call R packages. You can find a listing of all the innovative approaches for jointly blocking the rows and columns of a data matrix under the heading "Simultaneous Clustering in R" in my post The Ecology of Data Matrices.

Models are everywhere and from everywhere. R provides the interface enabling us to lift our heads out of our box and peek into the box down the road in someone else's field.

Wednesday, July 15, 2015

Seeing Data as the Product of Underlying Structural Forms

Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects arrayed along the top in columns. Neither the objects nor the measurements need to be preserved in their original form.

It is helpful to remember that the entries in our data matrix are the result of choices made earlier, for not everything that can be recorded is tallied. We must decide on the unit of analysis (What objects go in the rows?) and the level of detail in our measurements (What variables go in the columns?). For example, multilevel data can be aggregated as we deem appropriate, so that our objects can be classroom means rather than individual students nested within classrooms and our measures can be total number correct rather than separate right and wrong for each item on the test.

Even without a prior structure, one could separately cluster the rows and the columns using two different distance matrices, that is, a clustering of the rows/columns with distances calculated using the columns/rows. As an example, we can retrieve a hierarchical cluster heat map from an earlier post created using the R heatmap function.
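A small sketch of that dual clustering with the base R heatmap function, assuming simulated 1-to-9 ratings rather than the original study's data:

```r
set.seed(7)
ratings <- rbind(matrix(sample(6:9, 40 * 8, replace = TRUE), 40, 8),  # upbeat raters
                 matrix(sample(1:4, 40 * 8, replace = TRUE), 40, 8))  # worried raters

# rows and columns are each reordered by their own hierarchical clustering
heatmap(ratings,
        col = colorRampPalette(c("red", "yellow"))(25),
        scale = "none")
```

Setting scale = "none" keeps the raw ratings, so red-to-yellow tracks low-to-high ratings directly.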

Here, yellow represents a higher rating, and lower ratings are in red. Objects with similar column patterns would be grouped together into a cluster and treated as if they constitute a single aggregate object. Thus, when asked about technology adoption, there appears to be a segment toward the bottom of the heatmap who foresee a positive return on investment and are not concerned with potential problems. The flip side appears toward the top with the pattern of higher yellows and lower reds suggesting more worries than anticipated joys. The middle rows seem to contain individuals falling somewhere between these two extremes.

A similar clustering could be performed for the columns. Any two columns with similar patterns down the rows can be combined and an average score calculated. We started with 8 columns and could end with four separate 2-column clusters or two separate 4-column clusters, depending on where we want to draw the line. A cutting point for the number of row clusters seems less obvious, but it is clear that some aggregation is possible. As a result, we have reordered the rows and columns, one at a time, to reveal an underlying structure: technology adoption is viewed in terms of potential gains and losses with individuals arrayed along a dimension anchored by gain and loss endpoints. 

Before we reach any conclusion concerning the usefulness of this type of "dual" clustering, we might wish to recall that the data come from a study of attitudes toward technology acceptance with the wording sufficiently general that every participant could provide ratings for all the items. If we had, instead, asked about concrete steps and concerns toward actual implementation, we might have found small and large companies living in different worlds and focusing on different details. I referred to this as local subspaces in a prior post, and it applies here because larger companies have in-house IT departments and IT changes a company's point-of-view.

To be clear, conversation is filled with likes and dislikes supported by accompanying feelings and beliefs. This level of generality permits us to communicate quickly and easily. The advertising tells you that the product reduces costs, and you fill in the details only to learn later that you set your cost-savings expectations too high.

We need to invent jargon in order to get specific, but jargon requires considerable time and effort to acquire, a task achieved by only a few with specialized expertise (e.g., the chief financial officer in the case of cost cutting). The head of the company and the head of information technology may well be talking about two different domains when each speaks of reliability concerns. If we want the rows of our data matrix to include the entire range of diverse players in technological decision making while keeping the variables concrete, then we will need a substitute for the above "dual" scaling.

We may wish to restate our dilemma. Causal models, such as the following technology acceptance model (TAM), often guide our data collection and analysis with all the inputs measured as attitude ratings.

Using a data set called technology from the R package plspm, one can estimate and test the entire partial least squares path model. Only the latent variables have been shown in the above diagram (here is a link to a more complete graphic), but it should be clear from the description of the variables in the technology data set that Perceived Usefulness (U) is inferred from ratings of useful, accomplish quickly, increase productivity and enhance effectiveness. However, this is not a model of real-world adoption that depends on so many specific factors concerning quality, cost, availability, service, selection and technical support. The devil is in the details, but causal modeling struggles with high dimensional and sparse data. Consequently, we end up with a model of how people talk about technology acceptance and not a model of the adoption process pushed forward by its internal advocates and slowed by its sticking points, detours and dead ends.
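A hedged sketch of fitting such a path model with plspm; because I am not reproducing the technology data set's exact column layout here, the indicators below are simulated and all block names are placeholders:

```r
library(plspm)

set.seed(99)
n <- 150
# simulate a simple causal chain: Expectation -> Usefulness -> Intention
exp_lv <- rnorm(n)
use_lv <- 0.6 * exp_lv + rnorm(n, sd = 0.8)
int_lv <- 0.5 * use_lv + 0.3 * exp_lv + rnorm(n, sd = 0.8)
dat <- data.frame(e1 = exp_lv + rnorm(n), e2 = exp_lv + rnorm(n),
                  u1 = use_lv + rnorm(n), u2 = use_lv + rnorm(n),
                  i1 = int_lv + rnorm(n), i2 = int_lv + rnorm(n))

path <- rbind(Expectation = c(0, 0, 0),   # lower-triangular inner model
              Usefulness  = c(1, 0, 0),
              Intention   = c(1, 1, 0))
colnames(path) <- rownames(path)
blocks <- list(1:2, 3:4, 5:6)             # indicator columns per latent variable

fit <- plspm(dat, path, blocks, modes = rep("A", 3))
fit$path_coefs                            # estimated inner path coefficients
```

The bundled technology data would slot into the same call, with its rating items replacing the simulated indicators in the blocks list.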

Yet, the causal model is so appealing. First comes perception and then follows intention. The model does all the heavy lifting with causation asserted by the path diagram and not discovered in the data. All I need is a rating scale and some general attitude statements that everyone can answer because they avoid any specifics that might result in a DK (don't know) or NA (not applicable). Although ultimately the details are where technology is accepted or rejected, they just do not fit into the path model. We would need to look elsewhere in R for a solution.

As I have been arguing for some time in this blog, the data that we wish to collect ought to have concrete referents. Although we begin with detailed measures, it is our hope that the data matrix can be simplified and expressed as the product of underlying structural forms. At least with technology adoption and other evolving product categories, there seems to be a linkage between product differentiation and consumer fragmentation. Perhaps surprisingly, matrix factorization with its roots as a computational routine from linear algebra seems to be able to untangle the factors responsible for such coevolving networks.

Thursday, July 9, 2015

The Ecology of Local Subspaces: Mixtures of Parochial Views

No matter where you live, your view of the world is biased and limited, which is the beauty of this magazine cover.

As a marketer, of course, all my maps depict, not place, but consumption. For example, in an earlier post I asked, "What apps are on your Smartphone?" It seemed like a reasonable question for a marketing researcher interested in cross-selling or just learning more about product usage. Nothing is special about apps; the same question could have been raised about movies, restaurants, places you drive your car, or stuff that you buy using your credit card. In all these cases, there is more available than any one consumer will use or know. Regardless of what or how we consume, our awareness and familiarity will have the parochial feel of the New Yorker cover.

Not Really Big Data But Too Much for a Single Bite

Because we are surveying consumers, we are unlikely to collect data from more than a few thousand respondents, and there are limits to how much information we can obtain before they stop responding. This is not Netflix ratings or Facebook friending or Google searches. Still, the "world of apps" for the young teen in school and the business person using their phone for work are too distant to analyze in a single bite. Moreover, these worldviews cannot be disentangled by examining marginal row or column effects alone, for they are a mixture of local subspaces defined by the joint clustering of rows and columns together.

I have titled this post "The Ecology of Local Subspaces" in the hope that our knowledge of ecological systems will help us understand consumption. Animals are mobile, yet the features that enable the movement of fish, birds, and mammals tend to be distinct since they inhabit different subspaces. Hoofed animals are not more or less similar because they do not have wings or fins. Two individuals are similar if they have the same apps on their Smartphones only after we have identified the local subspace of relevant apps that define their similarity. Same and different have meaning only within a given frame of reference. Different apps will enter the similarity computations for those with differing worldviews determined by their situation and usage patterns. An earlier post provides a more detailed account and suggests several R packages for simultaneous clustering.

Matrix Factorization Yields a Single Set of Joint Latent Factors 

I have already rejected the traditional approach of marginal row and column clustering. That is, one could cluster the rows using all the columns to calculate distances, either a distance matrix for all the rows or distances to proposed cluster centroids as in k-means clustering. Nothing forces us to make these distances Euclidean, and there are other distance measures in R. The same clustering could be attempted for the columns, although some prefer a factor analysis if they can obtain a reasonable correlation matrix given the columns are binary and sparse.

Our focus, however, is on the matrix whose block pattern is determined by the joint action of the rows and columns together. We seek a single set of joint latent factors with row contributions and column weights that reproduce the data matrix when their product is formed. What forces are at work in the Smartphone apps market that might generate such local subspaces? One could see our joint latent factors as the result of the need to accomplish desired tasks. Then, the apps co-reside on the same Smartphone because together they accomplish those desired tasks, and users who want to perform those tasks receive high scores on the corresponding joint latent factor.

Suppose that we start with 1000 respondents and 100 apps to form a 1000 x 100 data matrix with 100,000 data points (1000 x 100 = 100,000). Finding 10 joint latent factors would mean that we could approximate the 100,000 data points by multiplying a 1000 x 10 user factor score matrix times a 10 x 100 apps factor loading matrix. The matrix multiplication would look like the following diagram (except for more rows and columns for all three matrices) with W containing the 10 factor scores for the 1000 users and H holding the 10 sets of factor loadings for the 100 apps.
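The arithmetic of this compression can be checked directly; random matrices stand in here for the estimated factors:

```r
set.seed(1)
W <- matrix(runif(1000 * 10), 1000, 10)   # 1000 users x 10 factor scores
H <- matrix(runif(10 * 100), 10, 100)     # 10 factors x 100 app loadings
V_hat <- W %*% H                          # reconstructed 1000 x 100 matrix
dim(V_hat)                                # 1000 100
length(W) + length(H)                     # 11000 numbers replace 100,000
```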

The data reduction is obvious: where before I needed 100 app indicators to describe a user, now I need only 10 joint latent factor scores. Clearly, in order to reproduce the original data matrix, we will require the factor loadings telling us which apps load on which joint latent factors.

More importantly, we will have made the representation simpler by naming the joint latent factors responsible for the data reduction. Apps are added for a purpose; in fact, multiple apps may be installed together or over time to achieve a common goal. The coefficient matrix H displays those purposes as the joint latent factors and the apps that serve those purposes as the factor loadings in each row. The simplification is complete with W, which describes the users in terms of those purposes or joint latent factors and not by a listing of the original 100 apps.

Our hope is to discover an underlying structure revealing the hidden forces generating the observed usage patterns across diverse user communities.

Monday, July 6, 2015

Regression with Multicollinearity Yields Multiple Sets of Equally Good Coefficients

The multiple regression equation represents the linear combination of the predictors with the smallest mean-squared error. That linear combination is a factorization of the predictors with the factors equal to the regression weights. You may see the words "factorization" and "decomposition" interchanged, but do not be fooled. The QR decomposition or factorization is the default computational method for the linear model function lm() in R. We start our linear modeling by attempting to minimize least squares error, and we find that a matrix computation accomplishes this task quickly and accurately. Regression is not unique, for matrix factorization is a computational approach that you will rediscover over and over again as you add R packages to your library (see Two Purposes for Matrix Factorization).
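The equivalence is easy to confirm on simulated data: lm() and an explicit QR factorization of the design matrix return the same least squares coefficients.

```r
set.seed(11)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 1 + 2 * x1 - x2 + rnorm(100)

fit <- lm(y ~ x1 + x2)                 # uses the QR decomposition internally
X <- cbind(1, x1, x2)                  # explicit design matrix with intercept
beta_qr <- qr.coef(qr(X), y)           # least squares via QR directly

all.equal(unname(coef(fit)), unname(beta_qr))  # TRUE
```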

Now, multicollinearity may make more sense. Multicollinearity occurs when I have included in the regression equation several predictors that share common variation so that I can predict any one of those predictors from some linear combination of the other predictors (see tolerance in this link). In such a case, it no longer matters what weights I give individual predictors for I get approximately the same results regardless. That is, there are many predictor factorizations yielding approximately the same predictive accuracy. The simplest illustration is two highly correlated predictors for which we obtain equally good predictions using any one predictor alone or any weighted average of the two predictors together. "It don't make no nevermind" for the best solution with the least squares coefficients is not much better than the second best solution or possibly even the 100th best solution. Here, the "best" solution is defined only for this particular dataset before we ever begin to talk about cross-validation.
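The two-highly-correlated-predictors illustration is easy to simulate; the individual weights are poorly determined, yet the competing factorizations predict almost equally well:

```r
set.seed(5)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.05)             # x2 is nearly a copy of x1
y <- x1 + x2 + rnorm(200)

summary(lm(y ~ x1 + x2))$r.squared           # the "best" least squares fit
summary(lm(y ~ x1))$r.squared                # one predictor alone does about as well
summary(lm(y ~ I((x1 + x2) / 2)))$r.squared  # so does their simple average
```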

On the other hand, when all the predictors are mutually independent, we can speak unambiguously about the partitioning of R-squared. Each independent variable makes its unique contribution, and we can simply add their impacts for the total is truly the sum of the parts. This is the case with orthogonal experimental designs where one calculates the relative contribution of each factor, as one does in rating-based conjoint analysis where the effects are linear and additive. However, one needs to be careful when generalizing from rating-based to choice-based conjoint models. Choice is not a linear function of utility so that the impact on share from changing any predictor depends on the values of all the predictors, including the predictor being manipulated. Said differently, the slope of the logistic function is not constant but varies with the values of the predictors.

We will ignore nonlinearity in this post and concentrate on non-additivity. Our concern will be the ambiguity that enters when the predictors are correlated (see my earlier post on The Relative Importance of Predictors for a more complete presentation).

The effects of collinearity are obvious from the formula calculating R-squared from the cells of the correlation matrix between y and the separate x variables. With two predictors, as shown below by the subscripts 1 and 2, we see that R-squared is a complex interplay of the separate correlations of each predictor with y and the interrelationships among the predictors. Of course, everything simplifies when the predictors are independent with r(1,2)=0 and the numerator reducing to the sum of the squared correlations of each predictor with y divided by a denominator equal to one.
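With two predictors, the formula in question is R2 = (r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2), which can be verified against lm() on simulated data:

```r
set.seed(3)
x1 <- rnorm(300); x2 <- 0.5 * x1 + rnorm(300)   # correlated predictors
y <- x1 + x2 + rnorm(300)

r1 <- cor(y, x1); r2 <- cor(y, x2); r12 <- cor(x1, x2)
R2_formula <- (r1^2 + r2^2 - 2 * r1 * r2 * r12) / (1 - r12^2)
R2_lm <- summary(lm(y ~ x1 + x2))$r.squared

all.equal(R2_formula, R2_lm)  # TRUE
```

Setting r12 to zero in the formula collapses it to r1^2 + r2^2, the independent-predictors case described above.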

The formulas for the regression coefficients mirror the same "adjustment" process. If the correlation between the first predictor and y represents the total effect of the first variable on y, then the beta weight shows the direct effect of the first variable after removing its indirect path through the second predictor. Again, when the predictors are uncorrelated, the beta weight equals the correlation with y.
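The corresponding beta-weight formula for standardized variables, beta1 = (r1 - r2 r12) / (1 - r12^2), shows the same adjustment and can be checked the same way:

```r
set.seed(3)
x1 <- rnorm(300); x2 <- 0.5 * x1 + rnorm(300)
y <- x1 + x2 + rnorm(300)

r1 <- cor(y, x1); r2 <- cor(y, x2); r12 <- cor(x1, x2)
beta1 <- (r1 - r2 * r12) / (1 - r12^2)          # direct effect of predictor 1

# regression on standardized variables recovers the same beta weight
beta1_lm <- coef(lm(scale(y) ~ scale(x1) + scale(x2)))[2]
all.equal(beta1, unname(beta1_lm))  # TRUE
```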

We speak of this adjustment as controlling for the other variables in the regression equation. Since we have only two independent variables, we can talk of the effect of variable 1 on y controlling for variable 2. Such a practice seems to imply that the meaning of variable 1 has not been altered by controlling for variable 2. We can be more specific by letting variable 1 be a person's weight, variable 2 be a person's height and the dependent variable be some measure of health. What is the contribution of weight to health controlling for height? Wait a second, weight controlling for height is not the same variable as weight. We have a term for that new variable; we call it obesity. Simply stated, the meaning of a term changes as we move from marginal (correlations) to conditional (partial correlations) representations.

None of this is an issue when our goal is solely prediction. Yet, the human desire to control and explain is great, and it is difficult to resist the temptation to jump from association to causal inference. The key is not to accept the data as given but to search for a representation that enables us to estimate additive effects. One alternative treats observed variables as the bases for latent variable regression in structural equation modeling. Another approach, nonnegative matrix factorization (NMF), yields a representation in terms of building blocks that can additively be combined to form relatively complex structures. The model does not need to be formulated as a matrix factorization problem in order for these computational procedures to yield solutions.

Friday, July 3, 2015

The Nature of Heterogeneity in Coevolving Networks of Customers and Products

The genre shirt asks, "What kind of music do u listen 2?"

Microgenres exist because markets are fragmenting and marketers need names to attract emerging customer segments with increasingly specific preferences. The cost of producing and delivering music now supports a plenitude of joint pairings of recordings and customers. The coevolution of music and its audience binds together listening preferences and available alternatives.

I already know a good deal about your preferences by simply knowing that you listen to German Hip Hop or New Orleans Jazz (see the website Every Noise at Once). Those microgenres are not accidental but were named in order to broadcast the information that customers need in order to find what they want to buy and at the same time that artists require to market their goods. Matchmaking demands its own vocabulary. Over time, the language adapts and only the "fittest" categories survive.

R Makes It Easy

The R package NMF simplifies the analysis as I demonstrated in my post on Modeling Plenitude and Speciation. Unfortunately, the data in that post were limited to only 17 broad music categories, but the R code would have been the same had there been several hundred microgenres or several thousand songs.

The output is straightforward once you understand what nonnegative matrix factorization (NMF) is trying to accomplish. All matrix factorizations, as the name implies, attempt to identify "simpler" matrices or factors that, when multiplied together, approximately reproduce the original data matrix. Simpler, in this case, means that we replace the many observed variables with a much smaller number of latent variables. The belief is that these latent variables will simultaneously account for both row cliques and column microgenres as they coevolve.

This is the matrix factorization diagram that I borrowed from Wikipedia to illustrate the process.

The original data matrix V, holding the non-negative numbers that indicate listening intensities for every microgenre, is approximated by multiplying a customer graded-membership matrix W by a matrix of factor loadings for the observed variables, H. The columns of W and the rows of H correspond to the latent variables. It can be argued that V contains noise, so that reproducing it exactly would be overfitting. Thus, nothing of "true" value is lost by replacing the original observations in V with the customer latent scores in W. As you can see, W has fewer columns, but I can always approximate V using the factor loadings in H as a "decoder" to reconstitute the observations without noise (a form of data compression).
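A minimal sketch of V ≈ WH, using Python's scikit-learn rather than the R NMF package the post relies on (the listening data here are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)

# Toy listening-intensity matrix V: 6 listeners x 4 microgenres,
# built from two additive "taste" parts plus a little noise.
parts = np.array([[5.0, 4.0, 0.0, 0.0],    # e.g., a classical-leaning part
                  [0.0, 0.0, 5.0, 3.0]])   # e.g., a jazz-leaning part
membership = rng.random((6, 2))
V = membership @ parts + 0.05 * rng.random((6, 4))

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # graded memberships (6 x 2)
H = model.components_        # factor loadings   (2 x 4)

# W @ H approximately reconstructs (denoises/compresses) V.
print(np.abs(V - W @ H).max())
```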

Interpreting the Output

We start our interpretation of the output with H containing the factor loadings for the observed variables. For example, we might expect to see one of the latent variables reflecting interest in classical music with sizable factor loadings for the microgenres associated with such music. One of these microgenres could serve as an "anchor" that defines this latent feature if only the most ardent classical music fans listened to this specific microgenre (e.g., Baroque). Hard rock or jazz might be anchors for other latent variables. If we have been careful to select observed variables with high imagery (e.g., is it easy to describe the hard rock listener?), we will be able to name the latent variable based only on its anchors and other high loadings.
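The anchor idea can be sketched numerically (Python; the loadings and genre names below are made up for illustration): a microgenre anchors a latent variable when nearly all of its loading mass sits on that one factor.

```python
import numpy as np

# Hypothetical loadings H (2 latent variables x 4 microgenres).
# Baroque loads only on factor 0, so it can "anchor" that factor.
H = np.array([[4.0, 2.5, 0.0, 0.3],   # factor 0: classical-leaning
              [0.0, 0.5, 3.8, 2.9]])  # factor 1: jazz-leaning
genres = ["baroque", "opera", "bebop", "fusion"]

shares = H / H.sum(axis=0)            # each column now sums to 1
for g, s in zip(genres, shares.T):
    anchor = " (anchor)" if s.max() > 0.95 else ""
    print(f"{g}: factor {s.argmax()}{anchor}")
```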

The respondent matrix W serves a similar function for customers. Each column is the same latent variable we just described in our interpretation of H. For every row, W gives us a set of graded membership indices telling us the degree to which each latent variable contributes to that respondent's pattern of observed scores. The opera lover, for instance, would have a high membership index for the latent variable associated with opera. We would expect to find a similar pairing between the jazz enthusiasts and a jazz-anchored latent variable, assuming that this is the structure underlying the coevolving network of customer and product.

Once again, we need to remember that nonnegative matrix factorization attempts to reproduce the observed data matrix as closely as possible given the constraints placed on it by limiting the number of latent variables and requiring that all the entries be non-negative. The constraint on the number of latent variables should seem familiar to those who have some experience with principal component or factor analysis. However, unlike principal component analysis, NMF adds the restriction that all the elements of W and H must be non-negative. The result is a series of additive effects with latent variables defined as nonnegative linear combinations of observed variables (H) and respondents represented as convex combinations of those same latent variables (W). Moreover, latent variables can only add to the reproduced intensities as they attempt to approximate the original data matrix V. Lee and Seung explain this as "learning the parts of objects."

Why NMF Works with Marketing Data

All this works quite nicely in fragmented markets where products evolve along with customers' desires to generate separate communities. In such coevolving networks, the observed data matrix tends to be sparse and requires a simultaneous clustering of rows and columns. The structure of heterogeneity or individual differences is a mixture of categories and graded category memberships. Customers are free to check as many boxes as they wish and order as much as they like. As a result, respondents can be "pure types" with only one large entry in their row of W and all the other values near zero, or they can be "hybrids" with several substantial membership entries in their row.
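A hedged sketch of the pure-type/hybrid distinction (Python; the membership matrix W below is invented): row-normalize W and look at how concentrated each respondent's membership mass is.

```python
import numpy as np

# Hypothetical graded memberships W: rows are respondents.
W = np.array([[3.0, 0.1, 0.0],   # nearly all mass on factor 0
              [0.1, 0.0, 2.2],   # nearly all mass on factor 2
              [1.4, 1.3, 1.1]])  # mass spread across factors

# Each row becomes a convex combination of the latent variables.
shares = W / W.sum(axis=1, keepdims=True)

# Call a respondent a pure type if one factor carries most of the mass.
for i, s in enumerate(shares):
    kind = "pure type" if s.max() > 0.9 else "hybrid"
    print(f"respondent {i}: {kind} (top factor {s.argmax()})")
```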

Markets will continue to fragment with customers in control and the internet providing more and more customized products. We can see this throughout the entertainment industry (e.g., music, film, television, books, and gaming) but also in almost everything online or in retail warehouses. Increasingly, the analyst will be forced to confront the high-dimensional challenge of dealing with subspaces or local regions populated with non-overlapping customers and products. Block clustering of rows and columns is the first step when the data matrix is sparse because these variables over here are relevant only for those respondents over there. Fortunately, R provides the link to several innovative methods for dealing with such cases (see my post The Ecology of Data Matrices).