Wednesday, May 20, 2015

Clusters Powerful Enough to Generate Their Own Subspaces

Clusters are groupings that have no external label. We start with entities described by a set of measurements but no rule for sorting them by type. Mixture modeling makes this point explicit with its equation showing how each measurement is an independent draw from one of K possible distributions.
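For reference, a Gaussian mixture matching this description is conventionally written as follows. The symbols (x for one row of measurements, pi for the mixing probabilities, mu and Sigma for each group's mean and covariance matrix) follow the discussion in this post; the lowercase index k for a single group is a notational assumption.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```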


Each row of our data matrix contains the measurements for a different object, represented by the vector x in the above equation. If all the rows came from a single normal distribution, then we would not need the subscript k. However, we have a mixture of populations so that measurements come from one of the K groups with probability given by the Greek letter pi. If we knew which group k generated a row, then we would know the mean mu and covariance matrix sigma that describe the Gaussian distribution generating our observation.


The above graphical model attempts to illustrate the entire process using plate notation. That is, the K and the N in the lower right corner of the two boxes indicate that we have chosen not to show all of the K or N different boxes, one for each group and one for each observation, respectively. The arrows represent directed effects so that group membership in the box with [K] is outside the measurement process. With the group k known, the corresponding mean and covariance matrix act as input to generate one of the i = 1,...,N observations.

This graphical model describes a production process that may be responsible for our data matrix. We must decide on a value for K (the number of clusters) and learn the probabilities for each of the K groups (pi is a K-valued vector). But we are not done estimating parameters. Each of the K groups has a mean vector and a variance-covariance matrix that must be estimated, and both depend on the number of columns (p) in the data matrix: (1) Kp means and (2) Kp(p+1)/2 variances and covariances. Perhaps we should be concerned that the number of parameters increases so rapidly with the number of variables p.
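To make the growth concrete, here is a minimal sketch that tallies the free parameters from the counts given above, assuming unrestricted covariance matrices. The function name mixture_param_count is hypothetical, and the K - 1 mixing probabilities are added because pi must sum to one.

```r
# Count the free parameters in a K-component Gaussian mixture with
# unrestricted covariance matrices, for p measured variables.
# (mixture_param_count is a hypothetical helper, not from any package.)
mixture_param_count <- function(K, p) {
  c(
    mixing      = K - 1,               # pi sums to 1, so one value is determined
    means       = K * p,               # one mean vector per group
    covariances = K * p * (p + 1) / 2  # variances plus unique covariances
  )
}

# An Old Faithful-sized problem: two groups, two variables
mixture_param_count(K = 2, p = 2)

# Total parameters as the number of columns grows, holding K = 5
p_values <- c(2, 10, 50, 100)
totals <- sapply(p_values, function(p) sum(mixture_param_count(K = 5, p = p)))
data.frame(p = p_values, total_parameters = totals)
```

The covariance term grows with the square of p, which is why it dominates the total once the data matrix has more than a handful of columns.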

A commonly used example will help us understand the equation and the graphical model. The Old Faithful dataset (available in base R as faithful and a standard example for the R package mclust) illustrates that eruptions from the geyser can come from one of two sources: the brief eruptions in red with shorter waiting times and the extended eruptions in blue with longer waiting periods. There are two possible sources (K=2), and each source generates a bivariate normal distribution of eruption duration and waiting times (N=number of combined red squares and blue dots). Finally, our value of pi can be estimated from the proportions of red and blue points in the figure.
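As a rough illustration of estimating pi from group sizes, the sketch below splits the eruptions into two groups and computes the share in each. It uses k-means from base R as a stand-in, which makes hard assignments and is not the model-based clustering performed by mclust; the seed and object names are arbitrary.

```r
# Old Faithful ships with base R as the data frame `faithful`
# (columns: eruptions, waiting).
data(faithful)

# Stand-in for the mixture model: k-means with K = 2 on standardized columns.
# This is a hard assignment, not the probabilistic clustering from mclust.
set.seed(2015)
km <- kmeans(scale(faithful), centers = 2, nstart = 20)

# Estimate pi as the share of eruptions assigned to each group
pi_hat <- as.numeric(table(km$cluster)) / nrow(faithful)
pi_hat

# Group means on the original scale (minutes)
group_means <- aggregate(faithful, by = list(cluster = km$cluster), FUN = mean)
group_means
```

A fitted mixture model would replace the hard assignments with membership probabilities and estimate a covariance matrix for each group as well.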



Scalability Issues in High Dimensions

The red and the blue eruptions reside in the same two-dimensional space since the clustering depends only on duration. This would not be the case with topic modeling, for example, where each topic might be defined by a specific set of anchor words that would separate each topic from the rest. Similarly, if we were to cluster by music preference, we would discover segments with very specific awareness and knowledge of various artists. Again, the music preference groupings would be localized within different subspaces anchored by the more popular artists within that genre. Market baskets appear much the same with each filled with the staples and then those few items that differentiate among segments (e.g., who buys adult diapers?). In each of these cases, as with feature usage and product familiarity, we are forced to collect information across a wide range of measures because each cluster requires its own set of variables to distinguish itself from the others.

These clusters have been created by powerful forces that are stable over time: major events (e.g., moving out on your own, getting married, buying a house, having a child or retiring) and not so major events (e.g., clothes for work, devices to connect to the internet, or what to buy for dinner). Situational needs and social constraints focus one's attention so that any single individual can become familiar with only a small subset of all that we need to measure in order to construct a complete partition. Your fellow cluster members are others who find themselves in similar circumstances and resolve their conflict in much the same way.

As a result, the data matrix becomes high dimensional with many columns, but the rows are sparse with only a few columns of any intensity for any particular individual. We can try to extend the mixture model so that we can maintain model-based clustering with high-dimensional data (e.g., subspace clustering using the R package HDclassif). The key is to concentrate on the smaller intrinsic dimensionality responsible for specific cluster differences.

Yet, I would argue that nonnegative matrix factorization (NMF) might offer a more productive approach. This blog is filled with posts demonstrating how well NMF works with marketing data, which is reassuring. More importantly, the NMF decomposition corresponds closely with how products are represented in human cognition and memory and how product information is shared through social interactions and marketing communications.

Human decision making adapts to fit the demands of the problem task. In particular, what works with side-by-side comparisons across a handful of attributes for two or three alternatives in a consideration set will not fill our market baskets or help us select a meal from a long list of menu items. This was Herbert Simon's insight. Consumer segments are formed as individuals come to share a common understanding of what is available and what should be preferred. In order to make a choice, we are required to focus our attention on a subspace of all that is available. NMF mimics this simplification process, yielding interpretable building blocks as we attempt to learn the why of consumption.

Wednesday, May 13, 2015

What is Data Science? Can Topic Modeling Help?

Predictive analytics often serves as an introduction to data science, but it may not be the best exemplar given its long history and origins in statistics. David Blei, on the other hand, struggles to define data science through his work on topic modeling and latent Dirichlet allocation. In Episode 10 of Talking Machines, Blei discusses his attempt to design a curriculum for the Data Science Institute at Columbia University. The interview starts at 9:20. If you do not wish to learn about David's career, you can enter the conversation at 13:10. However, you might want to listen all the way to the end because we learn a great deal about data science by hearing how topic modeling is applied across disciplines. Over time data science will be defined as individuals calling themselves "data scientists" change our current practices.

The R Project for Statistical Computing assists by providing access to a diverse collection of applications across fields with differing goals and perspectives. Programming forces us into the details so that we cannot simply talk in generalities. Thus, topic modeling certainly allows us to analyze text documents, such as newspapers or open-ended survey comments. What about ingredients in food recipes? Or, how does topic modeling help us understand matrix factorization? The ability to "compare and contrast" marks a higher level of learning in Bloom's taxonomy.

While visiting Talking Machines, you might also want to download the MP3 files for some of the other episodes. The only way to keep up with the increasing number of R packages is to understand how they fit together into some type of organizational structure, which is what a curriculum provides.

You can hear Geoffrey Hinton, Yoshua Bengio and Yann LeCun discuss the history of deep learning in Episodes 5 and 6. If nothing else, the conversation will help you keep up as R adds packages for deep neural networks and representation learning. In addition, we might reconsider our old favorites, like predictive analytics, with a new understanding. For example, what may be predictive in choice modeling might not be the individual features as given in the product description but the holistic representation as perceived by a consumer with a history of similar purchases in similar situations. We would not discover that by estimating separate coefficients for each feature as we do with our current hierarchical Bayesian models. Happily, we can look elsewhere in R for models that can learn such a product representation.

Monday, May 11, 2015

Centering and Standardizing: Don't Confuse Your Rows with Your Columns

R uses the generic scale() function to center and standardize variables in the columns of data matrices. The argument center=TRUE subtracts the column mean from each score in that column, and the argument scale=TRUE divides by the column standard deviation (TRUE is the default for both arguments). For instance, weight and height come in different units that can be compared more easily when transformed into standard deviation units. Since such a linear transformation does not alter the correlations among the variables, it is often recommended so that the relative effects of variables measured on different scales can be evaluated. However, the same does not hold when we center the rows.

A concrete example will help. We ask a group of consumers to indicate the importance of several purchase criteria using a scale from 0=not at all important to 10=extremely important. We note that consumers tend to use only a restricted range of the scale with some rating all the items uniformly higher or lower. It is not uncommon to interpret this effect as a measurement bias, a preference to use different portions of the scale. Consequently, we decide to "correct for scale usage" by calculating deviation scores. First, we compute the mean score for each respondent across all the purchase criteria ratings, and then we subtract that mean from each rating in that row so that we have deviation scores. The mean of each consumer or row is now zero. Treating our transformed scores like any other data, we run a factor analysis using our "unbiased" deviation ratings.

Unfortunately, we are now measuring something different. After row-centering, individuals with high product involvement who place considerable importance on all the purchase criteria have the same rating profiles as those more casual users who seldom attend to any of the details. In addition, by forcing the mean for every consumer to equal zero, we have created a linear dependency among the p variables. That is, we started with p separate ratings that were free to vary and added the restriction that the p variables sum to zero. We lose one degree of freedom when we compute scores that are deviations about the mean (as we lose one df in the denominator for the standard deviation and divide by n-1 rather than n). The result is a singular correlation matrix that can no longer be inverted.

Seeing is believing

The most straightforward way to show the effects of row-centering is to generate some multivariate normal data without any correlation among the variables, calculate the deviation scores about each row mean, and examine any impact on the correlation matrix. I would suggest that you copy the following R code and replicate the analysis.

library(MASS)
 
# p is the number of variables
p<-11
 
# simulates 1000 rows with
# means=5 and std deviations=1.5
x<-mvrnorm(n=1000,rep(5,p),diag((1.5^2),p))
summary(x)
apply(x, 2, sd)
 
# calculate correlation matrix
R<-cor(x)
# correlations after columns centered
# i.e., column means now 0's not 5's
x2<-scale(x, scale=FALSE)
summary(x2)
apply(x2, 2, sd)
R2<-cor(x2)
round(R2-R,8) # identical matrices
 
# checks correlation matrix singularity
solve(R)
 
# row center the ratings
# (the vector of row means recycles down each column)
x_rowcenter<-x-rowMeans(x)
RC<-cor(x_rowcenter)
round(RC-R,8) # uniformly negative
 
# row-centered correlations singular
solve(RC)
 
# original row means normally distributed
hist(round(apply(x,1,mean),5))
 
# row-centered row means = 0
table(round(apply(x_rowcenter,1,mean),5))
 
# mean lower triangular correlation
mean(R[lower.tri(R, diag = FALSE)])
mean(RC[lower.tri(RC, diag = FALSE)])
 
# average correlation = -1/(p-1)
# for independent variables
# where p = # columns
-1/(p-1)


I have set the number of variables p to be 11, but you can change that to any number. The last line of the R code tells us that the average correlation in a set of p independent variables will equal -1/(p-1) after row-centering. The R code enables you to test that formula by manipulating p. In the end, you will discover that the impact of row-centering is greatest when the number of uncorrelated variables is smallest. Of course, we do not anticipate independent measures so that it might be better to think in terms of underlying dimensions rather than number of columns (e.g., the number of principal components suggested by the scree test). If your 20 ratings tap only 3 underlying dimensions, then the p in our formula might be closer to 3 than 20.
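The formula can be derived in a few lines. This sketch assumes p independent columns with a common variance, as in the simulation; the symbols y for a row-centered score and sigma squared for the common variance are introduced here for illustration.

```latex
\begin{aligned}
y_j &= x_j - \bar{x}, \qquad \bar{x} = \frac{1}{p}\sum_{k=1}^{p} x_k \\
\operatorname{Var}(y_j) &= \sigma^2 - \frac{2\sigma^2}{p} + \frac{\sigma^2}{p} = \sigma^2\,\frac{p-1}{p} \\
\operatorname{Cov}(y_j, y_k) &= 0 - \frac{\sigma^2}{p} - \frac{\sigma^2}{p} + \frac{\sigma^2}{p} = -\frac{\sigma^2}{p}, \qquad j \neq k \\
\operatorname{Corr}(y_j, y_k) &= \frac{-\sigma^2/p}{\sigma^2\,(p-1)/p} = -\frac{1}{p-1}
\end{aligned}
```

With p = 11, the expected correlation between any two row-centered columns is -0.1, which is the value printed by the last line of the code.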

At this point, you should be asking how we can run regression or factor analysis when the correlation matrix is singular. Well, sometimes we will get a warning, depending on the rounding error allowed by the package. Certainly, the solve function was up to the task. My experience has been that factor analysis packages tend to be more forgiving than multiple regression. Regardless, a set of variables that are constrained to sum to any constant cannot be treated as if the measures were free to vary (e.g., market share, forced choice, rankings, and compositional data).

R deals with such constrained data using the compositions package. The composition of an entity is represented by the percentages of the elements it contains. The racial or religious composition of a city can be the same for cities of very different sizes. However, it does not matter whether the sum is zero, one or a hundred. Forced choice, such as MaxDiff (Sawtooth's MaxDiff=100*p), and ranking tasks (sum of the ranks) are also forced to add to a constant.

I have attempted to describe some of the limitations associated with such approaches in an earlier warning about MaxDiff. Clearly, the data analysis becomes more complicated when we place a priori restrictions on the combined values of a set of variables, suggesting that we may want to be certain that our work requires us to force a choice or row center.

Friday, May 8, 2015

What Can We Learn from the Apps on Your Smartphone? Topic Modeling and Matrix Factorization

The website for The Burning House begins with a simple question:
If your house was burning, what would you take with you? It's a conflict between what's practical, valuable and sentimental. What you would take reflects your interests, background and priorities. Think of it as an interview condensed into one question.
But what about the more mundane and everyday stuff? As an example from an earlier post, I borrowed a quote popularized by the Iron Chef, "Tell me what you eat, and I will tell you what you are."

Do the choices we make reveal some underlying trait or situational demands that will enable us to predict future behavior? If one gathers up all the valuables and leaves behind the wedding album to burn in the house, can we guess more accurately using this information how long the marriage will last?


What about the apps on your smartphone? The list is long and growing with games, lifestyle, travel, music and entertainment, plus utilities, education, books, reference and business. Such a categorization imposes an organization that may not reflect the patterns we see in actual app download and usage. As an analogy, we can divide the supermarket into aisles marked with different signage (bread here in the middle of the store and butter over there along the wall), yet the market baskets rolled through the checkout reflect a very different associative network. Your phone is the basket, and your apps are the items purchased (or frequency of app usage equals amount of item bought). What is in your basket reflects your personal combination of wants and needs (e.g., a special trip before the big game on TV or your primary shopping for the week).

An alternative approach might treat this as a form of topic modeling with your phone as a document and app usage levels as frequency of word occurrences. Topics are the latent variables that generate the pattern of app co-occurrences (e.g., similar interests, needs or networks). The R package stm for Structural Topic Modeling may provide a gentle introduction for the social scientist, although Latent Dirichlet Allocation (LDA) still remains one step beyond the statistical training for most. This, of course, will change as more researchers are motivated to learn the mathematics given the promise of easier to use R packages.

Meanwhile, nonnegative matrix factorization (NMF) accomplishes a similar task with decomposition techniques from linear algebra. We start with the assumption that there are underlying usage patterns, such as, taking pictures, emailing them or posting to Facebook. Several apps are likely to achieve the same goal. The purposes that organize app usage are the latent variables. These same latent variables are also responsible for similarity among users. Users are similar because they use the same apps, but they use the same apps because they share the same motivations or purposes represented by the latent variables. The apps work together for like users to achieve the same ends.

To simplify, we can think of this as a joint factor analysis of the apps and a cluster analysis of the users. Thus, from a single data matrix NMF delivers two matrices: (1) "factor loadings" showing the relationship between the latent variables and the apps and (2) cluster membership weights reported for each user indicating the contribution of each latent variable to that user's app usage profile. We now have a matrix factorization or decomposition:

[ usage data ] = [ user latent profile ] x [ app latent profile ].
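The decomposition above can be sketched in a few lines of base R. This is a minimal illustration using the multiplicative update rules of Lee and Seung for squared error, not the algorithm or interface of the R package NMF; the usage counts, app names, and the function nmf_sketch are all made up for the example.

```r
# Toy usage matrix: 6 users (rows) by 5 apps (columns); counts are made up.
usage <- matrix(c(9, 8, 7, 0, 0,
                  8, 9, 6, 0, 1,
                  7, 7, 9, 1, 0,
                  0, 1, 0, 8, 9,
                  1, 0, 0, 9, 8,
                  5, 4, 4, 5, 4),   # a user who mixes both patterns
                nrow = 6, byrow = TRUE,
                dimnames = list(paste0("user", 1:6),
                                c("camera", "email", "social", "puzzle", "arcade")))

# Multiplicative updates that reduce squared reconstruction error.
# (nmf_sketch is a hypothetical helper written for this illustration.)
nmf_sketch <- function(V, rank, iterations = 500, eps = 1e-9) {
  W <- matrix(runif(nrow(V) * rank), nrow(V), rank)   # user latent profile
  H <- matrix(runif(rank * ncol(V)), rank, ncol(V))   # app latent profile
  for (i in seq_len(iterations)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  dimnames(W) <- list(rownames(V), paste0("latent", seq_len(rank)))
  dimnames(H) <- list(paste0("latent", seq_len(rank)), colnames(V))
  list(W = W, H = H)
}

set.seed(42)
fit <- nmf_sketch(usage, rank = 2)

round(fit$W, 2)              # [ user latent profile ]
round(fit$H, 2)              # [ app latent profile ]
round(fit$W %*% fit$H, 1)    # approximate reconstruction of the usage data
```

In this toy example the sixth user should carry weight on both latent variables, which is the sense in which a user can belong to more than one "cluster."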

Do you play games on your phone to pass the time while waiting? These games are likely not the same ones you would play if you wanted to compete against others. More than one game can accomplish the same goal, so multiple apps "load" on the same latent variable. A user might do both at different times; therefore, that user belongs in both "clusters" (with each user cluster defined by a different latent variable). In NMF, the whole can be generated as the sum of the parts, which I have illustrated in my building blocks post.

Nielsen reports that fewer than 30 apps are used by the average smartphone owner. Can we agree that the usage data matrix will be sparse, given the number of available apps? However, we expect to find user-by-app blocks with higher densities. Such "clumping" occurs in high-dimensional spaces when users are heterogeneous with different groupings of wants and needs. These user clusters seek out blocks of apps that serve their purposes, creating joint clusters of users and apps. This is what we uncover with NMF and LDA.

Note: My post on Brand and Product Category Representation ends with a list of examples containing the R code needed to run the R package NMF and perform the type of analysis reviewed here.

Monday, May 4, 2015

Clusters May Be Categorical but Cluster Membership Is Not All-or-None

Very early in the study of statistics and R, we learn that random variables can be either categorical or continuous. Regrettably, we are forced to relearn this distinction over and over again as we debug error messages produced by our code (e.g., "x must be numeric"). R will remind us that if the function expects an argument to be a factor, our input ought to be a factor (although sometimes the function will do the conversion for us). Dichotomous variables do give us some flexibility, for sex can be entered as a factor with values 'male' and 'female' or coded as numeric with values of 0 and 1 indicating degree of 'maleness' or 'femaleness' depending on whether male or female is assigned the value of 1. Similarly, when the categorical variable has many levels, there is no reason not to select one of the levels as the basis for comparison. Then, the dummy coding remains 0 and 1 with the base level coded as 0s for all the comparisons (e.g., Catholic vs Protestant, Jewish vs Protestant, and so on).
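A short sketch of the dummy coding just described, using base R only. The respondents and their values are hypothetical and serve only to show the 0/1 columns that result.

```r
# Hypothetical respondents; the values are illustrative only.
religion <- factor(c("Protestant", "Catholic", "Jewish", "Catholic", "Protestant"))

# Make Protestant the base level, so every comparison is "vs Protestant"
religion <- relevel(religion, ref = "Protestant")

# model.matrix() builds the 0/1 dummy columns that modeling functions use
dummies <- model.matrix(~ religion)
dummies

# A dichotomous variable entered either as a factor or as a 0/1 numeric
sex <- factor(c("male", "female", "female", "male", "female"))
is_male <- as.numeric(sex == "male")
is_male
```

The base level never gets its own column; a Protestant respondent is the row with zeros in both the Catholic and Jewish columns.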

Categories vs. Dimensions or Continuous vs. Discrete

The debate over psychiatric classification has brought the battle into the news, as have changes in the admission policies of women's colleges to accept transgender applicants. I have discussed the issue under both clustering and latent variable modeling. It seems just too easy to dissolve the boundaries and blur the distinctions for almost any categorization scheme. For instance, race is categorical, and one is either European or Asian, unless of course, they are some mixture of the two. I have borrowed that example and the following figure from a video lecture by Katherine Heller (also shown as Figure 1 in her paper).


We are familiar with the finite mixture model from the mclust R package. Although I have shown only the contour ellipses, you should be able to imagine the two multivariate normal distributions in some high-dimensional space that would separate Europeans (perhaps the blue ellipse) and Asians (which then must be in green). As geographical barriers fall, racial membership becomes partial with many shades or groupings between the ideal types of the finite mixture model.

Both the finite mixture model (FMM) and the mixed membership model (MMM) permit data points to fall between the centroids of the two most extreme densities. For example, a finite mixture model will yield a probability of cluster membership that can range from 0 to 1. However, the probability from the finite mixture model represents classification error, which increases with cluster overlapping. This is not unlike the misclassification from a discriminant analysis, that is, the groups are distinct but we are unable to separate them completely with the available data. The probabilities from the partial or mixed membership model, on the other hand, do not represent uncertainty but a span of possible clusters arrayed between the two extreme ideals.
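To see why the finite mixture probability reflects classification uncertainty that grows with overlap, consider this minimal sketch with two equally likely normal clusters on a single measurement. The means, standard deviation, and function name are hypothetical choices for illustration.

```r
# Posterior probability of membership in cluster 1 for a two-component
# normal mixture with equal mixing weights and unit standard deviations.
# (posterior_cluster1 is a hypothetical helper; the means are made up.)
posterior_cluster1 <- function(x, mu = c(-1, 1), pi_k = c(0.5, 0.5), sd_k = 1) {
  dens1 <- pi_k[1] * dnorm(x, mu[1], sd_k)
  dens2 <- pi_k[2] * dnorm(x, mu[2], sd_k)
  dens1 / (dens1 + dens2)   # Bayes' rule
}

x <- c(-3, -1, 0, 1, 3)

# Heavily overlapping clusters: probabilities change gradually with x
round(posterior_cluster1(x, mu = c(-1, 1)), 3)

# Well-separated clusters: probabilities sit near 0 or 1 except at the midpoint
round(posterior_cluster1(x, mu = c(-3, 3)), 3)
```

In both cases each observation still comes from exactly one cluster. The intermediate probabilities describe our inability to tell which one, not a partial membership.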

The analysis of the voting record from United States senators presented toward the end of Heller's paper might help us understand this previous point. One might infer a latent continuum that separates Democrats and Republicans, but the distribution is not uniform. Democrats tend to bunch at one end and Republicans clump at the other. In between are Democrats from more conservative southern states and Republicans from more liberal northern states. One might argue that the voting dynamics or data generation processes are different for these clusters so that it makes sense to think of the continuum as separated into four regions with boundaries imposed by the political forces constraining their votes.

Interestingly, we learn something about the bills and resolutions in the process of accounting for differences among senators. Some votes are not strictly party-line. For example, senators from states with large military bases often vote the same on appropriations impacting their constituency regardless of their party affiliation. More importantly, the party accepts such votes as necessary and does not demand loyalty unless it is necessary to pass or defeat important bills. Legislation comes in different types and each elicits its own voting strategies (the switching interpretation).

Implementations of Mixed Membership Models in R

One of the less demanding approaches is the grade of membership (GoM) model. The slides from April Galyardt's NIPS 2012 workshop illustrate the major points (see her dissertation for a more complete discussion). R implements the GoM model with the gom.em function in the package sirt. For a somewhat more general treatment, R offers a battery of IRT mixture model packages (e.g., psychomix and mixRasch). However, nonnegative matrix factorization (NMF) accomplishes a similar mixed membership modeling simultaneously for both the rows and columns of a data matrix using only a decomposition procedure from linear algebra.

Simply put, NMF seeks a common set of latent variables to serve as the basis for both the rows and columns of a data matrix. In our roll call voting example, we might list the senators as the rows and the bills as the columns. Legislation that yielded pure party-line votes would be placed at the extremes of a latent representation that might be called party affiliation. Senators who always vote with their party would also be placed at the ends of a line separating these pure types. We might call this latent variable a dimension representing the Democrat-Republican continuum, although the distribution appears bimodal. Any two points define a line, so we can always infer a dimension even if there is little or no density except at the ends.

Some votes demand party loyalty, but other measures evoke a "protect-my-seat" response (e.g., any bill that helps a large industry or constituency in the senator's state). Such measures would move some senators away from the edges of the liberal-conservative divide as they switched voting strategies from party to state. Alternatively, the bill may elicit social versus economic concerns, or provoke nervousness concerning a primary challenge. Each voting strategy will group bills and cluster senators by generating latent basis vectors. You can think of an NMF as a joint factor analysis of the votes (columns) and cluster analysis of the senators (rows). Each latent variable is a voting strategy so that senators who switch strategies depending on the bill will have mixed memberships, as will bills that can be voted for or against for different reasons.

Wednesday, April 29, 2015

Modeling the Latent Structure That Shapes Brand Learning

What is a brand? Metaphorically, the brand is the white sphere in the middle of this figure, that is, the ball surrounded by the radiating black cones. Of course, no ball has been drawn, just the conic thorns positioned so that we construct the sphere as an organizing structure (a form of reification in Gestalt psychology). Both perception and cognition organize input into Gestalts or Wholes generalizing previously learned forms and configurations.

It is because we are familiar with pictures like the following that we impose an organization on the black objects and "see" a ball with spikes. You did not need to be thinking about "spikey balls" for the figure recruits its own interpretative frame. Similarly, brands and product categories impose structure on feature sets. The brand may be an associative net (what comes to mind when I say "McDonald's"), but that network is built on a scaffolding that we can model using R.

In a previous post, I outlined how product categories are defined by their unique tradeoff of strengths and weaknesses. In particular, the features associated with fast food restaurants fall along a continuum from easy to difficult to deliver. Speed is achievable. Quality food seems to be somewhat harder to serve. Brands within the product category can separate themselves by offering their own unique affordances.

My example was Subway Sandwich offering "fresh" fast food. Respondents were given a list of 8 attributes (seating, menu selection, ease of ordering, food preparation, taste, filling, healthy and fresh) and asked to check off which attributes Subway successfully delivered.


The item characteristic curves in the above figure were generated using the R package ltm (latent trait modeling). The statistical model comes from achievement testing, which is why the attribute is called an item and the x-axis is labeled as ability. Test items can be arrayed in terms of their difficulty with the harder questions answered correctly only by the smarter students. Ability has been standardized, so the x-axis shows z-scores. The curves are the logistic functions displaying how the probability of endorsing each of the eight attributes is a function of each respondent's standing on the x-axis. Only 6 of the 8 attribute names can be seen on the chart with the labels for the lowest two curves, menu and seating, falling off the chart.

The zero point for ability is the average for the sample filling in the checklist. The S-curve for "fresh" has the highest elevation and is the farthest to the left. Reading up from ability equals zero, we can see that on the average more than 80% are likely to tell us that Subway serves fresh food. You can see the gap between the four higher curves for fresh, healthy, filling and taste and the four lower curves for preparation, ordering, menu and seating. The lowest S-curve indicates that the average respondent would check seating with a likelihood of less than 20%.
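The shape of these curves can be reproduced with a two-parameter logistic function in base R. The difficulty and discrimination values below are hypothetical, chosen only to echo the ordering described above; they are not estimates from the Subway data or output from ltm.

```r
# Hypothetical item parameters; NOT estimates from the Subway checklist.
items <- data.frame(
  attribute      = c("fresh", "healthy", "filling", "taste",
                     "preparation", "ordering", "menu", "seating"),
  difficulty     = c(-1.2, -0.9, -0.7, -0.5, 0.6, 0.8, 1.1, 1.3),
  discrimination = 1.5
)

# Probability of endorsing an item given a standing (theta) on the latent trait
endorse_prob <- function(theta, difficulty, discrimination) {
  plogis(discrimination * (theta - difficulty))
}

# Endorsement probabilities for the average respondent (theta = 0)
at_average <- endorse_prob(0, items$difficulty, items$discrimination)
names(at_average) <- items$attribute
round(at_average, 2)

# The same items for respondents one standard deviation below and above average
theta <- c(-1, 0, 1)
curves <- sapply(theta, function(t)
  endorse_prob(t, items$difficulty, items$discrimination))
dimnames(curves) <- list(items$attribute, paste0("theta=", theta))
round(curves, 2)
```

Easier items have lower difficulty values, so their curves sit farther to the left and higher at any given point on the x-axis.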

What is invariant is the checklist pattern. Those who love Subway might check all the attributes except for the bottom one or two. For example, families may be fine with everything but the available seating. On the other hand, those looking for a greasy hamburger might reluctantly endorse fresh or healthy and nothing else. As one moves from left to right along the Ability scale, the checklist is filled in with fresh, then healthy, and so on in an order reflecting the brand image. Fresh is easy for Subway. Healthy is only a little more difficult, but seating can be a problem. Moreover, an individual who is happy with the seating and the menu is likely to be satisfied with the taste and the freshness of the food. Response patterns follow an ordering that reflects the underlying scaffolding holding the brand concept together.

Latent trait or item response theory is a statistical model where we estimate the parameters of the equation specifying the relationship between the latent x-axis and response probability. R offers nonparametric alternatives such as KernSmoothIRT and princurve. Hastie's work on principal curves may be of particular interest since it comes from outside of achievement testing. A more general treatment of the same issues enables us to take a different perspective and see how observed responses are constrained by the underlying data generation process.

Branding is the unique spin added to a product category that has evolved out of repeated interactions between consumer needs and providers' skill to satisfy those needs at a profit. The data scientist can see the impact of that branding when we model consumer perceptions. Consumers are not scientists running comparative tests under standardized conditions. Moreover, inferences are made when experience is lacking. Our take-out only customer feels comfortable commenting on seating, although they may have only glanced on their way in or out. It gets worse for ratings on scales and for attributes that the average consumer lacks the expertise to evaluate (e.g., credence attributes associated with quality, reliability or efficacy).

We often believe that product experience is self-evident and definitive when, in fact, it may be ambiguous and even seductive. Much of what we know about products, even products that we use, has been shaped by a latent structure learned from what we have heard and read. Even if the thorns are real, the scaffolding comes from others.

Wednesday, April 22, 2015

Conjoint Analysis and the Strange World of All Possible Feature Combinations

The choice modeler looks over the adjacent display of cheeses and sees the joint marginal effects of the dimensions spanning the feature space: milk source, type, origin, moisture content, added mold or bacteria, aging, salting, packaging, price, and much more. Literally, if products are feature bundles, then one needs to specify all the sources of variation generating so many different cheeses. Here are the cheeses from goats, sheep and cows. Some are local, and some are imported from different countries. In addition, we will require columns separating the hard and soft cheeses. The feature list can become quite long. In the end, one accounts for all the different cheeses with a feature taxonomy consisting of a large multidimensional space of all possible feature combinations. Every cheese falls into a single cell in the joint distribution, and the empty cells represent new product possibilities (unless the feature configuration is impossible).
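To get a feel for how quickly the space of all possible feature combinations grows, here is a small sketch using expand.grid from base R. The features and their levels are hypothetical and far fewer than a real cheese taxonomy would need.

```r
# Hypothetical feature levels for a cheese category (illustrative only)
features <- list(
  milk    = c("cow", "goat", "sheep"),
  origin  = c("local", "imported"),
  texture = c("hard", "soft"),
  aging   = c("fresh", "aged"),
  price   = c("low", "medium", "high")
)

# Every cell in the joint distribution of features
all_combinations <- expand.grid(features, stringsAsFactors = FALSE)
nrow(all_combinations)    # 3 * 2 * 2 * 2 * 3 cells
head(all_combinations)

# The count is the product of the number of levels per feature
prod(lengths(features))
```

Even five coarse features yield dozens of cells, and each additional feature multiplies the count, so most cells will be empty in any real cooler.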

The retailer, on the other hand, was probably thinking more of supply and demand when filling this cooler with cheeses. It's an optimization problem that we can simplify as a tradeoff between losing customers because you do not have what they are looking for and losing money when the product spoils. Meanwhile, consumers have their own concerns: they are buying for a reason and may infer a means to a desired end from individual features or from complex combinations of transformed features. Neither the retailer nor the consumer is a naturalist seeking a feature taxonomy. In fact, except for the connoisseur, most consumers have very limited knowledge of any product category. We are simply not familiar with all the offerings, nor could we name all the alternatives in the cheese cooler or the dog food aisle or the shelves filled with condensed soups. Instead, we rely on the physical or online displays to remind ourselves what is available, but even then, we do not consider every alternative or try to differentiate among all the products.

Thus, the conjoint world of all possible feature combinations is strange to a consumer who sees the products from a purposefully restricted perspective. The consumer categorizes products using goal-derived categories, for instance, restocking because you are running out of slices for your ham and Swiss sandwich. Attention, categorization and preference are therefore situated within the purchase context defined by goals and by the purchase process, including the physical product display (e.g., a deli counter with an attendant is not the same as self-service selection of prepackaged products). Bartels and Johnson summarize this emerging view in their recent article "Connecting Cognition and Consumer Choice" (see Section 3, Learning and Constructing Value in Context).


Speaking of cheese (in particular, Brillat-Savarin cheese), we are reminded of the Brillat-Savarin quote popularized by the original Japanese Iron Chef TV show: "Tell me what you eat, and I will tell you what you are." Can it be this simple? I can tell you what is being discussed if you give me a "bag of words" and the R package topicmodels. R-bloggers shows how to recover the major cuisines from a list of ingredients from different recipes. My claim is that I learn a great deal by asking if you buy individually wrapped slices of processed American cheese. As Charles de Gaulle quipped, "How can you govern a country which has 246 varieties of cheese?" One can start by identifying the latent goal structure that shapes awareness, familiarity and usage.

Much is revealed by learning what music you listen to, your familiarity with various providers in a product category, which brands of Scotch whisky you buy, or the food choices you make for breakfast. In each of those posts, the R package NMF was able to discover the underlying latent variables that could reproduce raw data with many columns and with most rows containing only a few responses (e.g., Netflix ratings, where each viewer in the rows has seen only a small proportion of all the movies in the columns). Nonnegative matrix factorization (NMF), however, is only one method for uncovering the hidden forces that structure consumption activities. You are free to select any latent variable model that can accommodate such high-dimensional sparse data (e.g., the already mentioned topic modeling, the R package HDclassif, the R package bclust, and more on the way). My preference for NMF stems from its ease of use and its successful application across a diverse range of marketing research data, as reported in prior posts.
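For readers who want to see the mechanics, the following is a rough Python sketch of nonnegative matrix factorization using the classic multiplicative updates for squared error. It is not the R package NMF used in the earlier posts, whose default algorithm may differ, and the tiny purchase matrix is invented: rows are hypothetical shoppers, columns are hypothetical cheeses, and a 1 means "buys it." Zeros are treated as observed non-purchases, which is a simplifying assumption.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor V (nonnegative) into W @ H with W, H >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps   # row (shopper) weights on latent features
    H = rng.random((rank, n_cols)) + eps   # latent feature loadings on columns
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors

# Made-up sparse data: two groups of shoppers buying from two sets of cheeses.
V = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

W, H, errors = nmf(V, rank=2)
print("shopper weights on the two latent features:")
print(np.round(W, 2))
print("latent feature loadings on the six cheeses:")
print(np.round(H, 2))
```

The rows of H describe each latent feature in terms of the columns, and the rows of W show how strongly each shopper draws on each latent feature. The rank of 2 is chosen by hand here because the toy data were built with two groups; with real data the rank has to be selected.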

Unfortunately, in the strange world of all possible feature combinations, consumers are unable to apply the strategies that work so well in the marketplace. Given nothing other than hypothetical products described by lists of orthogonal features, what else can a respondent do but rely on the information provided?