Wednesday, April 29, 2015

Modeling the Latent Structure That Shapes Brand Learning

What is a brand? Metaphorically, the brand is the white sphere in the middle of this figure, that is, the ball surrounded by the radiating black cones. Of course, no ball has been drawn, just the conic thorns positioned so that we construct the sphere as an organizing structure (a form of reification in Gestalt psychology). Both perception and cognition organize input into Gestalts or Wholes generalizing previously learned forms and configurations.

It is because we are familiar with pictures like the following that we impose an organization on the black objects and "see" a ball with spikes. You did not need to be thinking about "spiky balls"; the figure recruits its own interpretive frame. Similarly, brands and product categories impose structure on feature sets. The brand may be an associative net (what comes to mind when I say "McDonald's"), but that network is built on a scaffolding that we can model using R.

In a previous post, I outlined how product categories are defined by their unique tradeoff of strengths and weaknesses. In particular, the features associated with fast food restaurants fall along a continuum from easy to difficult to deliver. Speed is achievable. Quality food seems to be somewhat harder to serve. Brands within the product category can separate themselves by offering their own unique affordances.

My example was Subway Sandwich offering "fresh" fast food. Respondents were given a list of 8 attributes (seating, menu selection, ease of ordering, food preparation, taste, filling, healthy and fresh) and asked to check off which attributes Subway successfully delivered.


The item characteristic curves in the above figure were generated using the R package ltm (latent trait modeling). The statistical model comes from achievement testing, which is why the attribute is called an item and the x-axis is labeled as ability. Test items can be arrayed in terms of their difficulty, with the harder questions answered correctly only by the smarter students. Ability has been standardized, so the x-axis shows z-scores. The curves are logistic functions displaying how the probability of endorsing each of the eight attributes depends on each respondent's standing on the x-axis. Only 6 of the 8 attribute names are visible, with the labels for the lowest two curves, menu and seating, falling off the chart.

The zero point for ability is the average for the sample filling in the checklist. The S-curve for "fresh" has the highest elevation and is the farthest to the left. Reading up from ability equals zero, we can see that, on average, more than 80% are likely to tell us that Subway serves fresh food. You can see the gap between the four higher curves for fresh, healthy, filling and taste and the four lower curves for preparation, ordering, menu and seating. The lowest S-curve indicates that the average respondent would check seating with a likelihood of less than 20%.
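The shape of these S-curves can be sketched in a few lines of base R. The difficulty and discrimination values below are hypothetical stand-ins for illustration, not estimates from the Subway checklist; in practice the ltm package would estimate them from the observed response patterns.

```r
# Two-parameter logistic ICC: P(endorse | ability) = 1 / (1 + exp(-a * (theta - b)))
icc <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

# Hypothetical difficulty (b) values ordered from easiest (fresh) to
# hardest (seating); a is a common discrimination parameter
b <- c(fresh = -1.5, healthy = -1.1, filling = -0.8, taste = -0.5,
       preparation = 0.6, ordering = 0.9, menu = 1.3, seating = 1.6)
a <- 1.7

# Endorsement probabilities for the average respondent (theta = 0)
p_avg <- round(icc(0, a, b), 2)
```

With these made-up parameters, the average respondent endorses fresh with probability above 0.9 and seating below 0.1, reproducing the ordering described above.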

What is invariant is the checklist pattern. Those who love Subway might check all the attributes except for the bottom one or two. For example, families may be fine with everything but the available seating. On the other hand, those looking for a greasy hamburger might reluctantly endorse fresh or healthy and nothing else. As one moves from left to right along the Ability scale, the checklist is filled in with fresh, then healthy, and so on in an order reflecting the brand image. Fresh is easy for Subway. Healthy is only a little more difficult, but seating can be a problem. Moreover, an individual who is happy with the seating and the menu is likely to be satisfied with the taste and the freshness of the food. Response patterns follow an ordering that reflects the underlying scaffolding holding the brand concept together.

Latent trait or item response theory is a statistical model where we estimate the parameters of the equation specifying the relationship between the latent x-axis and response probability. R offers nonparametric alternatives such as KernSmoothIRT and princurve. Hastie's work on principal curves may be of particular interest since it comes from outside of achievement testing. A more general treatment of the same issues enables us to take a different perspective and see how observed responses are constrained by the underlying data generation process.

Branding is the unique spin added to a product category that has evolved out of repeated interactions between consumer needs and providers' skill at satisfying those needs at a profit. The data scientist can see the impact of that branding when we model consumer perceptions. Consumers are not scientists running comparative tests under standardized conditions. Moreover, inferences are made when experience is lacking. Our take-out-only customer feels comfortable commenting on seating, although they may have only glanced at it on their way in or out. It gets worse for ratings on scales and for attributes that the average consumer lacks the expertise to evaluate (e.g., credence attributes associated with quality, reliability or efficacy).

We often believe that product experience is self-evident and definitive when, in fact, it may be ambiguous and even seductive. Much of what we know about products, even products that we use, has been shaped by a latent structure learned from what we have heard and read. Even if the thorns are real, the scaffolding comes from others.

Wednesday, April 22, 2015

Conjoint Analysis and the Strange World of All Possible Feature Combinations

The choice modeler looks over the adjacent display of cheeses and sees the joint marginal effects of the dimensions spanning the feature space: milk source, type, origin, moisture content, added mold or bacteria, aging, salting, packaging, price, and much more. Literally, if products are feature bundles, then one needs to specify all the sources of variation generating so many different cheeses. Here are the cheeses from goats, sheep and cows. Some are local, and some are imported from different countries. In addition, we will require columns separating the hard and soft cheeses. The feature list can become quite long. In the end, one accounts for all the different cheeses with a feature taxonomy consisting of a large multidimensional space of all possible feature combinations. Every cheese falls into a single cell in the joint distribution, and the empty cells represent new product possibilities (unless the feature configuration is impossible).

The retailer, on the other hand, was probably thinking more of supply and demand when they filled this cooler with cheeses. It's an optimization problem that we can simplify as a tradeoff between losing customers because you do not have what they are looking for and losing money when the product spoils. Meanwhile, consumers have their own issues for they are buying for a reason and may infer a means to a desired end from individual features or complex combinations of transformed features. Neither the retailer nor the consumer is a naturalist seeking a feature taxonomy. In fact, except for the connoisseur, most consumers have very limited knowledge of any product category. We are simply not familiar with all the offerings nor could we name all the alternatives in the cheese cooler or the dog food aisle or the shelves filled with condensed soups. Instead, we rely on the physical or online displays to remind ourselves what is available, but even then, we do not consider every alternative or try to differentiate among all the products.

Thus, the conjoint world of all possible feature combinations is strange to a consumer who sees the products from a purposefully restricted perspective. The consumer categorizes products using goal-derived categories, for instance, restocking after running out of slices for your ham and Swiss sandwich. Thus, attention, categorization and preference are situated within the purchase context defined by goals and the purchase process including the physical product display (e.g., a deli counter with attendant is not the same as self-service selection of prepackaged products). Bartels and Johnson summarize this emerging view in their recent article "Connecting Cognition and Consumer Choice" (see Section 3 Learning and Constructing Value in Context).


Speaking of cheese (in particular, Brillat-Savarin cheese), we are reminded of the above quote popularized by the original Japanese Iron Chef TV show. Can it be this simple? I can tell you what is being discussed if you give me a "bag of words" and the R package topicmodels. R-bloggers shows how to recover the major cuisines from a list of ingredients from different recipes. My claim is that I learn a great deal by asking if you buy single wrapped slices of processed American cheese. As Charles de Gaulle quipped, "How can you govern a country which has 246 varieties of cheese?" One can start by identifying the latent goal structure that shapes awareness, familiarity and usage.

Much is revealed by learning what music you listen to, your familiarity with various providers in a product category, which brands of Scotch whisky you buy, or the food choices you make for breakfast. In each of those posts, the R package NMF was able to discover the underlying latent variables that could reproduce raw data with many columns and most rows containing only a few responses (e.g., Netflix ratings with viewers in the rows seeing only a small proportion of all the movies in the columns). Nonnegative matrix factorization (NMF), however, is only one method for uncovering the hidden forces that structure consumption activities. You are free to select any latent variable model that can accommodate such high-dimensional sparse data (e.g., the already mentioned topic modeling, the R package HDclassif, the R package bclust, and more on the way). My preference for NMF stems from its ease of use and successful application across a diverse range of marketing research data as reported in prior posts.
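The mechanics behind NMF can be illustrated without the package itself. The following base-R sketch runs the Lee-Seung multiplicative updates on a small made-up usage matrix with two obvious segments (the first three consumers use only the first three features, the last three only the last three); the data, the rank of two, and the iteration count are all hypothetical choices for illustration.

```r
set.seed(42)
# Toy usage matrix: 6 consumers x 6 features with two non-overlapping segments
V <- matrix(0, 6, 6)
V[1:3, 1:3] <- 1
V[4:6, 4:6] <- 1

# Lee-Seung multiplicative updates for V ~ W %*% H with W, H >= 0
k <- 2
W <- matrix(runif(6 * k), 6, k)
H <- matrix(runif(k * 6), k, 6)
for (i in 1:500) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}

# Each consumer loads almost entirely on one latent feature, so the
# rows of W act as a soft clustering that recovers the two segments
segment <- apply(W, 1, which.max)
```

The same decomposition on real checklist or usage data is what the NMF package automates, along with initialization strategies and rank selection.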

Unfortunately, in the strange world of all possible feature combinations, consumers are unable to apply the strategies that work so well in the marketplace. Given nothing other than hypothetical products described by lists of orthogonal features, what else can a respondent do but rely on the information provided?


Wednesday, April 15, 2015

Recommending Recommender Systems When Preferences Are Not Driven By Simple Features

Why does lifting out a slice make the pizza appear more appealing?
We can begin our discussion with the ultimate feature bundle - pizza toppings. Technically, a menu would only need to list all the toppings and allow the customers to build their own pizza. According to utility maximization, choice is a simple computation with the appeal of any pizza calculated as a function of the utilities associated with each ingredient. This is the conjoint assumption that the attraction to the final product can be reduced to the values of its features. Preference resides in the features, and the appeal of the product is derived from its particular configuration of feature levels. 
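Under that assumption, the computation is trivially additive. The part-worth utilities below are invented for illustration; a conjoint study would estimate them from choice data.

```r
# Hypothetical part-worth utilities for individual toppings
partworth <- c(mozzarella = 1.0, pepperoni = 0.8, mushroom = 0.3,
               olives = -0.2, anchovies = -1.0)

# Under strict utility maximization, a pizza's appeal is just the
# sum of the part-worths of its toppings
bundle_utility <- function(toppings) sum(partworth[toppings])

bundle_utility(c("mozzarella", "pepperoni"))            # 1.8
bundle_utility(c("mozzarella", "olives", "anchovies"))  # about -0.2
```

Everything that follows in this post questions whether consumers actually evaluate pizzas this way.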

Yet, pizza menu design is an active area of discussion with many websites offering advice. In practice, listing ingredients does not generate the sales that are produced by the above photo or even a series of suggestive names with associated toppings. Perhaps this is why we see the same classic combinations offered around the world, as shown below in this menu from Bali.


To be clear, a vegetarian does not want ham on their pizza. Ingredients do matter, but what is the learning process? Does preference formation begin with taste tests of the separate toppings heated in a very hot oven? "Oh, I like the taste of roasted black olives. Let me try that on my pizza." No, we learn what we like by eating pizzas with topping combinations that appear on menus because others before us have purchased such configurations in sufficient quantity and at a profitable price for the restaurant. Moreover, we should not forget that we learn what we like in the company of others who are more than happy to tell us what they like and that we should like it too.

The expression "what grows together goes together" suggests that tastes are acquired initially by pairing what is locally available in abundance. It is the entire package that attracts us, which explains why the pizza seems more appealing in the above photo. If we were of an experimental mindset, we might systematically vary the toppings one pizza at a time and reach some definitive conclusions concerning our tastes. However, it is more likely that consumers invent or repeat a rationale for their behavior based on minimal samplings. That is, consumer inference may function more like superstition than science, less like Galileo and more like an establishment with other interests to be served. At least we should consider the potential impact of the preference formation process and keep our statistical models open to the possibility that consumers go beyond simple features when they represent products in thought and memory (see Connecting Cognition and Consumer Choice).

Simply stated, you get what you ask for. Features become the focus when all you have are choice sets where the alternatives are the cells of an optimal experimental design. Clearly, the features are changing over repeated choice sets and the consumer responds to such manipulations. If we are careful and mimic the marketplace, our choice modeling will be productive. I have tried to illustrate in previous posts how R can generate choice designs and provide individual-level hierarchical Bayes estimates along with some warnings about overgeneralizing. For example, choice modeling works just fine when the purchase context is label comparisons among a few different products sitting on the retail shelf.

Instead, what if we showed actual pizza menus from a large number of different restaurants? Does this sound familiar? What if we replaced pizzas with movies or songs? This is the realm of recommender systems, where the purchase context is filled with lots of alternatives arrayed across a landscape of highly correlated observed features generated by a much smaller set of latent features. We have entered the world of representation learning. Choice modeling delivers when the purchase context requires that we compare products and trade off features. Recommendation systems, on the other hand, thrive when there is more available than any one person can experience (e.g., the fragmented music market).

To be clear, I am using the term "recommendation systems" because they are familiar and we all use them when we shop online or search the web. Actually, any representational system that estimates missing values by adding latent variables will do nicely (see David Blei for a complete review). However, since reading a description of the computation behind the winning Netflix Prize entry, I have relied on matrix factorization as a less demanding approach that still yields substantial insight with marketing research data. In particular, the NMF package in R offers such an easy-to-use interface to nonnegative matrix factorization that I have used it over and over again with repeated success. You will find a list of such applications at the end of my post on Brand and Product Category Representation.
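The Netflix-style computation can be sketched in base R without any package: factor a small ratings matrix using only its observed cells, then read predictions for the never-seen movies off the reconstruction. The ratings, the rank of two, and the learning rate are all made-up illustration values.

```r
set.seed(1)
# Tiny ratings matrix: rows = viewers, columns = movies, NA = not seen
ratings <- matrix(c( 5,  4, NA,  1,
                     4,  5,  1, NA,
                    NA,  1,  5,  4,
                     1, NA,  4,  5), 4, 4, byrow = TRUE)

# Rank-2 factorization fit by stochastic gradient descent on observed cells only
k <- 2
U <- matrix(runif(4 * k), 4, k)   # viewer factors
M <- matrix(runif(4 * k), 4, k)   # movie factors
obs <- which(!is.na(ratings), arr.ind = TRUE)
for (epoch in 1:2000) {
  for (r in sample(nrow(obs))) {
    i <- obs[r, 1]; j <- obs[r, 2]
    err <- ratings[i, j] - sum(U[i, ] * M[j, ])
    U[i, ] <- U[i, ] + 0.01 * err * M[j, ]
    M[j, ] <- M[j, ] + 0.01 * err * U[i, ]
  }
}

# The reconstruction fills in the NA cells with predicted ratings
pred <- U %*% t(M)
```

A production system adds regularization and bias terms, but the core idea is the same: a handful of latent features, not the full product taxonomy, carries the prediction.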


Friday, April 10, 2015

Modeling Categories with Breadth and Depth

Religion is a categorical variable with followers differentiated by their degree of devotion. Liberals and conservatives check their respective boxes when surveyed, although moderates from each group sometimes seem more alike than their more extreme compatriots. All Smartphone users might be classified as belonging to the same segment, yet the infrequent user is distinct from the intense user who cannot take their eyes off their screen. Both of us have the flu, but you are really sick. My neighbor and I belong to a cluster called dog-owners. However, my dog is a pet on a strict allowance and theirs is a member of the family with no apparent limit on expenditures. There seems to be more structure underlying such categorizations than is expressed by a zero-one indicator of membership.

Plutchik's wheel of emotion offers a concrete example illustrating the breadth and depth of affective categories spanning the two dimensions defined by positive vs. negative and active vs. passive. The concentric circles in the diagram below show the depth of the emotion with the most intense toward the center. Loathing, disgust and boredom suggest an increasing activation of a similar type of negative response between contempt and remorse. Breadth is varied as one moves around the circle traveling through the entire range of emotions. When we speak of opposites such as flight (indicated in green as apprehension, fear and terror) or fight (the red emotions of annoyance, anger and rage), we are relating our personal experiences with two contrasting categories of feelings and behaviors. Yet, there is more to this categorization than is expressed by a set of exhaustive and mutually exclusive boxes, even if the boxes are called latent classes.



The R statistical language approaches such structured categories from a number of perspectives. I have written two posts on archetypal analysis. The first focused on the R package archetypes and demonstrated how to use those functions to model customer heterogeneity by identifying the extremes and representing everyone else as convex combinations of those end points. The second post argued that we tend to think in terms of contrasting ideals (e.g., liberal versus conservative) so that we perceive entities to be more different in juxtaposition than they would appear on their own.

Latent variable mixture models provide another approach to the modeling of unobserved constructs with both type and intensity, at least for those interested in psychometrics in R. The intensity measure for achievement tests is item difficulty, and differential item functioning (DIF) is the term used by psychometricians to describe items with different difficulty parameters in different groups of test takers. The same reasoning applies to customers who belong to different segments (types) seeking different features from the same products (preference intensity). These are all mixture models in the sense that we cannot estimate one set of parameters for the entire sample because the data are a mixture of hidden groups with different parameters. R packages with the prefix or suffix "mix" in their title (e.g., mixRasch or flexmix) suggest such mixture modeling.
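The mixture idea can be shown in miniature with base R, using simulated data rather than any particular package: two hidden segments respond with different means, and an EM loop recovers both the segment parameters and each respondent's posterior membership. The segment means, sizes and fixed unit variance are hypothetical simplifications.

```r
set.seed(7)
# Two hidden segments with different means on a single preference measure
x <- c(rnorm(100, mean = -2), rnorm(100, mean = 2))

# EM for a two-component Gaussian mixture (unit variance fixed for brevity)
mu <- c(-0.5, 0.5)   # starting values for the segment means
p  <- 0.5            # mixing weight of segment 1
for (iter in 1:100) {
  # E-step: posterior probability that each respondent belongs to segment 2
  d1 <- p * dnorm(x, mu[1])
  d2 <- (1 - p) * dnorm(x, mu[2])
  z  <- d2 / (d1 + d2)
  # M-step: update the mixing weight and the segment means
  p  <- mean(1 - z)
  mu <- c(sum((1 - z) * x) / sum(1 - z), sum(z * x) / sum(z))
}
```

Packages such as flexmix generalize this loop to regression mixtures and model-based segment selection, but the estimate-membership-then-update-parameters cycle is the same.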

R can also capture the underlying organization shaping categories through matrix factorization. This blog is filled with posts demonstrating how the R package NMF (nonnegative matrix factorization) easily decomposes a data matrix into the product of two components: (1) something that looks like factor loadings of the measures in the columns and (2) something similar to a soft or fuzzy clustering of respondents in the rows. Both these components will be block diagonal matrices when we have the type of separation we have been discussing. You can find examples of such matrix decompositions in a listing of previous posts at the end of my discussion of brand and product representation.

Consumer segments are structured by their use of common products and features in order to derive similar benefits. This is not an all-or-none process but a matter of degree specified by depth (e.g., usage frequency) and breadth (e.g., the variety and extent of feature usage). You can select almost any product category, and at one end you will find heavy users doing everything that can possibly be done with the product. As usage decreases, it falls off in clumps with clusters of features no longer wanted or needed. These are the latent features of NMF that simultaneously bind together consumers and the features they use. For product categories such as movies or music, the same process applies but now the columns are films seen or recordings heard. All of this may sound familiar to those of you who have studied recommendation systems or topic modeling, both of which can be run with NMF.