Friday, March 25, 2016

Choice Modeling with Features Defined by Consumers and Not Researchers

Choice modeling begins with a researcher "deciding on what attributes or levels fully describe the good or service." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.

As I have argued in the previous post, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?

With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (see slide #2 from Yann LeCun's 2013 tutorial with video here). From this perspective, choice modeling is a shallower form of information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (the R package support.CEs aids in this process and you can find R code for hierarchical Bayes using bayesm in this link).

Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.
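The information integration described above can be sketched in a few lines. This is a minimal Python illustration, not the estimation done by bayesm: the part-worths are simply the two values from the cereal example, the box names are hypothetical, and the multinomial logit response rule is an assumption about how summed utilities turn into a choice.

```python
import math

# Part-worth utilities on a common value scale, taken from the example above.
PART_WORTHS = {"low_fat": 2.0, "low_sodium": 1.0}


def total_utility(features, part_worths):
    """Integrate information by summing the part-worths of the features present."""
    return sum(part_worths[feature] for feature in features)


def choice_probabilities(alternatives, part_worths):
    """Assumed response rule: multinomial logit (softmax) over total utilities."""
    utilities = {
        name: total_utility(features, part_worths)
        for name, features in alternatives.items()
    }
    top = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {name: math.exp(u - top) for name, u in utilities.items()}
    denom = sum(exp_u.values())
    return {name: value / denom for name, value in exp_u.items()}


# Hypothetical choice set: two boxes that differ on a single feature each.
cereals = {"box_a": ["low_fat"], "box_b": ["low_sodium"]}
probs = choice_probabilities(cereals, PART_WORTHS)
best = max(probs, key=probs.get)
print(best, probs)
```

With a one-unit utility gap, the low-fat box is the more likely choice but not a certain one, which is why repeated choice sets are needed to pin down individual part-worths.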

Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes.

R does offer an interface to deep learning algorithms. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).
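To make "expanding the basis" concrete, here is a small Python sketch of a design-matrix row with a log-price term and a quadratic term. The feature names (price, fiber_grams) and the three product profiles are made up for illustration. Note that every column is still chosen by the researcher, which is exactly the limitation discussed above.

```python
import math


def expand_basis(price, fiber_grams):
    """Return one regression row: intercept, raw features, and transformations."""
    return [
        1.0,               # intercept
        price,             # original feature
        math.log(price),   # log of price
        fiber_grams,       # original feature
        fiber_grams ** 2,  # polynomial (quadratic) expansion
    ]


# Hypothetical product profiles: (price in dollars, grams of fiber per serving).
profiles = [(2.99, 3.0), (4.49, 5.0), (3.79, 0.0)]
design_matrix = [expand_basis(price, fiber) for price, fiber in profiles]

for row in design_matrix:
    print([round(value, 3) for value in row])
```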

Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variations on flakes, crunch or hot cereals). The joint blocks define the hidden layers, simultaneously clustering respondents and typing products.

Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in a post on product category representation. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a data matrix reconstructed from parts that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.
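For readers who want to see the mechanics, the following is a minimal Python sketch of NMF using the standard multiplicative update rules for squared error. It is not the code from the linked posts or from any R package. The 4 x 4 ratings matrix is made up, and the rank, iteration count, and seed are arbitrary assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def nmf(v, rank, iterations=500, seed=0, eps=1e-9):
    """Factor nonnegative v (rows x cols) into w (rows x rank) and h (rank x cols)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random starting values keep every update well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # Update h: h * (w'v) / (w'wh)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # Update w: w * (vh') / (whh')
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


def squared_error(v, w, h):
    """Sum of squared differences between v and its reconstruction w times h."""
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


# Made-up interest ratings: rows are respondents, columns are cereals.
# Respondents 0-1 like the first two cereals; respondents 2-3 like the last two.
ratings = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
]
w, h = nmf(ratings, rank=2)
error = squared_error(ratings, w, h)
print("respondent weights:", [[round(x, 2) for x in row] for row in w])
print("cereal weights:", [[round(x, 2) for x in row] for row in h])
```

On a block-structured matrix like this one, we would expect each latent feature to load on one group of respondents in w and on the matching group of cereals in h, which is the "dual factor analysis" reading given above.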

We might expect to find something similar to what Jonathan Gutman reported from a qualitative study using a means-end analysis. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the cereals with crunch loading on that hidden feature with all the respondents wanting crunch with higher weights on the same hidden feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.

Monday, March 21, 2016

Understanding Statistical Models Through the Datasets They Seek to Explain: Choice Modeling vs. Neural Networks

R may be the lingua franca, yet many of the packages within the R library seem to be written in different languages. We can follow the R code because we know how to program but still feel that we have missed something in the translation.

R provides an open environment for code from different communities, each with its own set of exemplars, where the term "exemplar" has been borrowed from Thomas Kuhn's work on normal science. You need only to examine the datasets that each R package includes to illustrate its capabilities in order to understand the diversity of paradigms spanned. As an example, the datasets from the Clustering and Finite Mixture Task View demonstrate the dependence of the statistical models on the data to be analyzed. Those seeking to identify communities in social networks might be using similar terms as those trying to recognize objects in visual images, yet the different referents (exemplars) change the meanings of those terms.

Thinking in Terms of Causes and Effects

Of course, there are exceptions. For instance, regression models can be easily understood across applications as the "pulling of levers," especially for those of us seeking to intervene and change behavior (e.g., marketing research). Increased spending on advertising yields greater awareness and generates more sales; that is, pulling the ad spending lever raises revenue (see the R package CausalImpact). The same reasoning underlies choice modeling with features as levers and purchase as the effect (see the R package bayesm).

The above picture captures this mechanistic "pulling the lever" that dominates much of our thinking about the marketing mix. The exemplar "explains" through analogy. You might prefer "adjusting the dials" as an updated version, but the paradigm remains cause-and-effect with each cause separable and under the control of the marketer. Is this not what we mean by the relative contribution of predictors? Each independent variable in a regression equation has its own unique effect on the outcome. We pull each lever a distance of one standard deviation (the beta weight), sum the changes on the outcome (sometimes these betas are squared before adding), and then divide by the total.
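The "sum and divide by the total" calculation can be written out as a short Python sketch. The three beta weights are hypothetical, and taking absolute values so that a negative beta still counts as a contribution is an assumption of this sketch. The whole exercise presumes separable causes, which is the paradigm under discussion rather than a recommendation.

```python
def relative_contributions(betas, squared=False):
    """Share of the total attributed to each predictor from standardized betas."""
    # Assumption: use |beta| (or beta squared) so negative weights still count.
    sizes = {name: (b ** 2 if squared else abs(b)) for name, b in betas.items()}
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


# Hypothetical standardized regression weights for three marketing levers.
betas = {"ad_spend": 0.5, "price": -0.3, "distribution": 0.2}
shares = relative_contributions(betas)
squared_shares = relative_contributions(betas, squared=True)
print(shares)
print(squared_shares)
```

Squaring before adding gives the largest lever a bigger share than the unsquared version does, so the two conventions can tell noticeably different stories about the same regression.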

The Challenge from Neural Networks

So, how do we make sense of neural networks and deep learning? Is the R package neuralnet simply another method for curve fitting or estimating the impact of features? Geoffrey Hinton might think differently. The Intro Video for Coursera's Neural Networks for Machine Learning offers a different exemplar - handwritten digit recognition. If he is curve fitting, the features are not given but extracted so that learning is possible (i.e., the features are not obvious but constructed from the input to solve the task at hand). The first chapter of Michael Nielsen's online book, Using Neural Nets to Recognize Handwritten Digits, provides the details. Isabelle Guyon's pattern recognition course adds an animated gif displaying visual perception as an active process.

On the other hand, a choice model begins with the researcher deciding what features should be varied. The product space is partitioned and presented as structured feature lists. What alternative does the consumer have, except to respond to variations in the feature levels? I attend to price because you keep changing the price. Wider ranges and greater variation only focus my attention. However, in real settings, the shelves and the computer screens are filled with competing products waiting for consumers to define their own differentiating features. Smart Watches from Google Shopping provides a clear illustration of the divergence of purchase processes in the real world and in the laboratory.

To be clear, when the choice model and the neural network speak of input, they are referring to two very different things. The exemplars from choice modeling are deciding how best to commute and comparing a few offers for the same product or service. This works when you are choosing between two cans of chicken soup by reading the ingredients on their labels. It does not describe how one selects a cheese from the huge assortment found in many stores.

Neural networks take a different view of the task. In less than five minutes Hinton's video provides the exemplar for representation learning. Input enters as it does in real settings. Features that successfully differentiate among the digits are learned over time. We see that learning in the video when the neural net generates its own handwritten digits for the numbers 2 and 8. It is not uncommon to write down a number that later we or others have difficulty reading. Legibility is valued so that we can say that an easier to read "2" is preferred over a "2" that is harder to identify. But what makes one "2" a better two than another "2" takes some training, as machine learning teaches us.

We are all accomplished at number recognition and forget how much time and effort it took to reach this level of understanding (unless we know young children in the middle of the learning process). What year is MCMXCIX? The letters are important, but so are their relative positions (e.g. X=10 and IX=9 in the year 1999). We are not pulling levers any more, at least not until the features have been constructed. What are those features in typical choice situations? What you want to eat for breakfast, lunch or dinner (unless you snack instead) often depends on your location, available time and money, future dining plans, motivation for eating, and who else is present (context-aware recommender systems).
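The Roman numeral example shows why position matters: the same letter adds or subtracts depending on what follows it. A minimal Python sketch of that rule, written only to illustrate the point and assuming a well-formed numeral as input:

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral, subtracting a letter when a larger one follows it."""
    total = 0
    for position, letter in enumerate(numeral):
        value = ROMAN_VALUES[letter]
        has_next = position + 1 < len(numeral)
        next_value = ROMAN_VALUES[numeral[position + 1]] if has_next else 0
        # A smaller value before a larger one is subtracted (the I in IX, the C in CM).
        total += -value if value < next_value else value
    return total


year = roman_to_int("MCMXCIX")
print(year)  # 1999
```

A model that only counted letters, treating each as a separate lever, would read IX and XI as the same number. The feature that matters is constructed from the arrangement.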

Adopting a different perspective, our choice modeler sees the world as well-defined and decomposable into separate factors that can be varied systematically according to some experimental design. Under such constraints the consumer behaves as the model predicts (a self-fulfilling prophecy?). Meanwhile, in the real world, consumers struggle to learn a product representation that makes choice possible.

Thinking Outside the Choice Modeling Box

The features we learn may be relative to the competitive set, which is why adding a more expensive alternative makes what is now the mid-priced option appear less expensive. Situation plays an important role, for the movie I view when alone is not the movie I watch with my kids. Framing has an impact, which is why advertising tries to convince you that an expensive purchase is a gift that you give to yourself. Moreover, we cannot forget intended usage, for that Smartphone is a camera, a GPS, and I believe you get the point. We may have many more potential features than are included in our choice design.

It may be the case that the final step before purchase can be described as a tradeoff among a small set of features varying over only a few alternatives in our consideration set. If we can mimic that terminal stage with a choice model, we might have a good chance to learn something about the marketplace. How did the consumer get to that last choice point? Why these features and those alternative products or services? In order to answer such questions, we will need to look outside the choice modeling box.