Wednesday, April 15, 2015

Recommending Recommender Systems When Preferences Are Not Driven By Simple Features

Why does lifting out a slice make the pizza appear more appealing?
We can begin our discussion with the ultimate feature bundle: pizza toppings. Technically, a menu would only need to list all the toppings and allow the customers to build their own pizza. According to utility maximization, choice is a simple computation, with the appeal of any pizza calculated as a function of the utilities associated with each ingredient. This is the conjoint assumption: the attraction to the final product can be reduced to the values of its features. Preference resides in the features, and the appeal of the product is derived from its particular configuration of feature levels.
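
To make the conjoint assumption concrete, here is a minimal Python sketch of an additive (main-effects) utility. The topping names, the part-worth numbers, the menu, and the helper `pizza_utility` are all invented for illustration; nothing here is estimated from data.

```python
# Hypothetical part-worths for one customer; the numbers are invented for illustration.
PART_WORTHS = {
    "mozzarella": 1.0,
    "tomato sauce": 0.8,
    "mushrooms": 0.6,
    "black olives": 0.3,
    "ham": -0.5,  # e.g., a customer who would rather skip the ham
}


def pizza_utility(toppings, part_worths):
    """Additive (main-effects) utility: the sum of the part-worths of the toppings."""
    return sum(part_worths[t] for t in toppings)


# A hypothetical menu: under this model each named pizza is just a bundle of toppings.
menu = {
    "Margherita": ["mozzarella", "tomato sauce"],
    "Vegetariana": ["mozzarella", "tomato sauce", "mushrooms", "black olives"],
    "Prosciutto": ["mozzarella", "tomato sauce", "ham"],
}

utilities = {name: pizza_utility(tops, PART_WORTHS) for name, tops in menu.items()}
best = max(utilities, key=utilities.get)  # utility maximization: pick the top score

for name, u in sorted(utilities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {u:.1f}")
print("Predicted choice:", best)
```

Under this model the name on the menu and the photo contribute nothing: two pizzas with the same toppings always receive the same score. That is exactly the assumption questioned in the rest of this post.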

Yet, pizza menu design is an active area of discussion with many websites offering advice. In practice, listing ingredients does not generate the sales that are produced by the above photo or even a series of suggestive names with associated toppings. Perhaps this is why we see the same classic combinations offered around the world, as shown below in this menu from Bali.

I am not denying that a vegetarian does not want ham on their pizza. Ingredients do matter, but what is the learning process? Does preference formation begin with taste tests of the separate toppings heated in a very hot oven? "Oh, I like the taste of roasted black olives. Let me try that on my pizza." No, we learn what we like by eating pizzas with topping combinations that appear on menus because others before have purchased such configurations in sufficient quantity and at a profitable price for the restaurant. Moreover, we should not forget that we learn what we like in the company of others who are more than happy to tell us what they like and that we should like it too.

The expression "what grows together goes together" suggests that tastes are acquired initially by pairing what is locally available in abundance. It is the entire package that attracts us, which explains why pizza seems more appealing in the above photo. If we were of an experimental mindset, we might systematically vary the toppings one pizza at a time and reach some definitive conclusions concerning our tastes. However, it is more likely that consumers invent or repeat a rationale for their behavior based on minimal samplings. That is, consumer inference may function more like superstition than science, less like Galileo and more like an establishment with other interests to be served. At the least, we should consider the potential impact of the preference formation process and keep our statistical models open to the possibility that consumers go beyond simple features when they represent products in thought and memory (see Connecting Cognition and Consumer Choice).

Simply stated, you get what you ask for. Features become the focus when all you have are choice sets where the alternatives are the cells of an optimal experimental design. Clearly, the features are changing over repeated choice sets, and the consumer responds to such manipulations. If we are careful and mimic the marketplace, our choice modeling will be productive. I have tried to illustrate in previous posts how R can generate choice designs and provide individual-level hierarchical Bayes estimates, along with some warnings about overgeneralizing. For example, choice modeling works just fine when the purchase context is label comparisons among a few different products sitting on the retail shelf.

Instead, what if we showed actual pizza menus from a large number of different restaurants? Does this sound familiar? What if we replace pizzas with movies or songs? This is the realm of recommender systems, where the purchase context is filled with lots of alternatives arrayed across a landscape of highly correlated observed features generated by a much smaller set of latent features. We have entered the world of representation learning. Choice modeling delivers when the purchase context requires that we compare products and trade off features. Recommendation systems, on the other hand, thrive when there is more available than any one person can experience (e.g., the fragmented music market).

To be clear, I am using the term "recommendation systems" because such systems are familiar: we all use them when we shop online or search the web. Actually, any representational system that estimates missing values by adding latent variables will do nicely (see David Blei for a complete review). However, since reading a description of the computation behind the winning entry for the Netflix Prize, I have relied on matrix factorization as a less demanding approach that still yields substantial insight with marketing research data. In particular, the NMF package in R offers such an easy-to-use interface to nonnegative matrix factorization that I have used it over and over again with repeated success. You will find a list of such applications at the end of my post on Brand and Product Category Representation.
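
For readers who want to see what nonnegative matrix factorization does before opening the R package, here is a small, self-contained Python sketch. It is a stand-in for illustration only, not the NMF package's interface. It uses the standard multiplicative update rules for squared error (one common way to fit NMF), and the ratings matrix `V`, the rank of 2, and the helper names `nmf` and `frobenius_error` are all assumptions made up for this example.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) times H (rank x m).

    Uses multiplicative updates, which keep every entry of W and H nonnegative
    as long as the starting values are positive.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W'V) / (W'WH)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (VH') / (WHH')
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    """Root of the summed squared differences between V and its reconstruction WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5


# Made-up ratings: 6 respondents (rows) by 5 pizza features (columns).
# The first three rows favor the first two features, the last three favor the rest.
V = [
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [5, 5, 1, 0, 0],
    [0, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
    [1, 0, 5, 5, 5],
]

W, H = nmf(V, rank=2)
print("fit error:", round(frobenius_error(V, W, H), 3))
for row in H:  # each row of H is one latent feature expressed over the 5 columns
    print([round(h, 2) for h in row])
```

The rows of `H` play the role of latent features (which observed features go together), and the rows of `W` are the respondents' scores on those latent features. Because nothing is ever negative, each respondent's ratings are approximated by adding latent features together, never by subtracting one from another.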


  1. Hi Joel: Your blog is always interesting. Thanks for it. I just have a question. I did a little work a few years back with matrix factorization recommender systems in R. The short of it was that I took Andrew Ng's class and then found that estimating recommender systems in R was not trivial on medium-to-large MovieLens data sets. Results were heavily dependent on which algorithm was used: some algorithms were quite slow, some couldn't arrive at any solution, and some gave different answers.

    So then I tried to see what the latest and greatest methods were (the literature is enormous) and whether there was a way to implement an SVT (singular value thresholding) type of approach. I plowed through the literature but, in the end, got stuck working on the SVT implementation. At that time, Hastie, Tibshirani et al. submitted a package (name escapes me at the moment) that I think sort of does what I had dreamed of doing, so I stopped and left it hanging.

    But my question is: is there some upside to doing NON-NEGATIVE matrix factorization? Does the restriction make sense in the context of recommendation systems? I was never clear on why people tended to use that approach specifically, and the Netflix algorithm literature that I went through generally didn't talk about the non-negative approach. Maybe the algorithm is easier? I have no idea and was just curious about that. Thanks.

    1. One can think of nonnegative matrix factorization (NMF) as principal component analysis or singular value decomposition (SVD) with all the loadings and scores forced to be nonnegative. Nothing ever gets subtracted, so there are no bipolar latent variables for the features and no negative latent variable scores for the respondents. The whole is the sum of the parts, which yields a set of latent variables that does not need rotation to simple structure because the loadings are already sparse. All this depends on separation with different consumer communities focusing on different features (e.g., thin crust seekers do not want all the toppings desired by deep dish lovers). The R package NMF does all the work in a couple of lines of code and helps with the interpretation by including heatmaps. Given all the research that you have already done, you should find my links to NMF posts easy to follow and well worth your time.