
Tuesday, May 28, 2013

Why doesn't R have a MaxDiff package?

Almost every year someone asks whether R has a package for running the MaxDiff procedure sold by Sawtooth.  One such inquiry recently received a reply with a link showing in some detail the R code needed to generate a balanced incomplete block design, input the best-worst choice data, and use the mlogit R package to estimate the parameters.  Although Sawtooth uses hierarchical Bayes and its own peculiar program for showing individuals an unbalanced subset of the block design, the link provides a considerable amount of helpful R code that takes one some distance toward understanding how MaxDiff works.  For example, it makes clear that MaxDiff relies on a "trick" to combine the best and worst choice data into a single analysis.  Sawtooth assumes that there is one set of preference parameters underlying the choice of both the best and the worst alternatives.  Thus, we can estimate one common set of parameters for the combined best and worst choices if we multiply all the independent variables by -1 whenever the dependent variable is the worst choice.
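
To see the trick in code, here is a minimal sketch, assuming a hypothetical long-format data frame bw with one row per alternative per choice task (the column names, coding, and design are my own illustration, not Sawtooth's or the linked post's):

  # bw (hypothetical): task = id for each best or worst choice occasion,
  #   alt = position of the alternative within the task,
  #   chosen = TRUE for the alternative picked in that task,
  #   is_worst = TRUE when the task records the worst rather than the best pick,
  #   X1..X3 = coded item indicators
  library(mlogit)

  # flip the sign of the item indicators for the worst-choice tasks
  items <- grep("^X", names(bw), value = TRUE)
  bw[bw$is_worst, items] <- -bw[bw$is_worst, items]

  # estimate one common set of item utilities for best and worst choices
  bw_ml <- mlogit.data(bw, choice = "chosen", shape = "long",
                       alt.var = "alt", chid.var = "task")
  fit <- mlogit(chosen ~ X1 + X2 + X3 | 0, data = bw_ml)
  summary(fit)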

This assumption is questioned in the upcoming Advanced Research Techniques Forum.  One of the papers, Models of Sequential Evaluation in Best-Worst Choice Tasks, raises serious concerns.  The authors present convincing evidence that respondents make sequential choices and that usually the worst alternative is selected first and then the best selected from the remaining alternatives.  The conclusion is that we need different best and worst choice parameters, and as a result, Sawtooth's MaxDiff analysis can be misleading. Moreover, the authors note that their findings support the idea of preference construction.  That is, pre-existing preferences are not revealed in the choice task, but instead, preferences are created in order to solve the choice task.  The paper ends with the following sentence: "We think that it is very important in the model development stage to stop and think about whether the tool that is used to collect data on preferences makes sense and is consistent with anticipated decision processes on actual purchasing decisions."

So we have another reason not to add a MaxDiff package to the R library.  However, I do not believe that the paper takes its argument to its logical conclusion.  The example from their analysis is a study of hair concerns.  Respondents were asked to indicate the concern that mattered most and the concern that mattered least in choice sets of 5 alternatives drawn from a complete list of 15 concerns.  For example, which of the following items are you the most and least concerned about?

  • My hair is coming out more than it used to
  • My graying hair is unflattering
  • My hair is dry
  • I have unruly, unmanageable hair
  • My hair is stiff and resistant

Now I ask the reader, when does anyone need to make such a choice in the real world?  The authors admonish us to "stop and think about whether the tool used to collect data on preferences makes sense."  What if my hair is unmanageable because it is dry?  Trade-offs work when they are forced upon us by real constraints in the world.  When trade-offs are artificial, they persuade us to attend to features and give weight to information that we would not have considered during the actual purchase process.

Now, if preferences were well formed and established, then perhaps such an unrealistic choice task would still reveal my "true" preferences.  But if preferences are constructed in order to solve the choice task at hand, then little that I learn in this study can be generalized to the marketplace.  Context and order effects are robust and common in choice modeling because we do not store fully formed preferences for the thousands of features that we will encounter given all that we purchase over time.  Instead, we use the choice task to assist us in the construction of preference for the particular situation in which we find ourselves (situated cognition).  Only when the choice task mimics the real world can we learn anything of value.  Even then, we may have problems because experiments and data collection are always obtrusive.

A previous post, Incorporating Preference Construction into the Choice Modeling Process, provides a more detailed discussion of these issues. 

Monday, May 6, 2013

Incomplete Data by Design: Bringing Machine Learning to Marketing Research

Survey research deals with the problem of question wording by always asking the same question.  Thus, the Gallup Daily Tracking is filled with examples of moving averages for the exact same question asked precisely the same way every day.  The concern is that even small changes in the wording can have an impact on how people perceive and respond to the question.  The question itself becomes the construct and our sole focus.

Yet, such a practice severely limits how much marketing research can learn about its customers.  Think of all the questions that a service provider, such as a mobile phone company, would want to ask its customers about both their personal and work-related usage:  voice, text, and data performance, reliability, quality, speed, internet, email, product features, cost, billing, support, customer service, bundling, multitasking, upgrades, promotions, and much more.  Moreover, if we want actionable findings, we must "drill" down into each of these categories seeking information about specific attributes and experiences. Unfortunately, we never reach the level of specificity needed for action.  The respondent burden quickly becomes excessive because we believe that every respondent must be shown every item.  Our fallback position is a reliance on summary evaluative judgments, for example, satisfaction ratings asked at a level of abstraction far removed from actual customer experiences.  We find the same constraints in many customer surveys, whether those questionnaires ask about usage, performance, perceptions, or benefits.

What if we did not require every respondent to complete every item?  Obviously, we could no longer compare respondents question by question.  This would be a problem only if each question were unique and irreplaceable.  But few believe this to be the case.  Most of us accept some version of latent variable modeling and hold that answers to questions are observed or manifest indicators of a smaller number of latent constructs.  In educational testing, for example, we have an item pool from which we can sample individual items.  Actually, with computer adaptive testing one can tailor the item selection process to match the respondent's ability level.  If we have many items measuring the same construct, then we are free to sample items as we would respondents.  Generalizability theory is a formal expression of this concept.  The construct should be our focus.  The individual ratings serve a secondary role as indicators.

In a previous post on reducing respondent burden, I outlined an approach using item response theory as the basis for item sampling.  If you recall, I used a short airline satisfaction questionnaire with 12 items as my example.  The ratings were the usual summary evaluative judgments measuring performance on the typical five-point scale with only the endpoints labeled with the words "poor" and "excellent."  As I showed in another post, such rating scales tend to yield a strong first principal component or halo effect.  When the ratings are context-free, respondents can answer without needing to recall any encounter or usage occasion.  However, this is precisely what we are trying to avoid by using ratings of specific experiences or needs in the actual usage context.

For example, I might be reluctant to ask you to rate the importance of call quality without specifying explicit criteria because it would require far too much respondent interpretation.  Without specificity, the respondent is forced to "fill in the blanks," and the researcher has no idea what the respondent was thinking when answering the question.  Instead, I would ask about dropping calls or needing to repeat what was said to a client while traveling on business.  We are not as likely to find a single dimension underlying ratings at such high levels of specificity.  Multidimensional item response theory (the R package mirt) is one alternative, but I would like to suggest the R package softImpute as a flexible and comprehensive solution to the matrix completion problem.  Perhaps we should visit the machine learning lab to learn more about matrix completion and recommender systems.

The Matrix Completion Problem and Recommender Systems

In 2006 Netflix released a data set with almost 500,000 respondents (rows) and almost 18,000 movies (columns).  The entries were movie ratings on a scale from 1 to 5 stars, although close to 99% of the matrix was empty.  The challenge was to complete the matrix, or fill in the missing values, since an individual rates only a small proportion of all the movies.  That is, in order to make personal recommendations for an individual respondent, we need to estimate what their ratings would have been for all the movies they did not rate.  One solution was to turn to matrix factorization and take advantage of the structure underlying both movies and viewers' preferences.  Matrix completion can be run in R using the new package softImpute.
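
To make the idea concrete, here is a toy sketch with simulated data (not the Netflix set): we build a low-rank ratings matrix, blank out 80% of the entries, and let softImpute fill them back in.

  # simulate a rank-3 "ratings" matrix for 200 viewers and 50 movies
  library(softImpute)
  set.seed(1)
  n <- 200; p <- 50; r <- 3
  ratings <- matrix(rnorm(n * r), n, r) %*% matrix(rnorm(r * p), r, p)
  observed <- ratings
  observed[sample(length(observed), 0.8 * length(observed))] <- NA   # 80% missing

  fit <- softImpute(observed, rank.max = r, lambda = 1)
  recovered <- complete(observed, fit)
  cor(as.vector(recovered), as.vector(ratings))   # how well the full matrix is recovered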

Although the mathematics behind softImpute can be challenging, the rationale should sound familiar.  Movies are not unique.  You can see this clearly in the maps from the earlier link to matrix factorization.  If nothing else, genre creates similarity, as does leading actor and director.  When asked to sort movies based on their similarity, most viewers tend to make the same kinds of distinctions using many of the same features.  In fact, we could create a content-based recommender system based solely on movie similarity.
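
A content-based recommender needs nothing more than a measure of similarity over movie features.  The sketch below uses a tiny hypothetical genre matrix and cosine similarity; the feature names and movies are made up for illustration only.

  # cosine similarity between two feature vectors
  cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

  # hypothetical feature matrix: rows are movies, columns are genre indicators
  features <- matrix(c(1,0,1,  1,0,0,  0,1,1,  0,1,0), nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("movie", 1:4), c("action", "comedy", "drama")))

  liked <- "movie1"
  sims <- apply(features, 1, cosine, b = features[liked, ])
  sort(sims[names(sims) != liked], decreasing = TRUE)   # most similar movies first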

Moreover, customers are not unique.  The same features that we use to describe similarity among movies will serve to define preference heterogeneity among respondents.  This reciprocity flows from the mutual co-evolution of movies and viewer preferences.  One successful film release gives rise to many clones, which are seen by the same viewers because they are "like" the original movie.  Consequently, these large matrices are not as complicated as they seem at first.  They appear to be of high dimension, but they have low rank, or alternatively we might say that the observed ratings can be explained by a much smaller number of latent variables.  An example from marketing research might help.

My example is a study where 1603 users were persuaded with incentives to complete a long battery of 165 importance ratings.  The large number of ratings reflected the client's need to measure interest in specific features and services in actual usage occasions.  The complete matrix contains 1603 rows and 165 columns with no missing data.  As you can see in the scree plot below, the first principal component accounted for a considerable amount of the total variation (35%).  There is a clear elbow, although the eigenvalues are still greater than one for the first twenty or so components.  The substantial first principal component reflects the considerable heterogeneity we observe among users.  This is common for product categories with customers who run the gamut from basic users with limited needs to premium users wanting every available feature and service.
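
The scree plot itself takes only a few lines.  Here is a sketch, assuming the complete ratings sit in a numeric data frame that I will call importance (a hypothetical name for this post).

  # principal components of the complete 1603 x 165 ratings
  pca <- prcomp(importance, scale. = TRUE)
  eig <- pca$sdev^2
  eig[1] / sum(eig)                      # share of variance for the first component
  plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue")
  abline(h = 1, lty = 2)                 # eigenvalue-greater-than-one reference line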

To illustrate how the R package softImpute works, I will discard 85% of the observed ratings.  That is, I will randomly select for each respondent 140 of the 165 ratings and set these ratings to missing (NA).  Thus, the complete matrix has 165 ratings for every respondent, while the matrix to be completed has only 25 randomly selected ratings for each respondent.
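
One way to build this incomplete-by-design matrix is sketched below, again using the hypothetical name importance for the complete data.

  # for each respondent, randomly set 140 of the 165 ratings to NA
  set.seed(123)
  importance_missing <- as.matrix(importance)
  for (i in seq_len(nrow(importance_missing))) {
    drop_items <- sample(ncol(importance_missing), 140)
    importance_missing[i, drop_items] <- NA
  }
  mean(is.na(importance_missing))        # should be close to 0.85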

In order to keep this analysis simple, I will use the defaults from the softImpute R package.  The function is also called softImpute(), and it requires only a few arguments.  It needs a data matrix, so I have used the function as.matrix() to convert my data frame into a matrix.  It needs a rank.max value to restrict the rank of the solution, which I set to 25 because of the location of the elbow in the scree plot.  And finally the function requires a value for lambda, the nuclear-norm regularization parameter.  Unfortunately, it would take some time to discuss regularization and how best to set the value of this parameter.  It is an important issue, as is the question of the conditions under which matrix completion might or might not be successful.  But you will have to wait for a later post.  For this analysis, I have set lambda to 15 because the authors of softImpute, Hastie and Mazumder, recommend that lambda be slightly less than rank.max, and most of the examples in the manual set lambda to a value of about 60% of rank.max.  As you will see, these values appear to do quite well, so I accepted this solution.
  library(softImpute)                                      # load the package
  fit <- softImpute(importance_missing, rank.max = 25, lambda = 15)
  importance_impute <- complete(importance_missing, fit)   # fill in the NAs with imputed values
This is the R code needed to run softImpute.  The matrix with the missing importance ratings is called importance_missing.  The object called importance_impute is a 1603 x 165 completed matrix with the missing values replaced by "recommendations" or imputed values.

How good is our completed matrix?  At the aggregate level we are able to reproduce the item means almost perfectly, as you can see from the plot below.  The x-axis shows the item means from the completed matrix after the missing values have been imputed.  The item means calculated from the complete data matrix are the y-values that we want to reproduce.  The high R-squared tells us that we learn a considerable amount about the importance of the individual items even when each respondent is shown only a small randomly selected subset of the item pool.
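
The aggregate comparison amounts to a few lines of R; here is a sketch using the variable names assumed above.

  means_imputed  <- colMeans(importance_impute)    # item means after matrix completion
  means_complete <- colMeans(importance)           # item means from the complete data
  plot(means_imputed, means_complete,
       xlab = "Item means after imputation", ylab = "Item means from complete data")
  summary(lm(means_complete ~ means_imputed))$r.squared   # aggregate-level fit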


It is useful to note the spread in the mean item ratings across the 7-point scale.  We do not see the typical negative skew with a large proportion of respondents falling in the top-box of the rating scale, which is so common when importance is measured using a smaller number of more abstract statements.  Of course, this was one of the reasons our client wanted to get "specific" and needed to ask so many items.

As might be expected, the estimation is not as accurate at the level of the individual respondent.  We started with complete data, randomly sampled 15% of the items, and then completed the matrix using softImpute.  The average correlation between the original and imputed ratings for individual respondents across the 165 items was 0.65.  Given that we deleted 85% of each respondent's ratings, an average R-squared of 0.42 seems acceptable.  But is it important to reproduce every respondent's rating to every question?  If this were an achievement test, we would not be concerned about the separate items.  Individual respondent ratings on each item contain far too much noise for this level of analysis.
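
The respondent-level check looks something like the following sketch (again with the assumed names).

  importance_mat <- as.matrix(importance)               # complete data as a matrix
  person_r <- sapply(seq_len(nrow(importance_mat)), function(i) {
    cor(importance_mat[i, ], importance_impute[i, ])    # one correlation per respondent
  })
  mean(person_r)          # the post reports an average correlation of about 0.65
  mean(person_r)^2        # squaring that average gives roughly the 0.42 noted above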

On the other hand, the strong first principal component that I reported earlier in this post justifies the calculation of a total score for each respondent, and our imputed total score performs well as a predictor of the total score calculated from the complete data matrix (R-squared = 0.93).  In addition to the total score, it is reasonable to expect to find some specific factors or latent variables from which we can calculate factor scores.  I am speaking of a bifactor structure where covariation among the items can be explained by the simultaneous effects of a general factor plus one or more specific factors.  I was able to identify the same four specific factors for the complete data and for the imputed data from the softImpute matrix completion.  The average correlation between the corresponding factor scores from the complete and imputed data was 0.79, suggesting that we were able to recover the underlying factor structure even when 85% of the data were missing.
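
The total-score comparison in code, once more with the assumed names:

  total_complete <- rowSums(importance)          # total score from the complete data
  total_imputed  <- rowSums(importance_impute)   # total score from the imputed matrix
  cor(total_complete, total_imputed)^2           # the post reports an R-squared near 0.93
  # A bifactor structure could be examined with, for example, psych::omega() run on
  # both matrices, though the four specific factors above come from the author's analysis.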

So it seems that we can have it all.  We can have long batteries of detailed questions that require customers to recall their actual experiences with specific features and services in real usage occasions.  In fact, given an underlying low-dimensional factor structure, we need to randomly sample only a small fraction of the item pool separately for each respondent.  Matrix factorization will uncover that structure and use it to fill in the items that were not rated and complete the data matrix.  Redundancy is a wonderful thing.  Use it!