Monday, May 6, 2013

Incomplete Data by Design: Bringing Machine Learning to Marketing Research

Survey research deals with the problem of question wording by always asking the same question.  Thus, the Gallup Daily Tracking is filled with examples of moving averages for the exact same question asked precisely the same way every day.  The concern is that even small changes in the wording can have an impact on how people perceive and respond to the question.  The question itself becomes the construct and our sole focus.

Yet, such a practice severely limits how much marketing researchers can learn about customers.  Think of all the questions that a service provider, such as a mobile phone company, would want to ask its customers about both their personal and work-related usage:  voice, text, and data performance, reliability, quality, speed, internet, email, product features, cost, billing, support, customer service, bundling, multitasking, upgrades, promotions, and much more.  Moreover, if we want actionable findings, we must "drill" down into each of these categories seeking information about specific attributes and experiences.  Unfortunately, we never reach the level of specificity needed for action.  The respondent burden quickly becomes excessive because we believe that every respondent must be shown every item.  Our fallback position is a reliance on summary evaluative judgments, for example, satisfaction ratings asked at a level of abstraction far removed from actual customer experiences.  We find the same constraints in many customer surveys, whether those questionnaires ask about usage, performance, perceptions, or benefits.

What if we did not require every respondent to complete every item?  Obviously, we could no longer compare respondents question by question.  This would be a problem only if each question were unique and irreplaceable.  But few believe this to be the case.  Most of us accept some version of latent variable modeling and hold that answers to questions are observed or manifest indicators of a smaller number of latent constructs.  In educational testing, for example, we have an item pool from which we can sample individual items.  Actually, with computer adaptive testing one can tailor the item selection process to match the respondent's ability level.  If we have many items measuring the same construct, then we are free to sample items as we would respondents.  Generalizability theory is a formal expression of this concept.  The construct should be our focus.  The individual ratings serve a secondary role as indicators.
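The sampling scheme itself is easy to sketch.  Below is a small base-R illustration (all sizes hypothetical, not taken from any study in this post) in which each respondent is randomly assigned a subset of items from the pool, so that no one answers everything but every item is still answered by many respondents in aggregate.

```r
# Planned missingness by item sampling (hypothetical sizes).
set.seed(42)
n_respondents    <- 100
n_items          <- 40
items_per_person <- 10   # each respondent sees only 10 of the 40 items

# Assignment matrix: TRUE where an item is shown to a respondent
shown <- t(sapply(1:n_respondents, function(i) {
  s <- rep(FALSE, n_items)
  s[sample(n_items, items_per_person)] <- TRUE
  s
}))

# Every respondent answers exactly 10 items...
all(rowSums(shown) == items_per_person)   # TRUE
# ...and each item is still seen by 25 respondents on average
mean(colSums(shown))                      # 25
```

With 100 respondents each rating 10 of 40 items, every item is answered about 25 times, which is why item-level statistics remain estimable even though no individual saw the full battery.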

In a previous post on reducing respondent burden, I outlined an approach using item response theory as the basis for item sampling.  If you recall, I used a short airline satisfaction questionnaire with 12 items as my example.  The ratings were the usual summary evaluative judgments measuring performance on the typical five-point scale with only the endpoints labeled with the words "poor" and "excellent."  As I showed in another post, such rating scales tend to yield a strong first principal component or halo effect.  When the ratings are context-free, respondents can answer without needing to recall any encounter or usage occasion.  However, this is precisely what we are trying to avoid by using ratings of specific experiences or needs in the actual usage context.

For example, I might be reluctant to ask you to rate the importance of call quality without specifying explicit criteria because it would require far too much respondent interpretation.  Without specificity, the respondent is forced to "fill in the blanks," and the researcher has no idea what the respondent was thinking when answering the question.  Instead, I would ask about dropped calls or needing to repeat what was said to a client while traveling on business.  We are not as likely to find a single dimension underlying ratings at such high levels of specificity.  Multidimensional item response theory (the R package mirt) is one alternative, but I would like to suggest the R package softImpute as a flexible and comprehensive solution to the matrix completion problem.  Perhaps we should visit the machine learning lab to learn more about matrix completion and recommender systems.

The Matrix Completion Problem and Recommender Systems

In 2006 Netflix released a data set with almost 500,000 respondents (rows) and almost 18,000 movies (columns).  The entries were movie ratings on a scale from 1 to 5 stars, although close to 99% of the matrix was empty.  The challenge was to complete the matrix, or fill in the missing values, since an individual rates only a small proportion of all the movies.  That is, in order to make personal recommendations for an individual respondent, we need to estimate what their ratings would have been for all the movies they did not rate.  One solution was to turn to matrix factorization and take advantage of the structure underlying both movies and viewer preferences.  Matrix completion can be run in R using the new package softImpute.

Although the mathematics behind softImpute can be challenging, the rationale should sound familiar.  Movies are not unique.  You can see this clearly in the maps from the earlier link to matrix factorization.  If nothing else, genre creates similarity, as does leading actor and director.  When asked to sort movies based on their similarity, most viewers tend to make the same kinds of distinctions using many of the same features.  In fact, we could create a content-based recommender system based solely on movie similarity.

Moreover, customers are not unique.  The same features that we use to describe similarity among movies will serve to define preference heterogeneity among respondents.  This reciprocity flows from the co-evolution of movies and viewer preferences.  One successful film release gives rise to many clones, which are seen by the same viewers because they are "like" the original movie.  Consequently, these large matrices are not as complicated as they seem at first.  They appear to be of high dimension, but they have low rank; alternatively, we might say that the observed ratings can be explained by a much smaller number of latent variables.  An example from marketing research might help.
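A small simulation makes the "high dimension, low rank" point concrete.  The base-R sketch below (simulated data, not the Netflix set) builds a 200 x 50 ratings matrix from just three latent factors per viewer and per movie; despite its size, the resulting matrix has rank three.

```r
# Low-rank structure from latent factors (simulated illustration).
set.seed(1)
k <- 3                                  # a few genre-like latent dimensions
U <- matrix(rnorm(200 * k), 200, k)     # 200 viewers' factor scores
V <- matrix(rnorm(50 * k), 50, k)       # 50 movies' factor loadings
M <- U %*% t(V)                         # "true" 200 x 50 ratings matrix

# An SVD reveals only k singular values above numerical noise
d <- svd(M)$d
sum(d > 1e-8)   # 3
```

This is exactly the redundancy that matrix completion exploits: once the factors are estimated from the observed entries, the missing entries follow from the factorization.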

My example is a study where 1603 users were persuaded with incentives to complete a long battery of 165 importance ratings.  The large number of ratings reflected the client's need to measure interest in specific features and services in actual usage occasions.  The complete matrix contains 1603 rows and 165 columns with no missing data.  As you can see in the scree plot below, the first principal component accounted for a considerable amount of the total variation (35%).  There is a clear elbow, although the eigenvalues are still greater than one for the first twenty or so components.  The substantial first principal component reflects the considerable heterogeneity we observe among users.  This is common for product categories with customers who run the gamut from basic users with limited needs to premium users wanting every available feature and service.
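For readers who want to reproduce this kind of scree plot, here is a base-R sketch with simulated stand-in data (the client's ratings cannot be shared): a general factor running through all the items produces the dominant first principal component, much as in the study above.

```r
# Scree plot for ratings with a strong general factor (simulated stand-in).
set.seed(99)
general <- rnorm(300)                                # shared general factor
ratings <- sapply(1:30, function(j) general + rnorm(300))  # 300 x 30 matrix

pc <- prcomp(ratings, scale. = TRUE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
var_explained[1]                     # large share on the first component
screeplot(pc, npcs = 30, type = "lines")   # elbow after the first component
```

The share of variance on the first component depends on how strong the general factor is relative to item-specific noise; the 35% reported above is what the real data showed.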

To illustrate how the R package softImpute works, I will discard 85% of the observed ratings.  That is, I will randomly select for each respondent 140 of the 165 ratings and set these ratings to missing (NA).  Thus, the complete matrix had 165 ratings for every respondent, and the matrix to be completed has 25 randomly selected ratings for each respondent.
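Here is how that masking step looks in base R.  The importance matrix below is a simulated stand-in for the client's 1603 x 165 data, since the original ratings cannot be shared; the masking loop itself is the procedure described above.

```r
# Randomly blank out 140 of the 165 ratings for each respondent.
# (importance is a simulated stand-in for the real complete matrix.)
set.seed(123)
importance <- matrix(sample(1:7, 1603 * 165, replace = TRUE), 1603, 165)

importance_missing <- importance
for (i in 1:nrow(importance_missing)) {
  importance_missing[i, sample(165, 140)] <- NA   # 140 items set to missing
}

# Each respondent now has exactly 25 observed ratings
all(rowSums(!is.na(importance_missing)) == 25)   # TRUE
```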

In order to keep this analysis simple, I will use the defaults from the softImpute R package.  The function is also called softImpute(), and it requires only a few arguments.  It needs a data matrix, so I have used the function as.matrix() to convert my data frame into a matrix.  It needs a rank.max value to restrict the rank of the solution, which I set to 25 because of the location of the elbow in the scree plot.  And finally, the function requires a value for lambda, the nuclear-norm regularization parameter.  Unfortunately, it would take some time to discuss regularization and how best to set this parameter.  It is an important issue, as is the question of the conditions under which matrix completion might or might not be successful.  But you will have to wait for a later post.  For this analysis, I set lambda to 15 because the authors of softImpute, Hastie and Mazumder, recommend that lambda be somewhat smaller than rank.max, and most of the examples in the manual set lambda to about 60% of rank.max.  As you will see, these values appear to do quite well, so I accepted this solution.
  library(softImpute)
  fit <- softImpute(importance_missing, rank.max=25, lambda=15)
  importance_impute <- complete(importance_missing, fit)
This is the R code needed to run softImpute.  The matrix with the missing importance ratings is called importance_missing.  The object importance_impute is the 1603 x 165 completed matrix with the missing values replaced by "recommendations" or imputed values.
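If you are curious about what lambda is doing, the essence of the nuclear-norm penalty is soft-thresholding of the singular values.  The base-R sketch below is a toy illustration of that single step, not the actual softImpute algorithm (which iterates this operation over the missing entries): every singular value is shrunk by lambda, and any value that falls below lambda is dropped, reducing the rank.

```r
# Soft-thresholding of singular values: the core of the nuclear-norm penalty.
soft_threshold_svd <- function(X, lambda) {
  s <- svd(X)
  d <- pmax(s$d - lambda, 0)           # shrink; values below lambda become 0
  s$u %*% diag(d) %*% t(s$v)
}

X <- diag(c(10, 5, 1))                 # toy matrix with singular values 10, 5, 1
svd(soft_threshold_svd(X, 2))$d        # 8, 3, 0 -- rank reduced from 3 to 2
```

Larger lambda values thus produce lower-rank, more heavily regularized solutions, which is why lambda and rank.max work together.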

How good is our completed matrix?  At the aggregate level we are able to reproduce the item means almost perfectly, as you can see from the plot below.  The x-axis shows the item means from the completed matrix after the missing values have been imputed.  The item means calculated from the complete data matrix are the y-values that we want to reproduce.  The high R-squared tells us that we learn a considerable amount about the importance of the individual items even when each respondent is shown only a small randomly selected subset of the item pool.


It is useful to note the spread in the mean item ratings across the 7-point scale.  We do not see the typical negative skew with a large proportion of respondents falling in the top-box of the rating scale, which is so common when importance is measured using a smaller number of more abstract statements.  Of course, this was one of the reasons our client wanted to get "specific" and needed to ask so many items.

As might be expected, the estimation is not as accurate at the level of the individual respondent.  We started with complete data, randomly sampled 15% of the items, and then completed the matrix using softImpute.  The average correlation between the original and imputed ratings for individual respondents across the 165 items was 0.65.  Given that we deleted 85% of each respondent's ratings, an average R-squared of 0.42 seems acceptable.  But is it important to reproduce every respondent's rating to every question?  If this were an achievement test, we would not be concerned about the separate items.  Individual respondent ratings on each item contain far too much noise for this level of analysis.
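The respondent-level check is straightforward to compute.  The sketch below uses simulated stand-ins for the original and imputed matrices, but the calculation (correlate each respondent's two sets of ratings across items, then average) is the one reported above.

```r
# Respondent-level accuracy: average the within-person correlation between
# original and imputed ratings. (Simulated stand-ins for both matrices.)
set.seed(7)
original <- matrix(rnorm(100 * 20), 100, 20)
imputed  <- original + matrix(rnorm(100 * 20), 100, 20)  # noisy recovery

r_by_respondent <- sapply(1:nrow(original),
                          function(i) cor(original[i, ], imputed[i, ]))
mean(r_by_respondent)       # average correlation across respondents
mean(r_by_respondent)^2     # the corresponding average R-squared
```

In the actual study these two summaries were 0.65 and 0.42; the simulated values here depend only on the noise level chosen above.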

On the other hand, the strong first principal component that I reported earlier in this post justifies the calculation of a total score for each respondent, and our imputed total score performs well as a predictor of the total score calculated from the complete data matrix (R-squared = 0.93).  In addition to the total score, it is reasonable to expect to find some specific factors or latent variables from which we can calculate factor scores.  I am speaking of a bifactor structure in which covariation among the items can be explained by the simultaneous effects of a general factor plus one or more specific factors.  I was able to identify the same four specific factors in the complete data and in the imputed data from the softImpute matrix completion.  The average correlation between the corresponding factor scores from the two analyses was 0.79, suggesting that we were able to recover the underlying factor structure even when 85% of the data were missing.

So it seems that we can have it all.  We can have long batteries of detailed questions that require customers to recall their actual experiences with specific features and services in real usage occasions.  In fact, given an underlying low-dimensional factor structure, we need only randomly sample a small fraction of the item pool separately for each respondent.  Matrix factorization will uncover that structure and use it to fill in the items that were not rated and complete the data matrix.  Redundancy is a wonderful thing.  Use it!

4 comments:

  1. You have to be very careful how you impute items that you left blank on purpose because they aren't relevant or the answer is none. For example, if you don't ask someone how often they visit the store because they say they don't visit the store, you can set the answer to "never" rather than trying to impute it. If you decide to go ahead and impute non-relevant items, you have to be careful interpreting results. For example, how do you interpret the satisfaction score you imputed for those who haven't used the product? Depending on how you imputed, this may or may not have a useful interpretation. It's useful to have a complete matrix for certain kinds of analyses, but you can't blindly use the results. Also, you have to ask yourself why you're imputing at all. With the right experimental design, you can often collect a complete unbiased correlation matrix or a complete unbiased conditional probability table, etc., that could allow you to do your analysis without needing to do any imputation.

    ReplyDelete
    Replies
    1. Remember that the post dealt with planned missingness or missingness by design. The data are missing because the items were randomly sampled. One might call this missing completely at random.

      Delete
  2. Hi Joel, could you please explain how you got the max.rank? From the image you have provided, the elbow of the scree plot appears to start to bend at ~25; hence, that is the max rank? Also, I am interested in learning more about the regularization... looking forward to that post!

    ReplyDelete
    Replies
    1. You are correct. I used the scree plot to set the max rank. If it helps, you can think of it as the maximum number of principal components that one might need to account for all the "true" variation. That is, after some point, the remaining principal components represent error, and we would rather not include them (especially if we are concerned about overfitting, which is the point of regularization). Actually, 25 is considerably more than we need, but it seems to work fine.

      Delete