Tuesday, May 26, 2015

Respecting Real-World Decision Making and Rejecting Models That Do Not: No MaxDiff or Best-Worst Scaling

Utility has been reified, and we have committed the fallacy of misplaced concreteness.

As this link illustrates, Sawtooth's MaxDiff provides an instructive example of reification in marketing research. What is the contribution of "clean bathrooms" when selecting a fast food restaurant? When using the drive-thru window, the cleanliness of the bathrooms is never considered, yet that is not how we answer that self-report question, either on a rating scale or in a best-worst choice exercise. Actual usage never enters the equation. Instead, the wording of the question invites us to enter a Platonic world of ideals inhabited by abstract concepts of "clean bathrooms" and "reasonable prices" where everything can be aligned on a stable and context-free utility scale. We "trade off" the semantic meanings of these terms, with the format of the question shaping our response. Such is the nature of self-reports (see especially the Self-Reports paper from 1999).

On the other hand, in the real world sometimes clean bathrooms don't matter (drive-thru) and sometimes they are the determining factor (stopping along the highway during a long drive). Of course, we are assuming that we all agree on what constitutes a clean bathroom and that the perception of cleanliness does not depend on the comparison set (e.g., a public facility without running water). Similarly, "reasonable prices" has no clear referent, with each respondent applying their own range each time they see the item in a different context.

It is all just so easy for a respondent to accept the rules of the game and play without much effort. The R package support.BWS (best-worst scaling) will generate the questionnaire with only a few lines of code. You can see two of the seven choice sets below. When the choice sets have been created using a balanced incomplete block design, a rank ordering of the seven fruits can be derived by subtracting the number of worst selections from the number of best picks. It is called "best-worst scaling" because you pick the best and worst from each set. Since the best-worst choice also identifies the pair that is most separated, some use the term MaxDiff rather than best-worst.

 Best   Items    Worst
 [ ]    Apple    [ ]
 [ ]    Banana   [ ]
 [ ]    Melon    [ ]
 [ ]    Pear     [ ]

 Best   Items    Worst
 [ ]    Orange   [ ]
 [ ]    Grapes   [ ]
 [ ]    Banana   [ ]
 [ ]    Melon    [ ]
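The best-minus-worst tally is simple enough to sketch in a few lines. What follows is plain Python rather than support.BWS, and it rests on several assumptions: the seventh fruit ("Peach") is a placeholder, the design is a textbook cyclic construction of seven sets of four rather than whatever support.BWS would generate, and the respondent is hypothetical, always preferring whichever fruit comes earlier in the list.

```python
from collections import Counter
from itertools import combinations

# Six fruits appear in the two choice sets shown; "Peach" is an assumed
# placeholder for the seventh.
ITEMS = ["Apple", "Orange", "Grapes", "Banana", "Peach", "Melon", "Pear"]


def cyclic_design(n_items=7, base=(0, 1, 2, 4)):
    """Shift one base block around a circle to get 7 sets of 4 items."""
    return [[(b + shift) % n_items for b in base] for shift in range(n_items)]


def best_minus_worst(design, choose):
    """Score each item as (times picked best) minus (times picked worst)."""
    best, worst = Counter(), Counter()
    for block in design:
        names = [ITEMS[i] for i in block]
        b, w = choose(names)
        best[b] += 1
        worst[w] += 1
    return {item: best[item] - worst[item] for item in ITEMS}


def list_order_respondent(names):
    """Hypothetical respondent who always prefers fruits earlier in ITEMS."""
    ranked = sorted(names, key=ITEMS.index)
    return ranked[0], ranked[-1]


design = cyclic_design()

# Balance checks: how often each item, and each pair of items, is shown.
item_counts = Counter(i for block in design for i in block)
pair_counts = Counter(
    pair for block in design for pair in combinations(sorted(block), 2)
)

scores = best_minus_worst(design, list_order_respondent)
ranking = sorted(ITEMS, key=lambda item: -scores[item])
print(scores)
print(ranking)
```

In this design every fruit is shown four times and every pair of fruits appears together twice, which is what lets a simple count stand in for a ranking. Notice that nothing in the tally knows whether the fruit is for breakfast or for a pie.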

The terms of play require that we decontextualize in order to make a selection. Otherwise, we could not answer. I love apples, but not for breakfast, and they can be noisy and messy to eat in public. Grapes are good to share, and bananas are easy to take with you in a purse, a bag or a coat pocket. Now, if I am baking a pie or making a salad, it is an entirely different story. Importantly, this is where we find utility, not in the object itself, but in its usage. It is why we buy, and therefore, usage should be marketing's focus.

"Hiring Milkshakes"

Why would 40% of the milkshakes be sold in the early morning? The above link will explain the refreshment demands of the AM commute to work. It will also remind you of the wisdom from Theodore Levitt that one "hires" the quarter-inch drill bit in order to produce the quarter-inch hole. Utility resides not in the drill bit but in the value of what can be accomplished with the resulting hole. Of course, one buys the power tool in order to do much more than make holes, which brings us to the analysis of usage data.

In an earlier post on taking inventory, I outlined an approach for analyzing usage data when the most frequent response was no, never, none or not applicable. Inquiries about usage access episodic memory, so the probes must be specific. Occasion needs to be mentioned for special purchases that would not be recalled without it. The result is high-dimensional and sparse data matrices. Thus, while the produce market is filled with different varieties of fruit that can be purchased for various consumption occasions, the individual buyer samples only a small subset of this vast array. Fortunately, R provides a number of approaches, including the non-negative matrix factorization (NMF) outlined in my taking inventory post. We should be careful not to forget that context matters when modeling human judgment and choice.
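To give a feel for what the factorization does with sparse usage data, here is a minimal sketch of NMF using the Lee and Seung multiplicative updates, written in plain Python for illustration. The usage matrix is made up (rows are hypothetical respondents, columns are hypothetical fruit-by-occasion items, 1 means "used"), and the function names are my own. This is not the analysis from the taking inventory post.

```python
import random

# Made-up usage data: 6 respondents by 6 fruit-occasion items.
USAGE = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate V by W H with non-negative W (rows x rank), H (rank x cols)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


W, H = nmf(USAGE, rank=2)
for row in H:
    print([round(x, 2) for x in row])  # item loadings on each latent usage pattern
```

Each row of H describes a latent usage pattern as a set of non-negative item weights, and each row of W says how much a respondent draws on each pattern. Because nothing can be negative, the patterns are built only from what people actually use, which suits data where most answers are "never."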

Note: I believe that the R package support.BWS was added to CRAN about the time that I posted "Why doesn't R have a MaxDiff package?". As its name implies, the package supports the design, administration and analysis of data using best-worst scaling. However, support.BWS does not attempt to replicate the hierarchical Bayes estimation implemented in Sawtooth's MaxDiff, which is what I meant when I said that R does not have a MaxDiff package.


  1. Hi Joel,

    I admit I don't really understand why you're confounding the methodology of MaxDiff (as in, the process in which respondents select their answers) with the context (or lack thereof) within the question wording. When you mention that 'cleanliness of bathrooms' would not matter if one uses the drive-thru but would if one sits down within the restaurant, I fully agree with you that context matters. The solution seems simple to me - just have the question stated something like "Imagine you want to sit down at a fast-food restaurant to have a meal. Which of these aspects do you consider most/least important?". Maybe I'm missing something else here, but what does that have to do with MD's methodology?

    1. Now, we have two sets of MaxDiff exercises, one for drive-thru and one for sitting down within the restaurant. Do you think that we might get a different rank ordering for breakfast, lunch or dinner? What if it is late in the evening when safety might be important? What if we have young children with us? It takes some time to repeat the MaxDiff exercise for each occasion, and it takes even longer to rerun the MCMC simulations. MaxDiff does not scale well, which is why context is not specified.

    2. This is a problem that applies equally to scaled or rank-ordered questionnaires. If we set the context to late in the evening, I'd be more inclined to select a rating higher up in the scale (or rank it higher in the order) than if the context is in the middle of the day.

      I'm not debating whether context matters. Of course it does. But I fail to see how this is a problem that applies only to MaxDiff and not to other scaled or rank-ordered question designs.

    3. Exactly, we need a methodology that enables us to incorporate context because real-world decision making is contextual. And we agree that MaxDiff or Best-Worst Scaling is not the answer. I turned toward machine learning for assistance and found nonnegative matrix factorization (NMF) as a way to analyze the high-dimensional and sparse data resulting when we get specific and ask about usage occasion.

    4. Fair enough regarding NMF (and your posts have been quite interesting regarding this type of analysis), but let's not accuse MD of failing at something it never claimed to solve in the first place. The reason Jordan Louviere came up with this type of question design was to do away with the issue of scale interpretation bias and respondent straight-lining through grid questions.

    5. Except that you are forced to reintroduce the rating scale because best-worst scaling overcorrects so that the uniformly high rater has the same profile as the uniformly low rater (e.g., Sawtooth's Anchored MaxDiff or Louviere's dual response).

  2. Isn't there a base preference updated with conditional probabilities? Like for fruits: Do I not generally prefer apples to melons, or clean bathrooms as a sign of food hygiene? Yes, because one restaurant is more successful than another, one fruit is more successful than another: http://www.statista.com/statistics/264001/worldwide-production-of-fruit-by-variety/