Monday, January 14, 2013

Warning: Sawtooth's MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!

Sawtooth Software has created a good deal of confusion with its latest sales video published on YouTube.  I was contacted last week by one of my clients who had seen the video and wanted to know why I was not using such a powerful technique for measuring attribute importance.  "It gives you a ratio scale,"  he kept repeating.  And that is what Sawtooth claims. At about nine minutes into the video, we are told that Maximum Difference Scaling yields a ratio scale where a feature with a score of ten is twice as important as a feature with a score of five.

Where should I begin?  Perhaps the best approach is simply to look at the example that Sawtooth provides in the video.  Sawtooth begins with a list of  the following 10 features that might be important to customers when selecting a fast food restaurant:

1. Clean eating areas (floors, tables, and chairs),
2. Clean bathrooms,
3. Has health food items on the menu,
4. Typical wait time is about 5 minutes in line,
5. Prominently shows calorie information on menu,
6. Prices are very reasonable,
7. Your order is always completed correctly,
8. Has a play area for children,
9. Food tastes wonderful, and
10. Restaurant gives generously to charities.

Sawtooth argues in the video that it becomes impractical for a respondent to rank order more than about seven items.  Although that might be true for a phone interview, MaxDiff surveys are administered on the internet.  How hard is it to present all 10 features and ask a respondent which is the most important?  Let's pretend we are respondents with children.  That was easy; "has a play area for children" is the most important.  Now the screen is refreshed with only the remaining nine features, and the respondent is again asked to select the most important feature.  This continues until all the features have been rank ordered.

What if there were 20 features?  The task gets longer and respondents might require some incentive, but the task does not become more difficult.  Picking the best of a list becomes more time consuming as the list gets longer.  However, the cognitive demands of the task remain the same.  One works their way down the list, comparing each new feature to whatever feature was last considered to be the most important.  For example, our hypothetical respondent has selected play area for children as the most important feature.  If another feature were added to the list, they would compare the new feature to play area for children and decide to keep play area or replace it with the new feature.
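
The ranking task just described can be sketched in a few lines of R.  This is only an illustration: the importance values are made up for one hypothetical respondent, and the function names pick_best and rank_by_successive_best are invented here rather than taken from any package.

```r
# Hypothetical importance values for one respondent (made-up numbers).
importance <- c(clean_eating = 6, clean_bathrooms = 5, health_food = 2,
                wait_time = 4, calorie_info = 1, prices = 8,
                order_correct = 7, play_area = 10, taste = 9, charity = 3)

# One pass down the list: keep the current favorite unless the next
# feature beats it. Each step is a single paired comparison.
pick_best <- function(items, importance) {
  best <- items[1]
  for (item in items[-1]) {
    if (importance[[item]] > importance[[best]]) best <- item
  }
  best
}

# Repeat the pick on the remaining features until all are ranked.
rank_by_successive_best <- function(importance) {
  remaining <- names(importance)
  ranking <- character(0)
  while (length(remaining) > 0) {
    best <- pick_best(remaining, importance)
    ranking <- c(ranking, best)
    remaining <- setdiff(remaining, best)
  }
  ranking
}

ranking <- rank_by_successive_best(importance)
print(ranking)
```

Under this sketch a list of n features takes n(n-1)/2 paired comparisons in total, which is 45 for 10 features and 190 for 20.  The work grows, but every step is the same simple comparison.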

Sawtooth argues that such a rank ordering is impractical and substitutes a series of best and worst choices from a reduced set of features.  For example, the first four features might be presented to a respondent who is asked to select the most and least important among only those four features.  Since Sawtooth explains their data collection and analysis procedures in detail on their website, I will simply provide the link here and make a couple of points.  First, one needs a lot of these best-worst selections from sets of four features in order to make individual estimates (think incomplete block designs).  Second, it is not the most realistic or interesting task (if you do not believe me, go to the previous link and take the web survey example).  Consequently, only a limited number of best-worst sets are presented to any one respondent, and individual estimates are calculated using hierarchical Bayesian estimation.

This is where most get lost if they are not statisticians.  The video claims that hierarchical Bayes yields ratio scale estimates that sum to a constant.  Obviously, this cannot be correct, not for ratio or interval scaling.  The ratio scale claim refers to the finding that one feature might be selected twice as often as another from the list.  But that "ratio" depends on what else is in the list.  It is not a characteristic of the feature alone.  If you change the list or change the wording of the items in the list, you will get a different result.  For example, what if the wording for the price feature were changed from "very reasonable" to just "reasonable" without the adverb?  How much does the ranking depend on the superlatives used to modify the feature?  Everything is relative.  All the scores from Sawtooth's MaxDiff are relative to the features included in the set and the way they are described (e.g., vividness and ease of affective imagery will also have an impact). 

To make it clear that MaxDiff is nothing more than a rank ordering of the features, consider the following thought experiment.  Suppose that you went through the feature list and rank ordered the 10 features.  Now you are given a set of four features, but I will use your rankings to describe the features where 1=most important and 10=least important.  If the set included features ranked 3rd, 5th, 7th, and 10th, then you would select feature 3 as the most important and feature 10 as the least important.  We could do this forever, because selecting the best and worst depends only on the rank ordering of the features.  Moreover, it does not matter how close or far away the features are from each other; only their rankings matter.
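
The thought experiment can be checked with a short R sketch.  It assumes error-free choices, labels each feature by its rank (feature 1 is ranked first), and uses two made-up sets of utilities that share the same rank order but differ wildly in spacing.  The function name best_worst is invented for this illustration.

```r
# Two hypothetical respondents with the same rank order of 10 features
# but very different spacing between them (made-up utilities).
even_spacing   <- 10:1
uneven_spacing <- c(1000, 999, 998, 50, 49, 10, 9, 8, 2, 1)

# Best = highest utility in the set, worst = lowest utility in the set.
best_worst <- function(set, utility) {
  c(best  = set[which.max(utility[set])],
    worst = set[which.min(utility[set])])
}

# Every possible set of 4 features out of 10 (one set per column).
sets <- combn(10, 4)

choices_even   <- apply(sets, 2, best_worst, utility = even_spacing)
choices_uneven <- apply(sets, 2, best_worst, utility = uneven_spacing)

# Compare the two respondents' answers across all sets.
same_answers <- identical(choices_even, choices_uneven)
print(same_answers)

# The set from the text: features ranked 3rd, 5th, 7th, and 10th.
example <- best_worst(c(3, 5, 7, 10), even_spacing)
print(example)
```

Both respondents give the same best and worst picks on every one of the 210 possible sets, so nothing in these error-free choices can distinguish evenly spaced features from unevenly spaced ones.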

Actually, Sawtooth has recognized this fact for some time.  In a 2009 technical report, which suggested a possible "fix" called the dual response, they admitted that "MaxDiff measures only relative desirability among the items."  This supposed "adjustment" was in response to an article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate if any of the features are important enough to impact purchase behavior.  All we know is the rank ordering of the features, which we will obtain even if no feature is sufficiently important in the marketplace to change intention or behavior.  Such research has become commonplace with credit card reward features.  It is easy to imagine rank ordering a list of 10 credit card reward features that would provide no incentive to apply for a new card.  It is a product design technique that creates the best product that no one will buy.  [The effort to "fix" MaxDiff continues as you can see in the proceedings of the 2012 Sawtooth Conference.]

The R package compositions

As noted by Karl Pearson some 115 years ago, the constraint that a set of variables sum to some constant value has consequences.  Simply put, if the scores for the 10 features sum to 100, then I have only nine degrees of freedom because I can calculate the value of any one feature once I know the values of the other nine features.  As Pearson noted in 1897, this linear dependency creates a spurious negative correlation among the variables.  Too often, it is simply ignored and the data analyzed as if there were no dependency.  This is an unwise choice, as you can see from this link to The Factor Analysis of Ipsative Measures.
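
A small simulation makes Pearson's point concrete.  This is a sketch under stated assumptions: the raw scores are simulated independent draws rather than MaxDiff output, and the helper mean_offdiag is a name invented here.

```r
set.seed(123)
n_respondents <- 1000
n_features <- 10

# Independent raw scores: by construction, no feature is related to another.
raw <- matrix(rexp(n_respondents * n_features), nrow = n_respondents)

# Rescale each respondent's scores so that they sum to 100.
scores <- 100 * raw / rowSums(raw)

# Any one score is determined by the other nine.
print(all.equal(scores[, 10], 100 - rowSums(scores[, 1:9])))

# Average correlation between pairs of features.
mean_offdiag <- function(m) {
  r <- cor(m)
  mean(r[upper.tri(r)])
}
print(mean_offdiag(raw))     # close to zero before rescaling
print(mean_offdiag(scores))  # negative after rescaling
```

With 10 interchangeable features, the average correlation after rescaling should land near -1/9, even though the raw scores were generated independently.  The negative correlation comes from the sum constraint alone.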

In the social sciences we call this type of data ipsative.  In geology it is called compositional data (e.g., percent contribution of basic minerals in a rock sum to 100%).  R has a package called compositions that provides a comprehensive treatment of such data.  However, please be advised that the analysis of ipsative or compositional data can be quite demanding, even for those familiar with simplex geometry.  Still, it is an area that has been studied recently by Michael Greenacre (Biplots of Compositional Data) and by Anna Brown (Item Response Modeling of Forced-Choice Questionnaires).

Forced choice or ranking is appealing because it requires respondents to make trade-offs.  This is useful when we believe that respondents are simply rating everything high or low because they are reluctant to tell us everything they know.  However, infrequent users do tend to rate everything as less important because they do not use the product that often and most of the features are not important to them.  On the other hand, heavy users find lots of features to be important since they use the product all the time and for lots of different purposes.

Finally, we need to remember that these importance measures are self reports, and self reports do not have a good track record.  Respondents often do not know what is important to them.  For example, how much do people know about what contributes to their purchase of wine?  Can they tell us if the label on the wine bottle is important?  Mueller and Lockshin compared Best-Worst Scaling (another name for MaxDiff) with a choice modeling task.  MaxDiff tells us that the wine label is not important, but the label had a strong impact on which wine was selected in the choice study.  We should never forget the very real limitations of self-stated importance.

6 comments:

  1. Hi Joel, Nice post! As a statistician in a Market research company that uses MaxDiff from Sawtooth (a lot!), I was wondering whether there are any packages in R that can do MaxDiff?

    1. Yes, one could do something like MaxDiff in R, but I know of no package or function. One would need to generate an incomplete block design (e.g., using AlgDesign) and then analyze the choice data (e.g., using bayesm).

      You might find the AlgDesign documentation a little difficult, so let me get you started with the R-code needed to generate a balanced incomplete block design with 16 attributes or features in 20 blocks of 4 attributes per block.

      library(AlgDesign)
      set.seed(123456789)

      # block size = 4, number of blocks = 20, number of attributes = 16
      BIB<-optBlock(~., withinData=factor(1:16), blocksizes=rep(4,20), nRepeats=5000)

      The documentation for bayesm is quite good, but you will need to get the data into a form that bayesm can read, which can be a problem if you are not familiar with R's list structures.
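
      A minimal sketch of those list structures follows, with several assumptions labeled.  The set shown, the answers, and the helper code_items are hypothetical.  The layout (a list holding p and lgtdata, with y and X for each respondent) follows my reading of the rhierMnlRwMixture documentation in bayesm.  Coding the "worst" pick as a choice on sign-reversed utilities is one common convention, not something taken from the post.  No bayesm function is called here.

      ```r
      n_items <- 16                  # features in the study
      shown   <- c(3, 8, 11, 16)     # hypothetical set shown in one task
      best    <- 8                   # hypothetical answers from one respondent
      worst   <- 16

      # Dummy-code the shown items; the last item is the reference (all zeros).
      code_items <- function(items, n_items) {
        X <- matrix(0, nrow = length(items), ncol = n_items - 1)
        for (r in seq_along(items)) {
          if (items[r] < n_items) X[r, items[r]] <- 1
        }
        X
      }

      X_best  <- code_items(shown, n_items)
      X_worst <- -X_best             # assumption: worst = best on reversed utilities

      # One respondent: y holds the position chosen in each task (1 to 4),
      # X stacks the 4 rows that describe each task.
      respondent <- list(
        y = c(match(best, shown), match(worst, shown)),
        X = rbind(X_best, X_worst)
      )

      # The full data set is a list with one such entry per respondent.
      lgtdata <- list(respondent)
      Data <- list(p = length(shown), lgtdata = lgtdata)
      str(Data, max.level = 2)
      ```

      Each additional best-worst task would add two more entries to y and eight more rows to X for that respondent.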

      Still, the point of the post was that one does not want to run MaxDiff. There is no need. If you want a ranking of self-reported importance, just ask for it as I outlined in the post. But be careful with the numbers you get. They are constrained to sum to a constant, either 100 in MaxDiff or the sum of the ranks from a ranking exercise.

    2. When I read my response online, I noticed that the R-code might be a little difficult to read, so let me elaborate.

      You need to call the AlgDesign library:
      library(AlgDesign)

      You need to set the seed for random number generation (any number works):
      set.seed(123456789)

      You need the function optBlock to run the design. I placed the output in an object called BIB. Let me separate each piece on its own line:
      BIB<-optBlock(~.,
      withinData=factor(1:16),
      blocksizes=rep(4,20),
      nRepeats=5000)

      The 5000 iterations should be enough to generate a balanced incomplete block design where each attribute occurs 5 times and is paired once with every other attribute.
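
      One way to confirm that a design is balanced is to count appearances and pairings directly.  The sketch below is an illustration only: check_balance is a helper name invented here, and it is demonstrated on a small, well-known balanced design (7 items in 7 blocks of 3) so that it runs without AlgDesign.

      ```r
      # Count appearances and pair co-occurrences for a design given as a
      # matrix with one block per row.
      check_balance <- function(blocks) {
        n_items <- max(blocks)
        # incidence[b, i] is 1 when item i appears in block b
        incidence <- matrix(0, nrow = nrow(blocks), ncol = n_items)
        for (b in seq_len(nrow(blocks))) incidence[b, blocks[b, ]] <- 1
        together <- crossprod(incidence)   # items x items co-occurrence counts
        list(appearances = diag(together),
             pair_counts = together[upper.tri(together)])
      }

      # A small known balanced design for illustration (7 items, 7 blocks of 3).
      small_bib <- rbind(c(1, 2, 3), c(1, 4, 5), c(1, 6, 7), c(2, 4, 6),
                         c(2, 5, 7), c(3, 4, 7), c(3, 5, 6))
      result <- check_balance(small_bib)
      print(table(result$appearances))   # how often each item appears
      print(table(result$pair_counts))   # how often each pair appears together

      # For the 16-attribute design, assuming BIB$rows lists the selected
      # attributes block by block, the same check would be:
      # check_balance(matrix(BIB$rows, ncol = 4, byrow = TRUE))
      ```

      For a balanced 16-attribute design in 20 blocks of 4, every appearance count should be 5 and every pair count should be 1.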

  2. Rating single factors is always unrealistic. That is why true researchers use an orthogonal matrix, which can be used to rate combinations of features. I can't imagine real life researchers using rank ordering - that method is reserved for academic guys who are good at writing inane and increasingly false research papers based on made up data in lofty journals.

  3. Interesting post, but your reference to my and my colleague's article:

    "This supposed "adjustment" was in response to article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate if any of the features are important enough to impact purchase behavior."

    Is incorrect. The issue we have described, and have provided one approach to dealing with, is that the part-worths from Max-Diff (and from CBC, for that matter), cannot be compared numerically across respondents. This is a result of the kind of identification constraints used in order to make choice model parameters estimable.

    Regards,

    Lynd

    1. Thank you for your comment; however, I am a little confused. Bryan Orme does reference your article in his 2009 technical report, mentioning it as one of the reasons for introducing dual-response anchoring. And your article, Comparing Apples to Oranges, begins with a critique of MaxDiff scaling as currently practiced. For example, you claim that MaxDiff should not be used for recruiting focus groups because the scores are ipsative. Specifically, you write, "doing things with ipsative scales, such as segmentation analysis and targeting, may not be very useful." So your argument is that one cannot use MaxDiff scores for targeting, but it does tell us which features are important enough to impact purchase behavior.
      You cannot have it both ways. You made the case that MaxDiff was broken in order to justify your data fusion approach for fixing it. I just agreed with you. Moreover, it is not simply an issue of identification constraints. Choice modeling measures demand because one of the alternatives is "none of these." MaxDiff, on the other hand, is a partial ranking task. It deliberately creates tradeoffs in order to break the tie between equally important features. What happens to the high-end buyer who wants it all? What happens to the low-end buyer who cares little about any of the features? MaxDiff forces them to make distinctions.
      When McFadden and others talk about hybrid choice models, the data they seek to combine flows from the marketplace processes and contexts where purchases are made. MaxDiff is far too artificial and removed from the marketplace to provide useful information.
