Monday, January 14, 2013

Warning: Sawtooth's MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!

Sawtooth Software has created a good deal of confusion with its latest sales video published on YouTube.  I was contacted last week by one of my clients who had seen the video and wanted to know why I was not using such a powerful technique for measuring attribute importance.  "It gives you a ratio scale,"  he kept repeating.  And that is what Sawtooth claims. At about nine minutes into the video, we are told that Maximum Difference Scaling yields a ratio scale where a feature with a score of ten is twice as important as a feature with a score of five.

Where should I begin?  Perhaps the best approach is simply to look at the example that Sawtooth provides in the video.  Sawtooth begins with a list of  the following 10 features that might be important to customers when selecting a fast food restaurant:

1. Clean eating areas (floors, tables, and chairs),
2. Clean bathrooms,
3. Has health food items on the menu,
4. Typical wait time is about 5 minutes in line,
5. Prominently shows calorie information on menu,
6. Prices are very reasonable,
7. Your order is always completed correctly,
8. Has a play area for children,
9. Food tastes wonderful, and
10. Restaurant gives generously to charities.

Sawtooth argues in the video that it becomes impractical for a respondent to rank order more than about seven items.  Although that might be true for a phone interview, MaxDiff surveys are administered on the internet.  How hard is it to present all 10 features and ask a respondent which is the most important?  Let's pretend we are respondents with children.  That was easy; "has a play area for children" is the most important.  Now the screen is refreshed with only the remaining nine features, and the respondent is again asked to select the most important feature.  This continues until all the features have been rank ordered.

What if there were 20 features?  The task gets longer and respondents might require some incentive, but the task does not become more difficult.  Picking the best of a list becomes more time consuming as the list gets longer.  However, the cognitive demands of the task remain the same.  One works their way down the list, comparing each new feature to whatever feature was last considered to be the most important.  For example, our hypothetical respondent has selected play area for children as the most important feature.  If another feature were added to the list, they would compare the new feature to play area for children and decide to keep play area or replace it with the new feature.
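The "work down the list, keep the current winner" procedure can be sketched in a few lines of base R. This is only an illustration: the importance values are invented for one hypothetical respondent with children, the short feature names are my own labels, and only the ordering of the values matters.

```r
# Hypothetical importance values for one respondent; the names and numbers
# are made up for illustration and only their ordering is used.
importance <- c(clean_eating = 6.1, clean_bathrooms = 5.4, health_food = 2.2,
                wait_time = 4.8, calorie_info = 1.5, prices = 7.3,
                order_correct = 6.7, play_area = 9.0, taste = 8.2,
                charity = 0.9)

# Pick the best item with a single pass down the list: keep the current
# winner in mind and compare it with each new item in turn.
pick_best <- function(items, value) {
  best <- items[1]
  comparisons <- 0
  for (item in items[-1]) {
    comparisons <- comparisons + 1
    if (value[[item]] > value[[best]]) best <- item
  }
  list(best = best, comparisons = comparisons)
}

# Repeat "pick the best, then remove it" until the list is exhausted.
rank_by_repeated_best <- function(value) {
  remaining <- names(value)
  ranking <- character(0)
  total_comparisons <- 0
  while (length(remaining) > 0) {
    pick <- pick_best(remaining, value)
    ranking <- c(ranking, pick$best)
    total_comparisons <- total_comparisons + pick$comparisons
    remaining <- setdiff(remaining, pick$best)
  }
  list(ranking = ranking, comparisons = total_comparisons)
}

result <- rank_by_repeated_best(importance)
result$ranking      # complete rank ordering, most important first
result$comparisons  # number of two-item comparisons used
```

With n features the procedure needs at most n(n-1)/2 two-item comparisons, which is 45 for the list of 10 and 190 for a list of 20. The task gets longer, but only two features are ever compared at once.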

Sawtooth argues that such a rank ordering is impractical and substitutes a series of best and worst choices from a reduced set of features.  For example, the first four features might be presented to a respondent who is asked to select the most and least important among only those four features.  Since Sawtooth explains their data collection and analysis procedures in detail on their website, I will simply provide the link here and make a couple of points.  First, one needs a lot of these best-worst selections from sets of four features in order to make individual estimates (think incomplete block designs).  Second, it is not the most realistic or interesting task (if you do not believe me, go to the previous link and take the web survey example).  Consequently, only a limited number of best-worst sets are presented to any one respondent, and individual estimates are calculated using hierarchical Bayesian estimation.

This is where most get lost if they are not statisticians.  The video claims that hierarchical Bayes yields ratio scale estimates that sum to a constant.  Obviously, this cannot be correct, not for ratio or interval scaling.  The ratio scale claim refers to the finding that one feature might be selected twice as often as another from the list.  But that "ratio" depends on what else is in the list.  It is not a characteristic of the feature alone.  If you change the list or change the wording of the items in the list, you will get a different result.  For example, what if the wording for the price feature were changed from "very reasonable" to just "reasonable" without the adverb?  How much does the ranking depend on the superlatives used to modify the feature?  Everything is relative.  All the scores from Sawtooth's MaxDiff are relative to the features included in the set and the way they are described (e.g., vividness and ease of affective imagery will also have an impact). 

To make it clear that MaxDiff is nothing more than a rank ordering of the features, consider the following thought experiment.  Suppose that you went through the feature list and rank ordered the 10 features.  Now you are given a set of four features, but I will use your rankings to describe the features where 1=most important and 10=least important.  If the set included features ranked 3rd, 5th, 7th, and 10th, then you would select feature 3 as the most important and feature 10 as the least important.  We could do this forever, because selecting the best and worst depends only on the rank ordering of the features.  Moreover, it does not matter how close or far away the features are from each other; only their rankings matter.
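The thought experiment is easy to run for every possible set of four. The sketch below assumes a respondent who always answers consistently with their ranking (no response error), and the two utility vectors are invented: they share an ordering but have very different spacing.

```r
# Answer a best-worst question using nothing but a complete ranking.
# `ranks` holds one rank per feature (1 = most important).
answer_maxdiff <- function(set, ranks) {
  c(best = set[which.min(ranks[set])], worst = set[which.max(ranks[set])])
}

# Two hypothetical respondents with very different spacing between features
# but the same ordering. The utilities are invented for illustration.
evenly_spaced <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
bunched       <- c(100, 99, 98, 3, 2.9, 2.8, 2.7, 2.6, 0.2, 0.1)

ranks_a <- rank(-evenly_spaced)
ranks_b <- rank(-bunched)

# Every possible set of four features drawn from the ten.
all_sets <- combn(10, 4)

answers_a <- apply(all_sets, 2, answer_maxdiff, ranks = ranks_a)
answers_b <- apply(all_sets, 2, answer_maxdiff, ranks = ranks_b)

# The set from the text: features ranked 3rd, 5th, 7th, and 10th.
answer_maxdiff(c(3, 5, 7, 10), ranks_a)

# Do the two respondents ever answer differently?
identical(answers_a, answers_b)
```

Under these assumptions the two respondents give the same best and worst answers in all 210 sets, even though one of them sees three features as far more important than the rest.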

Actually, Sawtooth has recognized this fact for some time.  In a 2009 technical report, which suggested a possible "fix" called the dual response, they admitted that "MaxDiff measures only relative desirability among the items."  This supposed "adjustment" was in response to an article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate if any of the features are important enough to impact purchase behavior.  All we know is the rank ordering of the features, which we will obtain even if no feature is sufficiently important in the marketplace to change intention or behavior.  Such research has become commonplace with credit card reward features.  It is easy to imagine rank ordering a list of 10 credit card reward features that would provide no incentive to apply for a new card.  It is a product design technique that creates the best product that no one will buy.  [The effort to "fix" MaxDiff continues as you can see in the proceedings of the 2012 Sawtooth Conference.]

The R package compositions

As noted by Karl Pearson some 115 years ago, the constraint that a set of variables sum to some constant value has consequences.  Simply put, if the scores for the 10 features sum to 100, then I have only nine degrees of freedom because I can calculate the value of any one feature once I know the values of the other nine features.  As Pearson noted in 1897, this linear dependency creates a spurious negative correlation among the variables.  Too often, it is simply ignored and the data analyzed as if there were no dependency.  This is an unwise choice, as you can see from this link to The Factor Analysis of Ipsative Measures.
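A small simulation makes Pearson's point concrete. The sketch below uses simulated data, not MaxDiff output: ten independent "raw" importances are drawn for each respondent and then rescaled to sum to 100. The exponential distribution is an arbitrary choice made for illustration.

```r
set.seed(2013)

n_respondents <- 5000
n_features <- 10

# Independent positive "raw" importances: by construction, no feature is
# related to any other. (Simulated data, not MaxDiff output.)
raw <- matrix(rexp(n_respondents * n_features), nrow = n_respondents)

# Force each respondent's scores to sum to 100, as with rescaled scores.
closed <- 100 * raw / rowSums(raw)

off_diagonal <- function(m) m[upper.tri(m)]

mean_cor_raw    <- mean(off_diagonal(cor(raw)))     # close to zero
mean_cor_closed <- mean(off_diagonal(cor(closed)))  # clearly negative

# Each row of the covariance matrix of the closed scores sums to zero, so
# every feature must covary negatively with at least one other feature.
row_sums_cov <- rowSums(cov(closed))
```

The raw scores are uncorrelated on average, while the constrained scores show an average correlation of about -0.11 (that is, -1/9 for ten equally variable parts). The negative correlation comes from the sum constraint alone.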

In the social sciences we call this type of data ipsative.  In geology it is called compositional data (e.g., percent contribution of basic minerals in a rock sum to 100%).  R has a package called compositions that provides a comprehensive treatment of such data.  However, please be advised that the analysis of ipsative or compositional data can be quite demanding, even for those familiar with simplex geometry.  Still, it is an area that has been studied recently by Michael Greenacre (Biplots of Compositional Data) and by Anna Brown (Item Response Modeling of Forced-Choice Questionnaires).

Forced choice or ranking is appealing because it requires respondents to make trade-offs.  This is useful when we believe that respondents are simply rating everything high or low because they are reluctant to tell us everything they know.  However, infrequent users do tend to rate everything as less important because they do not use the product that often and most of the features are not important to them.  On the other hand, heavy users find lots of features to be important since they use the product all the time and for lots of different purposes.

Finally, we need to remember that these importance measures are self reports, and self reports do not have a good track record.  Respondents often do not know what is important to them.  For example, how much do people know about what contributes to their purchase of wine?  Can they tell us if the label on the wine bottle is important?  Mueller and Lockshin compared Best-Worst Scaling (another name for MaxDiff) with a choice modeling task.  MaxDiff tells us that the wine label is not important, but the label had a strong impact on which wine was selected in the choice study.  We should never forget the very real limitations of self-stated importance.

29 comments:

  1. Hi Joel, Nice post! As a statistician in a Market research company that uses MaxDiff from Sawtooth (a lot!), I was wondering whether there are any packages in R that can do MaxDiff?

    1. Yes, one could do something like MaxDiff in R, but I know of no package or function. One would need to generate an incomplete block design (e.g., using AlgDesign) and then analyze the choice data (e.g., using bayesm).

      You might find the AlgDesign documentation a little difficult, so let me get you started with the R-code needed to generate a balanced incomplete block design with 16 attributes or features in 20 blocks of 4 attributes per block.

      library(AlgDesign)
      set.seed(123456789)

      #blocksize=4 and number of blocks=20 and #number of attributes=16
      BIB<-optBlock(~., withinData=factor(1:16), blocksizes=rep(4,20), nRepeats=5000)

      The documentation for bayesm is quite good, but you will need to get the data into a form that bayesm can read, which can be a problem if you are not familiar with R's list structures.

      Still, the point of the post was that one does not want to run MaxDiff. There is no need. If you want a ranking of self-reported importance, just ask for it as I outlined in the post. But be careful with the numbers you get. They are constrained to sum to a constant, either 100 in MaxDiff or the sum of the ranks from a ranking exercise.

    2. When I read my response online, I noticed that the R-code might be a little difficult to read, so let me elaborate.

      You need to call the AlgDesign library:
      library(AlgDesign)

      You need to set the seed for random number generation (any number works):
      set.seed(123456789)

      You need the function optBlock to run the design. I placed the output in an object called BIB. Let me separate each piece on its own line:
      BIB<-optBlock(~.,
      withinData=factor(1:16),
      blocksizes=rep(4,20),
      nRepeats=5000)

      The 5000 iterations should be enough to generate a balanced incomplete block design where each attribute occurs 5 times and is paired once with every other attribute.
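Whether a generated design is actually balanced can be checked with a few lines of base R. The sketch below is not part of AlgDesign: `check_balance` is a hypothetical helper, and it assumes the block membership has already been pulled out of the optBlock result into a matrix with one row per block (that extraction step is not shown). To keep the example self-contained, it is demonstrated on a small design that is known to be balanced, 7 items in 7 blocks of 3, rather than on the 16-item design.

```r
# Check whether a block design is balanced. `blocks` has one row per block
# and one column per position within the block; entries are item numbers.
check_balance <- function(blocks, n_items) {
  incidence <- matrix(0L, nrow = n_items, ncol = nrow(blocks))
  for (b in seq_len(nrow(blocks))) incidence[blocks[b, ], b] <- 1L
  concurrence <- incidence %*% t(incidence)
  list(
    replications = diag(concurrence),                   # times each item appears
    pair_counts  = concurrence[upper.tri(concurrence)]  # times each pair meets
  )
}

# A small design known to be balanced: 7 items in 7 blocks of 3, built by
# cycling the starting block {1, 2, 4} (a stand-in for the 16-item design).
starts <- 0:6
small_design <- cbind(starts, starts + 1, starts + 3) %% 7 + 1

balance <- check_balance(small_design, n_items = 7)
balance$replications  # every item should appear the same number of times
balance$pair_counts   # every pair should meet the same number of times
```

For the 16-attribute design described above, a balanced result would show every replication count equal to 5 and every pair count equal to 1. If the counts are unequal, rerunning with a different seed or more repeats is one option.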

  2. Rating single factors is always unrealistic. That is why true researchers use an orthogonal matrix, which can be used to rate combinations of features. I can't imagine real life researchers using rank ordering - that method is reserved for academic guys who are good at writing inane and increasingly false research papers based on made up data in lofty journals.

  3. Interesting post, but your reference to my and my colleague's article:

    "This supposed "adjustment" was in response to article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate if any of the features are important enough to impact purchase behavior."

    Is incorrect. The issue we have described, and have provided one approach to dealing with, is that the part-worths from Max-Diff (and from CBC, for that matter), cannot be compared numerically across respondents. This is a result of the kind of identification constraints used in order to make choice model parameters estimable.

    Regards,

    Lynd

    1. Thank you for your comment; however, I am a little confused. Bryan Orme does reference your article in his 2009 technical report, mentioning it as one of the reasons for introducing dual-response anchoring. And your article, Comparing Apples to Oranges, begins with a critique of MaxDiff scaling as currently practiced. For example, you claim that MaxDiff should not be used for recruiting focus groups because the scores are ipsative. Specifically, you write, "doing things with ipsative scales, such as segmentation analysis and targeting, may not be very useful." So your argument is that one cannot use MaxDiff scores for targeting, but it does tell us which features are important enough to impact purchase behavior.
      You cannot have it both ways. You made the case that MaxDiff was broken in order to justify your data fusion approach for fixing it. I just agreed with you. Moreover, it is not simply an issue of identification constraints. Choice modeling measures demand because one of the alternatives is "none of these." MaxDiff, on the other hand, is a partial ranking task. It deliberately creates tradeoffs in order to break the tie between equally important features. What happens to the high-end buyer who wants it all? What happens to the low-end buyer who cares little about any of the features? MaxDiff forces them to make distinctions.
      When McFadden and others talk about hybrid choice models, the data they seek to combine flows from the marketplace processes and contexts where purchases are made. MaxDiff is far too artificial and removed from the marketplace to provide useful information.

    2. Hi, Joel. I just happened upon your reply, above.

      Here's a point we wanted to make in our papers about MaxDiff : It doesn't make sense to compare respondents based on their MaxDiff model coefficients, or scores, because identification constraints remove overall differences between respondents. As a result, respondents' coefficients/scores are not necessarily measured on the same scale. Since they may not be on the same scale, it doesn't make sense to segment based on them, as there's no assurance that the segments would represent homogeneous groups. Since they may not be homogeneous, targeting them may not work so well.

      Now, this doesn't mean that MaxDiff is "broken." It can still be used for scaling sets of items based on partial ranking data, for example. A MaxDiff task may be used to scale the perceived importance of features or other items with respect to some dimension, e.g., likelihood of purchase.

      As we've pointed out in some of our papers, conjoint model results are limited in the same way that MaxDiff scores are, and so segmenting based on them may not produce homogeneous segments. And targeting may not work so well.

      Hope the above helps.

    3. Lynd, I want to thank you for taking the time to clarify your views. Clearly, although you are cautious, you are not willing to go as far as I am and reject MaxDiff because it does not offer much that is generalizable to the marketplace. Your position is certainly in the majority with many researchers trying to "improve rather than repeal" MaxDiff. I have read your papers on scale heterogeneity and have found your approach to incorporate auxiliary data to be very interesting. I would encourage readers to check out the library on your website, lba.com.

    4. You're welcome, Joel. I think where we differ is that we have different views about what MaxDiff is supposed to be used for. I see it as a model-based scaling technique that's for different things than what conjoint is for. It wasn't developed as an alternative to conjoint. Why would I "reject" it for not doing things it wasn't developed to do?

      Most all methods have limitations of some sort, of course. Conventional MaxDiff and Conjoint models do share the pesky limitations having to do with the identification constraints. These are actually of two kinds, one having to do with regression coefficients, and the other with models' scale factors.
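The first kind of constraint can be illustrated with a toy multinomial logit example. This is a minimal sketch under stated assumptions, not the estimation used by any particular software: the utilities for the two respondents are invented, and zero-centering is used as one common way to identify the coefficients. The second kind, the scale factor, is not shown.

```r
# Multinomial logit choice probabilities for one choice set.
choice_prob <- function(utility) exp(utility) / sum(exp(utility))

# Hypothetical respondents: one cares a lot about every feature, the other
# cares little about any of them, but the differences among features match.
high_end <- c(9.0, 8.5, 8.0, 7.0)
low_end  <- high_end - 7

# Their choice probabilities are indistinguishable...
p_high <- choice_prob(high_end)
p_low  <- choice_prob(low_end)
all.equal(p_high, p_low)

# ...so a sum-to-zero constraint, one common way to identify the model,
# returns the same coefficients for both respondents.
center <- function(utility) utility - mean(utility)
all.equal(center(high_end), center(low_end))
```

Because the choices carry no information about the overall level, the identified coefficients for the buyer who wants it all and the buyer who cares about little are identical in this example. That is why comparing or segmenting respondents on such scores calls for caution.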

    5. "Warning: Sawtooth's MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!" My title summarizes the point that I was trying to make. I felt that the advertising for MaxDiff made wild claims about the technique yielding ratio-level measurements, and I wanted to introduce a word of caution. It seemed that your article also raised some warnings about using MaxDiff, so I quoted it. Our disagreement appears to be about how good MaxDiff is at yielding a rank ordering. It is a strange argument for me since I believe that ranking attributes is an artificial task borrowed from psychophysics that fails to provide useful information about the purchase process. Yes, every method has limitations, but some more than others. I do not reject MaxDiff because it does not do what it was not developed to do. My criticism of MaxDiff is that it fails to live up to its advertising claims. Thus, you can think of my post as a form of claims substantiation.

    6. I understand your perspective on MaxDiff advertising. Speaking as a cognitive psychologist I do think that there's reason to expect that a partial ranking task with replications will produce more reliable data than will a simple ranking task where a large number of alternatives is concerned.

      It seems to me that much of how we think about assessing judgment and choice has origins in psychophysics, the oldest area of experimental psychology. You can see the footprints of Thurstone's Laws in many quantitative models of choice.

    7. As a cognitive psychologist, you are familiar with the concept of ecological validity and are reluctant to measure anything outside of the context within which it occurs. Thus, like me, you reject most ranking tasks as artificial and intrusive (unless, of course, a marketplace context can be identified where buyers sequentially pick features or services). Psychophysics has its place in the laboratory, but does not perform well in naturalistic settings (e.g., Ebbinghaus did not teach us much about memory in the real world). In the end, MaxDiff is little more than a game that we play with consumers and one that cannot be generalized to the marketplace. We deceive ourselves when we argue otherwise.

    8. Hey, Joel, I just don't happen to agree with you. I apparently have a different understanding of cognitive psychology and experimentalism.

  4. Joel,
    I can't disagree more with your passage suggesting that you can fully rank 20 items with 20 best-only questions. I don't know your educational background, but as a psychologist and MR practitioner I remember one simple law concerning working memory capacity. It states that on average we can process 7 (+/-2) bits of information. No matter how large an incentive you give to your respondents, you can't expect that a rank-20 exercise will give you reliable results. Simple as that. MaxDiff exercises will be easier and more enjoyable. But yes, it's just another way to get partial ranking data.
    The second thing is that what you say about SS claims on rating character of the data doesn't apply to the MD technique but to the way SS analyses MD data. You can use raw data and derive partial rankings on the individual level, but SS (for good reasons) uses HB to arrive at individual-level logistic regression betas. The good reason is filling the gaps of individual uncertainty with population/prior information. You can use the same technique for estimation with full ranking data, or q-sort data (rearranging it to choice data), but it tends to be less effective due to the illusory accuracy of ranking data turned into choice data.
    And last, I can agree that MD gives us only "pseudo ratio" measurement. But they (SS) never say that MD scores are absolute ;)

    1. Selecting the best from 20 alternatives requires the sequential comparison of two alternatives. I look at the first two alternatives and select one. The preferred alternative is compared to the next in the list. If it wins, it is maintained. If it loses, it is replaced by the new winner. The process continues until the end. As a result, only two alternatives are compared in working memory. It is an easy task. You should try it.

      Unfortunately, I do not understand your second point. If I have the rank ordering of the 20 alternatives for a respondent, I know how they will answer any MaxDiff choice set. If I give a respondent their 4th, 10th, 16th, and 17th ranked alternatives, I know that the 4th will be picked as best and the 17th will be picked as worst. Thus, the complete ranking can be used to generate answers to every possible combination. I don't need to "borrow" data using hierarchical Bayes because I have unlimited data for every respondent. In the end, all I know is the rank ordering of the alternatives.

    2. Your description is fine from the algorithmic point of view (a robot's perspective), but I will never believe any real respondent will do it that way. It simply doesn't happen this way. We simplify our lives, and most ranking heuristics will be based on an overview of the whole picture of elements.
      The second point is... measurement error and respondent uncertainty. There is as much of it in ranking data as in any other kind of data. Here the possibility of replacing this missing information comes in handy.

    3. But we do this all the time. It is common to see a survey question with a long list of benefits, features, touch points or something else. The respondent is asked to pick the best or the top three. However, I do not recommend rankings because they are not informative or generalizable. My point is that MaxDiff is a solution to a non-existent problem. Best-worst choice sets are not easier or faster than complete rankings. We just do not need all the extra design and analysis required by the MaxDiff partial ranking with hierarchical Bayes estimation.

    4. As I recall, there were some published results in the 1970s indicating that people have difficulty producing reliable rankings to depths beyond 6 or 7.

    5. To be clear, I am not advocating rankings of any number of attributes unless such a ranking is a common marketplace task with which consumers have some familiarity. My only point was that I could select the best and worst for any MaxDiff choice set once I had the complete ranking of the attributes. When asked if respondents were able to rank a large number of attributes, my response was that they do the hardest first steps all the time. I start the ranking process every time I give respondents a checklist of 20 or 30 things and ask them to pick the one most frequently done or the one most preferred. If I continued with the same list minus the one picked and ask again, I would have the first and second rank. It would take some time to complete the iterations, but in the end I would have a complete ranking. If I repeated the task a week later with the same respondents, I would not expect to find high levels of agreement. But let's get real, attribute checklists do not show more than around 50% agreement for short time intervals. Martin Weigel has a very entertaining post from his blog called canalside view. Look for a post on March 12, 2012. Here is the link http://martinweigel.org/tag/byron-sharp/.

    6. I see. So for, say, 20 items to be ranked, how many pair-wise comparisons do you imagine a respondent would need to make?

      One thing that you get out of a MaxDiff exercise is model-based estimates of the precision of the scale scores underlying the observed partial ranking responses. Simple ranking can't produce precision estimates.

    7. We do this all the time. I give you a list of 20 credit card awards, and you read through the list selecting the most preferred, the next most preferred, and so on. If one felt that the task was taking too long, you could begin with a rating scale or other sort and then ask for rankings of the entities within each scale value. Put another way, I may need paired comparisons when determining the heaviest object, but identifying the longest line does not require a MaxDiff exercise.

      If a reader of this blog wants a reference to learn more about these issues, consider Chapter 7 of Weller and Romney's Systematic Data Collection (1988). It deals in depth with rank order methods including incomplete block designs. It is amazing how much that we consider to be new is simply a reinvention of that already known.

      However, once again, ranking is not a marketplace task, so I do not recommend it. Preferences are constructed not retrieved, thus we need to mimic the marketplace in order to conduct marketing research that is generalizable.

      As to precision estimates, have you checked out the R package StatRank? The authors presented an interesting procedure for the analysis of full rankings at NIPS 2013.

    8. Hi, Joel. I suspect that respondents are often asked to rank large numbers of alternatives. A measurement question that comes to mind is, how reliable are an individual respondent's rankings of large numbers of alternatives? Can you estimate the reliability of a respondent's rankings by just asking her for a single sort of alternatives?

      Why do you think people don't rank or order alternatives in markets?

    9. Let me start with your last question, why don't people order alternatives in the marketplace? Ordering alternatives is a lot of unnecessary work when all that is needed is a choice. This is especially true when there is not a lot of risk. But even then, dominated alternatives are removed from the consideration set. Why do we form consideration sets except to avoid additional labor? As to the reliability of rankings, my guess is that the stability of rankings is quite low whether one asks for a complete ranking or derives a complete ranking using pairwise comparisons or incomplete block designs. Individual behavior is context-dependent so that we see little stability for ratings, rankings, or checklists when we measure the same respondent at two points in time. Remember, however, that once you have a complete ranking for a set of attributes, you can use that ranking to complete any number of MaxDiff tasks using the simple rule that the highest ranked attribute is best and the lowest ranked attribute is worst. Thus, whatever you can do with MaxDiff, you can do with complete rankings by having the computer generate the MaxDiff data from the complete rankings.

    10. Hi, Joel. Thanks for your explanation. It seems to me that when people choose from considered alternatives, they can be thought of as doing at least a partial ranking of the alternatives.

      The point I was after regarding the reliability of ranks is this. If you just collect rankings from respondents, you don't have sufficient information to estimate reliabilities or precisions for their rankings. If, on the other hand, you use a choice model-based method that involves repeated comparisons of alternatives, like MaxDiff does, you can estimate precisions for the parameter estimates underlying the observed choices for each respondent.

    11. Paired comparisons and incomplete block designs are procedures for acquiring complete rankings when respondents are unable to provide complete rankings (e.g., comparing the weights of many objects or the tastes of many foods/beverages). Give each respondent all the combinations in an incomplete design, and you have the complete ranking of the attributes. As a result, THERE IS NOTHING THAT YOU CAN DO WITH MAXDIFF THAT CANNOT BE DONE WITH A COMPLETE RANKING. Why? Once you have the complete ranking, you can produce every possible best-worst comparison, as I have elaborated in the post and my previous replies. We do not wish to mislead our readers so that they believe that there is some special magic that only MaxDiff can perform.

    12. Joel, what you point out here about using designs to get complete rankings is in fact what a typical MaxDiff task entails, albeit without getting two responses within each choice set or block if you just get a "best" or "most" choice in each set. Getting two responses instead of one per set provides more information for estimation purposes, of course. You'd have even more information if you got complete rankings within each set, but this would make for a harder task for respondents, and a more difficult modeling task for the analyst.

      To estimate respondent-level choice model parameters and precisions for them, your design has to provide for a sufficient number of pair-wise comparisons of the items being scaled, of course.

      Joel, thanks for the stimulating conversation. I've enjoyed it.

    13. I enjoyed it also. And thank you Lynd for taking the time and making the effort to explain an opposing point of view. My readers have certainly gained by having your comments.

  5. And now to your comment:
    MaxDiff is far too artificial and removed from the marketplace to provide useful information.
    A few years ago I conducted a study where we tested several offerings of a service. The realism of the marketplace didn't let us use a choice-based conjoint due to the several prohibitions needed. We employed MaxDiff with full profiles as items. About 1/3 of the items were real market services and 2/3 were new proposals. We conducted a simulation on utils from HB, and the results were accurate to within 1pp of the market shares of those services.
    So artificial or not, it's good enough at revealing preferences to be used in market research.
    BTW, your post on the quantum physics view of MR measurement is quite interesting.

    1. Perhaps I do not understand your example. It seems that you used an incomplete block design to reduce the number of full profiles presented in each choice set. If carefully executed, such a design might mimic the marketplace closely enough that the findings could be generalized (though there is no need to ask for worst). Obviously, this is not the typical MaxDiff study, such as the example from the Sawtooth website that I used in this post.
