
Monday, January 14, 2013

Warning: Sawtooth's MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!

Sawtooth Software has created a good deal of confusion with its latest sales video published on YouTube.  I was contacted last week by one of my clients who had seen the video and wanted to know why I was not using such a powerful technique for measuring attribute importance.  "It gives you a ratio scale,"  he kept repeating.  And that is what Sawtooth claims. At about nine minutes into the video, we are told that Maximum Difference Scaling yields a ratio scale where a feature with a score of ten is twice as important as a feature with a score of five.

Where should I begin?  Perhaps the best approach is simply to look at the example that Sawtooth provides in the video.  Sawtooth begins with a list of  the following 10 features that might be important to customers when selecting a fast food restaurant:

1. Clean eating areas (floors, tables, and chairs),
2. Clean bathrooms,
3. Has health food items on the menu,
4. Typical wait time is about 5 minutes in line,
5. Prominently shows calorie information on menu,
6. Prices are very reasonable,
7. Your order is always completed correctly,
8. Has a play area for children,
9. Food tastes wonderful, and
10. Restaurant gives generously to charities.

Sawtooth argues in the video that it becomes impractical for a respondent to rank order more than about seven items.  Although that might be true for a phone interview, MaxDiff surveys are administered on the internet.  How hard is it to present all 10 features and ask a respondent which is the most important?  Let's pretend we are respondents with children.  That was easy; "has a play area for children" is the most important.  Now the screen is refreshed with only the remaining nine features, and the respondent is again asked to select the most important feature.  This continues until all the features have been rank ordered.

What if there were 20 features? The task gets longer and respondents might require some incentive, but it does not become more difficult. Picking the best from a list becomes more time-consuming as the list gets longer; however, the cognitive demands of the task remain the same. One works down the list, comparing each new feature to whichever feature is currently held to be the most important. For example, our hypothetical respondent has selected the play area for children as the most important feature. If another feature were added to the list, they would compare the new feature to the play area and decide whether to keep the play area or replace it with the new feature.
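
For readers who like to see procedures as code, here is a minimal R sketch of that sequential best-picking process. Everything in it is hypothetical: the feature list paraphrases the one from the video, and the utility vector and prefers() helper simply stand in for a real respondent's judgments.

# A minimal sketch of the "pick the best, refresh the list, repeat" procedure.
# The utility vector and prefers() helper are hypothetical stand-ins for a respondent.
features <- c("clean eating areas", "clean bathrooms", "health food items",
              "five minute wait", "calorie information", "reasonable prices",
              "order always correct", "play area for children",
              "food tastes wonderful", "gives to charities")

set.seed(42)
utility <- setNames(runif(length(features)), features)
prefers <- function(a, b) if (utility[a] >= utility[b]) a else b

rank_by_repeated_best <- function(items) {
  ranking <- character(0)
  while (length(items) > 0) {
    best <- Reduce(prefers, items)   # compare each feature to the current winner
    ranking <- c(ranking, best)      # record the most important remaining feature
    items <- setdiff(items, best)    # refresh the screen without the chosen feature
  }
  ranking
}

rank_by_repeated_best(features)      # a complete rank ordering of all 10 features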

Sawtooth argues that such a rank ordering is impractical and substitutes a series of best and worst choices from a reduced set of features.  For example, the first four features might be presented to a respondent who is asked to select the most and least important among only those four features.  Since Sawtooth explains their data collection and analysis procedures in detail on their website, I will simply provide the link here and make a couple of points.  First, one needs a lot of these best-worst selections from sets of four features in order to make individual estimates (think incomplete block designs).  Second, it is not the most realistic or interesting task (if you do not believe me, go to the previous link and take the web survey example).  Consequently, only a limited number of best-worst sets are presented to any one respondent, and individual estimates are calculated using hierarchical Bayesian estimation.

This is where most readers get lost if they are not statisticians. The video claims that hierarchical Bayes yields ratio-scale estimates that sum to a constant. Obviously, this cannot be correct for either ratio or interval scaling: scores forced to sum to a constant are relative by construction. The ratio-scale claim refers to the finding that one feature might be selected twice as often as another from the list. But that "ratio" depends on what else is in the list. It is not a characteristic of the feature alone. If you change the list or change the wording of the items in the list, you will get a different result. For example, what if the wording for the price feature were changed from "very reasonable" to just "reasonable" without the adverb? How much does the ranking depend on the superlatives used to modify the feature? Everything is relative. All the scores from Sawtooth's MaxDiff are relative to the features included in the set and the way they are described (e.g., vividness and ease of affective imagery will also have an impact).

To make it clear that MaxDiff is nothing more than a rank ordering of the features, consider the following thought experiment. Suppose that you went through the feature list and rank ordered the 10 features. Now you are given a set of four features, but I will use your rankings to describe the features, where 1=most important and 10=least important. If the set included the features ranked 3rd, 5th, 7th, and 10th, then you would select feature 3 as the most important and feature 10 as the least important. We could do this forever, because selecting the best and worst depends only on the rank ordering of the features. Moreover, it does not matter how close or far away the features are from each other; only their rankings matter.
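
The thought experiment is easy to simulate. Here is a minimal R sketch, assuming we already have a respondent's complete ranking (1 = most important): the answer to any best-worst set follows mechanically from the ranks.

# Given a complete ranking, the best-worst answer for any MaxDiff set is determined:
# best = lowest rank number in the set, worst = highest rank number in the set.
answer_maxdiff_set <- function(ranks_in_set) {
  list(best  = names(ranks_in_set)[which.min(ranks_in_set)],
       worst = names(ranks_in_set)[which.max(ranks_in_set)])
}

# Hypothetical respondent: the 10 features labeled by their ranks
ranking <- setNames(1:10, paste("feature ranked", 1:10))

# The set containing the features ranked 3rd, 5th, 7th, and 10th
answer_maxdiff_set(ranking[c(3, 5, 7, 10)])
# $best is the 3rd-ranked feature; $worst is the 10th-ranked feature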

Actually, Sawtooth has recognized this fact for some time. In a 2009 technical report, which suggested a possible "fix" called the dual response, they admitted that "MaxDiff measures only relative desirability among the items." This supposed "adjustment" was in response to an article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate whether any of the features are important enough to impact purchase behavior. All we know is the rank ordering of the features, which we will obtain even if no feature is sufficiently important in the marketplace to change intention or behavior. Such research has become commonplace with credit card reward features. It is easy to imagine rank ordering a list of 10 credit card reward features that would provide no incentive to apply for a new card. It is a product design technique that creates the best product that no one will buy. [The effort to "fix" MaxDiff continues, as you can see in the proceedings of the 2012 Sawtooth Conference.]

The R package compositions

As noted by Karl Pearson some 115 years ago, the constraint that a set of variables sum to some constant value has consequences. Simply put, if the scores for the 10 features sum to 100, then I have only nine degrees of freedom because I can calculate the value of any one feature once I know the values of the other nine features. As Pearson noted in 1897, this linear dependency creates a spurious negative correlation among the variables. Too often, it is simply ignored and the data analyzed as if there were no dependency. This is an unwise choice, as you can see from this link to The Factor Analysis of Ipsative Measures.
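
A short simulation makes Pearson's point concrete. The sketch below generates ten independent scores per respondent, forces them to sum to 100, and shows the negative correlation that the closure alone induces (the exact numbers will vary with the seed).

# Spurious negative correlation induced by forcing scores to sum to a constant
set.seed(1897)
raw <- matrix(rexp(1000 * 10), nrow = 1000)    # 10 independent, uncorrelated scores
closed <- 100 * raw / rowSums(raw)             # rescale each row to sum to 100

mean(cor(raw)[lower.tri(cor(raw))])            # near 0, as constructed
mean(cor(closed)[lower.tri(cor(closed))])      # negative, roughly -1/(10 - 1)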

In the social sciences we call this type of data ipsative. In geology it is called compositional data (e.g., the percent contributions of the basic minerals in a rock sum to 100%). R has a package called compositions that provides a comprehensive treatment of such data. However, please be advised that the analysis of ipsative or compositional data can be quite demanding, even for those familiar with simplex geometry. Still, it is an area that has been studied recently by Michael Greenacre (Biplots of Compositional Data) and by Anna Brown (Item Response Modeling of Forced-Choice Questionnaires).
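
For readers who want to experiment, a minimal sketch of the usual workflow with the compositions package might look like the following (assuming the package is installed from CRAN; the data here are simulated). The log-ratio transformations are the standard way to move such data off the simplex before applying ordinary multivariate methods.

# Minimal sketch using the compositions package (assumed installed from CRAN)
library(compositions)

set.seed(123)
raw <- matrix(rexp(200 * 5), nrow = 200)   # pretend scores for 5 features
scores <- 100 * raw / rowSums(raw)         # closed data: each row sums to 100

comp <- acomp(scores)        # treat each row as a composition on the simplex
clr_scores <- clr(comp)      # centered log-ratio transform (for PCA, biplots, etc.)
ilr_scores <- ilr(comp)      # isometric log-ratio transform (full rank: 4 columns)
summary(comp)                # compositional summary statistics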

Forced choice or ranking is appealing because it requires respondents to make trade-offs.  This is useful when we believe that respondents are simply rating everything high or low because they are reluctant to tell us everything they know.  However, infrequent users do tend to rate everything as less important because they do not use the product that often and most of the features are not important to them.  On the other hand, heavy users find lots of features to be important since they use the product all the time and for lots of different purposes.

Finally, we need to remember that these importance measures are self-reports, and self-reports do not have a good track record. Respondents often do not know what is important to them. For example, how much do people know about what contributes to their purchase of wine? Can they tell us if the label on the wine bottle is important? Mueller and Lockshin compared Best-Worst Scaling (another name for MaxDiff) with a choice modeling task. MaxDiff tells us that the wine label is not important, but the label had a strong impact on which wine was selected in the choice study. We should never forget the very real limitations of self-stated importance.

33 comments:

  1. Hi Joel, Nice post! As a statistician in a Market research company that uses MaxDiff from Sawtooth (a lot!), I was wondering whether there are any packages in R that can do MaxDiff?

    Replies
    1. Yes, one could do something like MaxDiff in R, but I know of no package or function. One would need to generate an incomplete block design (e.g., using AlgDesign) and then analyze the choice data (e.g., using bayesm).

      You might find the AlgDesign documentation a little difficult, so let me get you started with the R-code needed to generate a balanced incomplete block design with 16 attributes or features in 20 blocks of 4 attributes per block.

      library(AlgDesign)
      set.seed(123456789)

      # block size = 4, number of blocks = 20, number of attributes = 16
      BIB<-optBlock(~., withinData=factor(1:16), blocksizes=rep(4,20), nRepeats=5000)

      The documentation for bayesm is quite good, but you will need to get the data into a form that bayesm can read, which can be a problem if you are not familiar with R's list structures.
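
      If it helps, here is a rough sketch of the shape of the list structure that bayesm's hierarchical multinomial logit routine expects. Treat it as an illustration only: the sizes are arbitrary, the design matrix is a random placeholder, and the recoding of the worst choices is glossed over entirely.

      # Rough sketch of the data structure for bayesm::rhierMnlRwMixture.
      # One list element per respondent, each holding a vector of choices y and
      # a stacked design matrix X with p rows per choice set (all values illustrative).
      library(bayesm)
      p <- 4; nsets <- 20; nresp <- 100
      lgtdata <- lapply(1:nresp, function(i) {
        X <- matrix(rnorm(nsets * p * 3), ncol = 3)   # placeholder attribute coding
        y <- sample(1:p, nsets, replace = TRUE)       # placeholder "best" choices
        list(y = y, X = X)
      })
      out <- rhierMnlRwMixture(Data  = list(p = p, lgtdata = lgtdata),
                               Prior = list(ncomp = 1),
                               Mcmc  = list(R = 2000, keep = 10))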

      Still, the point of the post was that one does not want to run MaxDiff. There is no need. If you want a ranking of self-reported importance, just ask for it as I outlined in the post. But be careful with the numbers you get. They are constrained to sum to a constant, either 100 in MaxDiff or the sum of the ranks from a ranking exercise.

    2. When I read my response online, I noticed that the R-code might be a little difficult to read, so let me elaborate.

      You need to call the AlgDesign library:
      library(AlgDesign)

      You need to set the seed for random number generation (any number works):
      set.seed(123456789)

      You need the function optBlock to run the design. I placed the output in an object called BIB. Let me separate each piece on its own line:
      optBlock(~.,
      withinData=factor(1:16),
      blocksizes=rep(4,20),
      nRepeats=5000)

      The 5000 iterations should be enough to generate a balanced incomplete block design where each attribute occurs 5 times and is paired once with every other attribute.
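
      A quick way to check those properties, assuming (as I read the AlgDesign documentation) that BIB$rows returns the selected attribute indices block by block:

      # Hedged check of the design's balance; assumes BIB$rows lists the chosen
      # attribute indices block by block (see the optBlock documentation).
      design <- matrix(BIB$rows, nrow = 20, ncol = 4, byrow = TRUE)
      table(design)              # each of the 16 attributes should appear 5 times

      pairs <- matrix(0, 16, 16) # count how often each pair shares a block
      for (b in 1:20) {
        for (pr in combn(sort(design[b, ]), 2, simplify = FALSE)) {
          pairs[pr[1], pr[2]] <- pairs[pr[1], pr[2]] + 1
        }
      }
      table(pairs[upper.tri(pairs)])   # ideally every pair occurs exactly once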

  2. Rating single factors is always unrealistic. That is why true researchers use an orthogonal matrix which can be used to rate combinations of features. I can't imagine real-life researchers using rank ordering - that method is reserved for academic guys who are good at writing inane and increasingly false research papers based on made-up data in lofty journals.

  3. Interesting post, but your reference to my and my colleague's article:

    "This supposed "adjustment" was in response to article by Lynd Bacon and others pointing out that there is nothing in MaxDiff scoring to indicate if any of the features are important enough to impact purchase behavior."

    is incorrect. The issue we have described, and have provided one approach to dealing with, is that the part-worths from MaxDiff (and from CBC, for that matter) cannot be compared numerically across respondents. This is a result of the kind of identification constraints used in order to make choice model parameters estimable.

    Regards,

    Lynd

    Replies
    1. Thank you for your comment; however, I am a little confused. Bryan Orme does reference your article in his 2009 technical report, mentioning it as one of the reasons for introducing dual-response anchoring. And your article, Comparing Apples to Oranges, begins with a critique of MaxDiff scaling as currently practiced. For example, you claim that MaxDiff should not be used for recruiting focus groups because the scores are ipsative. Specifically, you write, "doing things with ipsative scales, such as segmentation analysis and targeting, may not be very useful." So your argument is that one cannot use MaxDiff scores for targeting, yet they do tell us which features are important enough to impact purchase behavior?
      You cannot have it both ways. You made the case that MaxDiff was broken in order to justify your data fusion approach for fixing it. I just agreed with you. Moreover, it is not simply an issue of identification constraints. Choice modeling measures demand because one of the alternatives is "none of these." MaxDiff, on the other hand, is a partial ranking task. It deliberately creates tradeoffs in order to break the tie between equally important features. What happens to the high-end buyer who wants it all? What happens to the low-end buyer who cares little about any of the features? MaxDiff forces them to make distinctions.
      When McFadden and others talk about hybrid choice models, the data they seek to combine flows from the marketplace processes and contexts where purchases are made. MaxDiff is far too artificial and removed from the marketplace to provide useful information.

    2. Hi, Joel. I just happened upon your reply, above.

      Here's a point we wanted to make in our papers about MaxDiff: It doesn't make sense to compare respondents based on their MaxDiff model coefficients, or scores, because identification constraints remove overall differences between respondents. As a result, respondents' coefficients/scores are not necessarily measured on the same scale. Since they may not be on the same scale, it doesn't make sense to segment based on them, as there's no assurance that the segments would represent homogeneous groups. Since they may not be homogeneous, targeting them may not work so well.

      Now, this doesn't mean that MaxDiff is "broken." It can still be used for scaling sets of items based on partial ranking data, for example. A MaxDiff task may be used to scale the perceived importance of features or other items with respect to some dimension, e.g., likelihood of purchase.

      As we've pointed out in some of our papers, conjoint model results are limited in the same way that MaxDiff scores are, and so segmenting based on them may not produce homogeneous segments. And targeting may not work so well.

      Hope the above helps.

    3. Lynd, I want to thank you for taking the time to clarify your views. Clearly, although you are cautious, you are not willing to go as far as I am and reject MaxDiff because it does not offer much that is generalizable to the marketplace. Your position is certainly in the majority, with many researchers trying to "improve rather than repeal" MaxDiff. I have read your papers on scale heterogeneity and have found your approach to incorporating auxiliary data to be very interesting. I would encourage readers to check out the library on your website, lba.com.

    4. You're welcome, Joel. I think where we differ is that we have different views about what MaxDiff is supposed to be used for. I see it as a model-based scaling technique that's for different things than what conjoint is for. It wasn't developed as an alternative to conjoint. Why would I "reject" it for not doing things it wasn't developed to do?

      Almost all methods have limitations of some sort, of course. Conventional MaxDiff and conjoint models do share the pesky limitations having to do with the identification constraints. These are actually of two kinds, one having to do with regression coefficients, and the other with models' scale factors.

    5. "Warning: Sawtooth's MaxDiff Is Nothing More Than a Technique for Rank Ordering Features!" My title summarizes the point that I was trying to make. I felt that the advertising for MaxDiff made wild claims about the technique yielding ratio-level measurements, and I wanted to introduce a word of caution. It seemed that your article also raised some warnings about using MaxDiff, so I quoted it. Our disagreement appears to be about how good MaxDiff is at yielding a rank ordering. It is a strange argument for me since I believe that ranking attributes is an artificial task borrowed from psychophysics that fails to provide useful information about the purchase process. Yes, every method has limitations, but some more than others. I do not reject MaxDiff because it does not do what it was not developed to do. My criticism of MaxDiff is that it fails to live up to its advertising claims. Thus, you can think of my post as a form of claims substantiation.

    6. I understand your perspective on MaxDiff advertising. Speaking as a cognitive psychologist I do think that there's reason to expect that a partial ranking task with replications will produce more reliable data than will a simple ranking task where a large number of alternatives is concerned.

      It seems to me that much of how we think about assessing judgment and choice has origins in psychophysics, the oldest area of experimental psychology. You can see the footprints of Thurstone's Laws in many quantitative models of choice.

    7. As a cognitive psychologist, you are familiar with the concept of ecological validity and are reluctant to measure anything outside of the context within which it occurs. Thus, like me, you reject most ranking tasks as artificial and intrusive (unless, of course, a marketplace context can be identified where buyers sequentially pick features or services). Psychophysics has its place in the laboratory, but it does not perform well in naturalistic settings (e.g., Ebbinghaus did not teach us much about memory in the real world). In the end, MaxDiff is little more than a game that we play with consumers and one that cannot be generalized to the marketplace. We deceive ourselves when we argue otherwise.

    8. Hey, Joel, I just don't happen to agree with you. I apparently have a different understanding of cognitive psychology and experimentalism.

  4. Joel,
    I can't disagree more with your passage suggesting that you can fully rank 20 items with 20 best-only questions. I don't know your educational background, but as a psychologist and MR practitioner I remember one simple law concerning working memory capacity: on average we can process 7 (+/- 2) bits of information. No matter how large an incentive you give your respondents, you can't expect that a rank-20 exercise will give you reliable results. Simple as that. MaxDiff exercises will be easier and more enjoyable. But yes, it's just another way to get partial ranking data.
    The second thing is that what you say about Sawtooth's claims about the ratio character of the data doesn't apply to the MaxDiff technique itself but to the way Sawtooth analyzes MaxDiff data. You can use the raw data and derive partial rankings at the individual level, but Sawtooth (for good reasons) uses HB to arrive at individual-level logistic regression betas. That good reason is filling the gaps of individual uncertainty with population/prior information. You can use the same estimation technique for full ranking data or q-sort data (rearranged into choice data), but it tends to be less effective due to the illusory accuracy of ranking data turned into choice data.
    Lastly, I can agree that MaxDiff gives us only "pseudo ratio" measurement. But they (Sawtooth) never say that MaxDiff scores are absolute ;)

    Replies
    1. Selecting the best from 20 alternatives requires the sequential comparison of two alternatives. I look at the first two alternatives and select one. The preferred alternative is compared to the next in the list. If it wins, it is maintained. If it loses, it is replaced by the new winner. The process continues until the end. As a result, only two alternatives are compared in working memory. It is an easy task. You should try it.

      Unfortunately, I do not understand your second point. If I have the rank ordering of the 20 alternatives for a respondent, I know how they will answer any MaxDiff choice set. If I give a respondent their 4th, 10th, 16th, and 17th ranked alternatives, I know that the 4th will be picked as best and the 17th will be picked as worst. Thus, the complete ranking can be used to generate answers to every possible combination. I don't need to "borrow" data using hierarchical Bayes because I have unlimited data for every respondent. In the end, all I know is the rank ordering of the alternatives.

    2. Your description is fine from the algorithmic point of view (a robot's perspective), but I will never believe any real respondent does it that way. It simply doesn't happen this way. We simplify our lives, and most ranking heuristics are based on an overview of the whole set of elements.
      The second point is... measurement error and respondent uncertainty. There is as much of it in ranking data as in any other kind of data. Here the possibility of filling in this missing information comes in handy.

    3. But we do this all the time. It is common to see a survey question with a long list of benefits, features, touch points or something else. The respondent is asked to pick the best or the top three. However, I do not recommend rankings because they are not informative or generalizable. My point is that MaxDiff is a solution to a non-existent problem. Best-worst choice sets are not easier or faster than complete rankings. We just do not need all the extra design and analysis required by the MaxDiff partial ranking with hierarchical Bayes estimation.

    4. As I recall, there were some published results in the 1970s indicating that people have difficulty producing reliable rankings to depths beyond 6 or 7.

    5. To be clear, I am not advocating rankings of any number of attributes unless such a ranking is a common marketplace task with which consumers have some familiarity. My only point was that I could select the best and worst for any MaxDiff choice set once I had the complete ranking of the attributes. When asked if respondents were able to rank a large number of attributes, my response was that they do the hardest first steps all the time. I start the ranking process every time I give respondents a checklist of 20 or 30 things and ask them to pick the one most frequently done or the one most preferred. If I continued with the same list minus the one picked and asked again, I would have the first and second ranks. It would take some time to complete the iterations, but in the end I would have a complete ranking. If I repeated the task a week later with the same respondents, I would not expect to find high levels of agreement. But let's get real: attribute checklists do not show more than around 50% agreement over short time intervals. Martin Weigel has a very entertaining post on his blog, Canalside View. Look for the post from March 12, 2012. Here is the link: http://martinweigel.org/tag/byron-sharp/.

    6. I see. So for, say, 20 items to be ranked, how many pair-wise comparisons do you imagine a respondent would need to make?

      One thing that you get out of a MaxDiff exercise is a set of model-based estimates of the precision of the scale scores underlying the observed partial ranking responses. Simple ranking can't produce precision estimates.

    7. We do this all the time. I give you a list of 20 credit card rewards, and you read through the list selecting the most preferred, the next most preferred, and so on. If the task were taking too long, you could begin with a rating scale or some other sort and then ask for rankings of the entities within each scale value. Put another way, I may need paired comparisons when determining the heaviest object, but identifying the longest line does not require a MaxDiff exercise.

      If a reader of this blog wants a reference to learn more about these issues, consider Chapter 7 of Weller and Romney's Systematic Data Collection (1988). It deals in depth with rank order methods, including incomplete block designs. It is amazing how much of what we consider to be new is simply a reinvention of what is already known.

      However, once again, ranking is not a marketplace task, so I do not recommend it. Preferences are constructed not retrieved, thus we need to mimic the marketplace in order to conduct marketing research that is generalizable.

      As to precision estimates, have you checked out the R package StatRank? The authors presented an interesting procedure for the analysis of full rankings at NIPS 2013.

    8. Hi, Joel. I suspect that respondents are often asked to rank large numbers of alternatives. A measurement question that comes to mind is, how reliable are an individual respondent's rankings of large numbers of alternatives? Can you estimate the reliability of a respondent's rankings by just asking her for a single sort of the alternatives?

      Why do you think people don't rank or order alternatives in markets?

    9. Let me start with your last question: why don't people order alternatives in the marketplace? Ordering alternatives is a lot of unnecessary work when all that is needed is a choice. This is especially true when there is not a lot of risk. But even then, dominated alternatives are removed from the consideration set. Why do we form consideration sets except to avoid additional labor? As to the reliability of rankings, my guess is that the stability of rankings is quite low, whether one asks for a complete ranking or derives a complete ranking using pairwise comparisons or incomplete block designs. Individual behavior is context-dependent, so we see little stability for ratings, rankings, or checklists when we measure the same respondent at two points in time. Remember, however, that once you have a complete ranking for a set of attributes, you can use that ranking to complete any number of MaxDiff tasks using the simple rule that the highest ranked attribute is best and the lowest ranked attribute is worst. Thus, whatever you can do with MaxDiff, you can do with complete rankings by having the computer generate the MaxDiff data from the complete rankings.

    10. Hi, Joel. Thanks for your explanation. It seems to me that when people choose from considered alternatives, they can be thought of as doing at least a partial ranking of the alternatives.

      The point I was after regarding the reliability of ranks is this. If you just collect rankings from respondents, you don't have sufficient information to estimate reliabilities or precisions for their rankings. If, on the other hand, you use a choice model-based method that involves repeated comparisons of alternatives, like MaxDiff does, you can estimate precisions for the parameter estimates underlying the observed choices for each respondent.

    11. Paired comparisons and incomplete block designs are procedures for acquiring complete rankings when respondents are unable to provide complete rankings (e.g., comparing the weights of many objects or the tastes of many foods/beverages). Give each respondent all the combinations in an incomplete design, and you have the complete ranking of the attributes. As a result, THERE IS NOTHING THAT YOU CAN DO WITH MAXDIFF THAT CANNOT BE DONE WITH A COMPLETE RANKING. Why? Once you have the complete ranking, you can produce every possible best-worst comparison, as I have elaborated in the post and my previous replies. We do not wish to mislead our readers so that they believe that there is some special magic that only MaxDiff can perform.

    12. Joel, what you point out here about using designs to get complete rankings is in fact what a typical MaxDiff task entails, albeit without getting two responses within each choice set or block if you just get a "best" or "most" choice in each set. Getting two responses instead of one per set provides more information for estimation purposes, of course. You'd have even more information if you got complete rankings within each set, but this would make for a harder task for respondents, and a more difficult modeling task for the analyst.

      To estimate respondent-level choice model parameters and precisions for them, your design has to provide a sufficient number of pair-wise comparisons of the items being scaled, of course.

      Joel, thanks for the stimulating conversation. I've enjoyed it.

    13. I enjoyed it also. And thank you Lynd for taking the time and making the effort to explain an opposing point of view. My readers have certainly gained by having your comments.

  5. And now to your comment:
    MaxDiff is far too artificial and removed from the marketplace to provide useful information.
    A few years ago I conducted a study where we tested several offerings of a service. The realities of the marketplace didn't let us use a choice-based conjoint because of the number of prohibitions needed. We employed MaxDiff with full profiles as items. About 1/3 of the items were real market services; 2/3 were new proposals. We ran simulations on the utilities from HB, and the results were accurate to within 1 percentage point of the market shares of those services.
    So, artificial or not, it's good enough at revealing preferences to be used in market research.
    BTW, your post on a quantum physics view of MR measurement is quite interesting.

    Replies
    1. Perhaps I do not understand your example. It seems that you used an incomplete block design to reduce the number of full profiles presented in each choice set. If carefully executed, such a design might mimic the marketplace closely enough that the findings could be generalized (though there is no need to ask for worst). Obviously, this is not the typical MaxDiff study, such as the example from the Sawtooth website that I used in this post.

  6. Great blog buddy!!

    I am designing a CBC. I have 3 factors with 2, 3, and 4 levels respectively. Using SPSS Orthoplan, I created 16 profiles.

    I created the choice set design as a balanced incomplete block design using the R package crossdes. This gave me 80 choice sets which I need to administer in a survey.

    my.design = find.BIB(16, 80, 4)
    my.design
    isGYD(my.design)

    The question is: I have 80 choice sets, which is huge. Can I randomly divide this into two or four subsets and administer them to different respondents? Or should I do something else?

    Replies
    1. How many alternatives in the choice set? If there are three alternatives, then use the 16 choice sets from the 2x3x4 Orthoplan design and randomly select some fraction for each respondent to see (e.g., 4 or 6 or 8 or 12 of the 16 choice sets). If you have 4 alternatives, each with 3 varying factors, then you need to generate a 2x3x4x2x3x4x2x3x4x2x3x4 fractional design and randomly show some fraction to each respondent. I hope that helps.
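
      As a rough sketch of the random-assignment idea (the object names and counts below are purely illustrative), the split itself is a one-liner in R:

      # Hypothetical sketch: show each respondent a random fraction of the choice sets.
      n_sets  <- 16     # choice sets available (e.g., the 16 sets described above)
      n_shown <- 8      # sets administered to each respondent
      n_resp  <- 200    # respondents
      shown <- t(replicate(n_resp, sort(sample(n_sets, n_shown))))
      head(shown)       # row i gives the choice-set indices seen by respondent i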

  7. I have a question: How much experience do you have conducting consumer surveys in person?

    I ask, because earlier in my career I did hundreds of in-person surveys in various situations and settings. I've also had groups of employees at a company do things like rank-order 5 to 6 factors from most to least important. People even screwed this up.

    I think you're over-estimating what people will do and how they answer questions on email surveys.

    Replies
    1. The post does not argue for the collection of any ranking data using any method. In fact, the last paragraph questions all self-reports, which includes rankings and importance ratings. I am not defending ranking but merely pointing out that best-worst or MaxDiff scaling is nothing more than a rank ordering dressed up as choice modeling.
