Monday, November 3, 2014

Let's Do Some MORE Hierarchical Bayes Choice Modeling in R!

R enables us to "test drive" statistical models by running simulations and analyzing actual data. We can look at the output and get some sense of how the whole thing works. The goal is to provide a realistic application that is not too difficult to follow: here is the data, next reformat that data into a form that the function can handle, now for the code to run the analysis and get the output, and finally a little interpretation.

This was my motivation for the original post "Let's Do Some Hierarchical Bayes Choice Modeling in R!" (without the "more"). In that example, I varied pricing for two brands in a 3x3 complete factorial design and included a "Buy None of These" checkbox in order to measure demand. As long as consumers can leave the product category and buy nothing, we need to keep that option in the choice set.
In addition, I allowed for alternative-specific effects by estimating the impact of price separately for the two brands. I was trying to mimic the differential effects of price changes for national and generic products (e.g., when buying olive oil becomes a choice between a well-known Italian and a store brand). At least one could argue that the types of price variations seen in this experiment are not unlike what one sees in the store with side-by-side comparisons and periodic price discounts for both national and store brands. As with any experimental study, choice modeling can be obtrusive. We need to be careful not to create price sensitivity that is not present in the market by introducing artificial variation and focusing the consumer's attention.

This mimicking of the purchase process where consumers decide among several products with varying descriptions may well be the most common implementation of choice modeling in marketing research, Yet, choice modeling can also be used to tradeoff features within a realistic context without including alternative-specific effects. For example, Huber and Train analyze data from choice sets where households selected the company from which they wanted to buy electricity. Assuming that every household wants electricity, customers were presented offers from four different suppliers and ask to select one.

You can find the dataset in the R package mlogit and a complete description of the analysis, including R code, in Section 3 Mixed Logit Model. Instead of repeating the details of the mlogit analysis, I will focus on how to transform the dataset into the format that bayesm wants and run a hierarchical Bayes (HB), which according to Huber and Train produces similar results.

The R Code to Format the Data for bayesm and Run the HB
data("Electricity", package = "mlogit")
Electr <-, id="id", choice="choice",
  varying=3:26, shape="wide", sep="")
Elec.mxl <- mlogit(choice~pf+cl+loc+wk+tod+seas|0, Electr,
  rpar=c(pf='n', cl='n', loc='n', wk='n', tod='n', seas='n'),
  R=100, halton=NA, print.level=0, panel=TRUE)
for (i in 1:nresp)
  for (j in 1:nobs) {
matplot(trace, type="l")
write.csv(estimate2, file="estimate.csv")
Created by Pretty R at

Once you understand that mlogit wants the data in a long format and that bayesm is looking for a list structure, it will all become a little easier. There is always nesting in a hierarchical design, and in this case, choice sets are nested within respondent id. Things get a little complicated because all respondents do not complete the same number of choice sets. We have a solution because we know the respondent id, which is what each row of Electricity provides. In addition, each row tells us which of the four offers was selected (choice) and the features for each offer: fixed price (pf1, pf2, pf3, pf4), contract length (cl1, cl2, cl3, cl4), local company (loc1, loc2, loc3, loc4), well-known (wk1, wk2, wk3, wk4), time-of-day (tod1, tod2, tod3, tod4), and seasonal rate (seas1, seas2, seas3, seas4). Now, we have all that is necessary to run the analysis using maximum likelihood (mlogit) or hierarchical Bayes (bayesm).

The mlogit package transforms the wide form with all the choices on a single row into a long form with each alternative on its own row using the ) function. You should print out the data for the first respondent in both the wide Electricity (12 rows) and the long Electr (4x12=48 rows). In the long format, choice is not a number but a TRUE or FALSE for each alternative (alt). All of this will seem familiar if you have used the reshape function in R or the reshape R package.

The bayesm package adopts a different formatting approach by placing each respondent into a list structure containing choice as the dependent variable and the features gathered together as a design matrix. In the end, each respondent would look like the following for the first respondent who is the first element of the list. If you look back at the first 12 rows of Electricity, you will see that the choices are listed below in $y and the values of the six features are each of the 12 choice sets are repeated in blocks of four rows in $X. Thus, the first element of $y (option 4 was picked) and the feature profile for option 4 in the first choice set was (0, 5, 0, 1, 1, 0). Had there been a "none of these" option, it would have been given a "5" in $y and a fifth row with all zeros would have been inserted at the end of each choice set in $X, so that $X would now have 60 rows instead of 48.

> lgtdata[[1]]
 [1] 4 3 4 4 1 4 1 3 1 2 3 4

   pf1 cl1 loc1 wk1 tod1 seas1
1    7   5    0   1    0     0
1    9   1    1   0    0     0
1    0   0    0   0    0     1
1    0   5    0   1    1     0
2    7   0    0   1    0     0
2    9   5    0   1    0     0
2    0   1    1   0    1     0
2    0   5    0   0    0     1
3    9   5    0   0    0     0
3    7   1    0   1    0     0
3    0   0    0   1    1     0
3    0   0    1   0    0     1
4    0   1    0   1    0     1
4    9   1    0   0    0     0
4    7   0    1   0    0     0
4    0   5    0   1    1     0
5    0   0    1   0    1     0
5    9   1    0   1    0     0
5    0   0    0   0    0     1
5    7   5    0   1    0     0
6    0   0    0   0    0     1
6    9   0    0   0    0     0
6    0   1    1   0    1     0
6    7   5    0   1    0     0
7    0   5    0   1    0     1
7    9   0    0   0    0     0
7    0   1    0   0    1     0
7    7   5    1   0    0     0
8    9   0    0   1    0     0
8    7   1    0   0    0     0
8    0   1    1   0    0     1
8    0   5    0   0    1     0
9    0   5    0   1    0     1
9    0   5    0   0    1     0
9    9   0    0   1    0     0
9    7   1    0   0    0     0
10   0   1    0   1    0     1
10   0   5    1   0    1     0
10   7   0    0   0    0     0
10   9   0    0   1    0     0
11   9   5    0   1    0     0
11   0   0    0   1    1     0
11   0   5    1   0    0     1
11   7   1    0   0    0     0
12   0   1    0   1    1     0
12   7   5    0   0    0     0
12   9   5    1   0    0     0
12   0   0    0   1    0     1

How did I get the data in this format? I used a loop: for (i in 1:nresp). First I pulled out only the data for that one respondent and placed it in respdata. The choice column went into ty. The length of ty told me how many choice sets the respondent completed, and my loop continued until I reached that number. Although there are more direct ways to do this, I simply created separate design matrices for each alternative using the correct column number for this particular dataset. Then, rbind stacked these choice set design matrices and placed them in tdesign. Finally, ty and tdesign were added as an element of the list structure called lgtdata. This entire process is repeated until each respondent has their own element in the list (e.g., the last respondent is lgtdata[[361]]).

Interpreting the bayesm Output

The heavy lifting is over for we have the data formatted for bayesm. HB estimation requires some values for the MCMC, in this case the number of simulations (R=2000) and how many to keep (keep=10). My earlier post with the almost identical title covers this in more detail. Although I do not recommend buying or using their software, Sawtooth is a good source for introductory papers and videos. The review by Allenby and Rossi should be helpful. You may wish to watch the video analyzing choice data.

The function that does the work in bayesm is a hierarchical multinomial logit with a mixture of normal distributions. I have set ncomp=1 to indicate that all the respondents come from the same multivariate normal density. Specifically, hierarchical Bayes "borrows" data from other respondents. It is not unlike borrowing clothes from your friends. It helps if they are a similar shape and size. Here, I am claiming that all the respondents come from the same distribution of weights. Different respondents can have different weights, but the weights for all the individuals are normally distributed with a common mean and variance. Consequently, I cannot go back after the fact and segment respondents for I have already claimed that there is one segment in order to borrow data from others.

I limited the number of simulations so that bayesm runs in only a few minutes. You should copy and run the code after installing mlogit and bayesm. The plots after the HB will display the changes in the fit (loglike) and the coefficients over the 200 runs that were kept (R=2000 but only each 10th was saved). Increasing the number of runs will take more time and help one become more confident that they have achieved convergence. Personally, as shown in the R code, I like to compare the coefficients after discarding the first 25%, 50% and 75% of the runs. If the results to not fluctuate, I am more comfortable.

It should be noted that the coefficients from mlogit and bayesm will not be identical. There are always scaling factors when the goal is to predict choice (e.g., doubling all the coefficients returns the same result for the highest scoring alternative remains the highest scoring). Huber and Train introduce such a scaling factor to adjust their Bayesian estimates. In addition, you should remember that only some of the features are binary effects. For example, length of the contact (loc) has levels of 0, 1 and 5 years. How important is length of the contract? It depends on the variation in length. How important is price? It depends on the price range, implying that I can make price more important in the choice process by increasing its range. Whenever we speak of the importance of some attribute, there is always an unspecified range of values assumed by the speaker, which is why stated importance is typically of such little value.

Finally, I have given the R code to write out a file that Excel can read. I ask that you output and look at these coefficients. Moreover, now that you have the individual estimates as an R object, why not examine the descriptive statistics and plots to determine if the normality assumptions are reasonable. Section 3 on the mixed logit model from the mlogit vignette covering Kenneth Train's exercises is especially valuable because it provides a window into how econometricians approach choice modeling. The term "mixed" is used as it is in ANOVA to indicate that some of the parameters can be fixed and others can be random. Do we use the same levels (fixed) or sample new levels (random) when the study is replicated? Respondents are random because the study is replicated with different households. The six features are fixed since the same choice sets are used again without alteration.

I have tried to make it easier to get started with choice modeling by explaining the R code needed to obtain individual-level hierarchical Bayes estimates. My experience is that once you are confident that you can analyze data, you will be more willing to spend the time and effort to learn the statistics and the psychology underlying these models. In this particular case, one learns a great deal by comparing different estimation procedures from two R packages, bayesm and mlogit. As a bonus, we are introduced to the different viewpoints from marketing and econometrics. Marketing sees heterogeneity as an opportunity, while econometric tends to view it more as a source of variation that must be acknowledged and dealt with (almost as a nuisance).


  1. Thanks for your great blog!

    I would greatly appreciate some help on a similar problem. I am using a matrix of participants(rows) and questions (columns) to generate posterior distributions of responses on a likert scale. I just can't get it to work and really need some assistance. I have attached by attempt and sample dataset

    1L, 1L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 4L, 2L, 1L,
    1L, 1L, 1L, 2L, 2L, 1L, 3L, 2L, 4L, 2L, 1L, 1L, 5L, 1L, 3L),
    K2 = c(3L, 1L, 3L, 3L, 4L, 2L, 2L, 1L, 4L, 1L, 5L, 3L, 2L,
    2L, 3L, 3L, 5L, 2L, 4L, 2L, 4L, 5L, 5L, 3L, 2L, 3L, 3L, 3L,
    2L, 2L, 3L, 3L, 3L, 2L, 5L, 5L, 2L, 1L, 4L, 3L, 1L, 2L, 2L
    ), K3 = c(3L, 4L, 5L, 2L, 3L, 3L, 4L, 4L, 4L, 3L, 4L, 4L,
    3L, 5L, 4L, 3L, 3L, 2L, 3L, 3L, 4L, 4L, 3L, 4L, 4L, 3L, 5L,
    3L, 4L, 3L, 5L, 2L, 2L, 4L, 3L, 4L, 4L, 4L, 2L, 4L, 5L, 3L,
    1L), K4 = c(4L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 1L, 2L,
    1L, 2L, 2L, 3L, 1L, 3L, 3L, 3L, 4L, 3L, 2L, 2L, 2L, 3L, 3L,
    3L, 2L, 3L, 3L, 4L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 3L,
    2L), K5 = c(3L, 5L, 5L, 5L, 2L, 1L, 2L, 5L, 5L, 4L, 5L, 5L,
    5L, 1L, 4L, 2L, 4L, 3L, 3L, 3L, 3L, 2L, 1L, 4L, 2L, 5L, 4L,
    5L, 5L, 3L, 3L, 5L, 4L, 2L, 4L, 2L, 1L, 3L, 5L, 4L, 5L, 1L,
    5L), K6 = c(2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 2L, 1L,
    2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L,
    1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
    2L)), .Names = c("Region", "K1", "K2", "K3", "K4", "K5",
    "K6"), class = "data.frame", row.names = c(NA, -43L))


    alpha[1]<- -5; alpha[6]<-10; alpha[2]<- 1.5
    alpha[3]<- 2.5; alpha[4]<-3.5; alpha[5]<-4.5

    for(i in 1:n) {for(j in 1:m)
    {lo[i,j]<-((alpha[x[i,j]]-3)/b[i])+3 -a[i]
    up[i,j]<-((alpha[x[i,j]+1]-3)/b[i])+3 -a[i]}

    for(i in 1:n) {z[i,1:m]~dmnorm(mu[],G[,])I(lo[i,],up[i,])}

    # priors for mu and sigma
    for(i in 1:m) {mu[i]~dunif(0,6)}
    for(j in 1:m) {cor[j,j]<-varcov[j,j]}
    for(i in 1:m-1) {for(j in i+1:m)
    {cor[i,j]<-varcov[i,j]/(sqrt(varcov[i,i]*varcov[j,j])); cor[j,i]<-cor[i,j]}

    # DP Priors for a's and b's using stick-breaking algorithm
    for(i in 1:n) {a[i]<-aa[i,1];b[i]<-(aa[i,2])}
    for(j in 1:n) {for (kk in 1:2) {aa[j,kk]<-theta1[latent[j],kk]}}
    for(i in 1:n) {latent[i]~dcat(pi[1:L1])} pi[1]<-r[1]
    for(j in 2:(L1-1)) {log(pi[j])<-log(r[j])+sum(R1[j,1:j-1])
    for(l in 1:j-1) {R1[j,l]<-log(1-r[l])}} pi[L1]<-1-sum(pi[1:(L1-1)])
    for(j in 1:L1) {r[j]~dbeta(1,mm)}

    # baseline distribution for DP as in (5)
    for(i in 1:L1) {theta1[i,1:2]~dmnorm(zero[1:2],Sab[1:2,1:2])I(LB[],)}
    zero[1]<-0; zero[2]<-1
    Sab[1:2,1:2]~dwish(Omega[1:2,1:2],2); varcovab[1:2,1:2]<-inverse(Sab[,])

    # prior for concentration parameter

    1. I welcome all comments but hopefully you will understand if I do not try to debug your code. It requires a considerable amount of back-and-forth communication and simply more time than I have available. An alternative is stackoverflow (look under tags for R).