Friday, November 14, 2014

In Praise of Substantive Expertise in Data Science

Substantive expertise makes it into the Data Science Venn Diagram from DataCamp's infographic on how to become a data scientist. It's one of the three circles of equal size along with programming and statistics. Regrettably, substantive expertise is never mentioned in the definition of a data scientist as "someone who is better at statistics than any software engineer and better at software engineering than any statistician." And it gets no step. Statistics is the first step, and the remaining steps cover programming in all its varying forms. "Alas, poor Substance! I knew him, DataCamp."

All of this, of course, is to be taken playfully. I have no quarrel with any of DataCamp's 8-step program. I only ask that we recognize that there are three circles of equal value. Some of us come to data science with substantive expertise, seeking new models for old problems. Some even contribute libraries applying those models in their particular areas of substantive expertise. R provides a common language through which we can visit foreign disciplines and see the same statistical models from a different perspective.

John Chambers reminds us in his UseR! 2014 keynote address that R began as a "user-centric scientific software tool" providing "an interface to the very best numerical algorithms." Adding an open platform for user-submitted packages, R also becomes the interface to a diverse range of applications. This is R's unique selling proposition. It is where one goes for new ways of seeing.

Wednesday, November 12, 2014

Building Blocks: A Compelling Image for Clustering with Nonnegative Matrix Factorization (NMF)

Would hierarchical clustering be as popular without the dendrogram? Cannot the same be said of finite mixture modeling with its multidimensional spaces populated by normal distributions? I invite you to move your mouse over the figure on the introductory page of the website for the R package mclust and click through all the graphics that bring mixture modeling to life. So what is the compelling image for nonnegative matrix factorization (NMF)?


Dendrograms are constructed from distance matrices. We have some choice in the distance or divergence metric, but once a variable has been selected, it is included in all the distance calculations. Finite mixtures and k-means avoid such matrices, but still define fit as some variation on the ratio of between-cluster and within-cluster distances, once again computed from all the variables selected for the analysis. These are the clusters pictured in the above links to dendrograms and isodensity curves in low-dimensional spaces derived from the entire set of included variables. 

However, such representations do not exhaust common ways of talking and thinking about similarity. For example, object substitution in a task or an activity is based on a more limited definition of shared functionality. These are goal-derived categories that I discussed at the end of my post showing how NMF can use top-contender rankings to reveal preference patterns for breakfast foods. Will a Danish pastry be good enough when all the donuts have been eaten? The thought of eating the donut evokes the criteria upon which substitutability, and hence similarity, will be judged (see Norm Theory: Comparing Reality to Its Alternatives). In the context of toast and other options for breakfast, the donut and the Danish may appear more similar by contrast, yet that is not what comes to mind when hungry for donuts. Similarity can only be defined within a context, as noted by Nelson Goodman.

Similarity Derived from Building Blocks in Localized Additive Representations

What did you do today? I could give you a list of activities and ask you to indicate how frequently you engaged in each activity. Who else is like you? Should we demand complete agreement across all the activities, or is it sufficient that you share some common task? If my list is a complete inventory, it will include many relatively infrequent activities performed only by specific subgroups. For example, caregivers for very young children come in all ages and genders, from diverse backgrounds and with other responsibilities, yet they form a product category with an entire aisle of the supermarket dedicated to their shared needs. Situational demands pull together individuals in rows and activities in columns to mold a building block of relational data.

To be clear, a hierarchical clustering of respondents or the rows of our data matrix averages over all the columns. We start with an nxp data matrix, but perform the analysis with the nxn dissimilarity or distance matrix. Still, the data matrix contains relational data. Individuals are associated with the activities they perform. Instead of ignoring these relationships between the rows and the columns, we could seek simultaneous clustering of individuals and activities to form blocks running along the diagonal of the data matrix (a block diagonal matrix). Consequently, we may wish to alter our initial figure at the beginning of this post to be more precise and push the colored blocks out of a straight line to form a diagonal with each block demarcated by the intersection of individuals and their frequent activities.

A Toy Example

We can see how it is all accomplished in the following NMF analysis. We will begin with a blank data matrix and combine two blocks to form the final data matrix in the lower right of the figure below.


The final data matrix represents 6 respondents in the rows and 4 activities in the columns. The cells indicate the frequency of engaging in the activity, ranging from 0=never to 6=daily. Since the rows and columns have been sorted into two 3x2 blocks along the diagonal, we have no problem directly interpreting this small final data matrix. The frequency of the 4 activities is greatest in the first column and least in the third column. The first 3 and last 3 respondents are separated by the first 2 and last 2 activities. It appears that the 6x4 data matrix might be produced by only two latent features.

The following R code creates the final data matrix as the matrix product of respondent mixing weights and activity latent feature coefficients. That is, activities get organized into packets of stuff done by the same respondents, and respondents get clustered based on their activities. If you are familiar with the co-evolution of music genre and listening communities, you will not be surprised by the co- or bi-clustering of rows and columns into these diagonal building blocks. In a larger data matrix, however, we would expect to see both purists with nonzero mixing weights for only one latent feature and hybrids that spread their weights across several latent features. As noted in earlier posts, NMF thrives on sparsity in the data matrix, especially when there is clear separation into non-overlapping blocks of rows and columns (e.g., violent action films and romantic comedies appealing to different audiences or luxury stores and discount outlets with customers tending to shop at one or the other).

# enter the data for the respondent mixing weights
MX<-matrix(c(3,2,1,0,0,0,0,0,0,1,1,2), ncol=2)
MX
 
# enter the data for the latent features
LP<-matrix(c(2,1,0,0,0,0,1,2), ncol=4, byrow=TRUE)
LP
 
# observed data is the matrix product
DATA<-MX%*%LP
DATA
 
# load the NMF library
library(NMF)
 
# run with rank=2
fit<-nmf(DATA, 2, "lee", nrun=20)
 
# output the latent feature coefficients
lp<-coef(fit)
round(lp,3)
 
# output the respondent mixing weights
mx<-basis(fit)
round(mx,3)
 
# reproduce the data matrix using NMF results
data<-mx%*%lp
round(data)
# output residuals
round(DATA-data,3)
 
# same but only for 1st latent feature
rank1<-as.matrix(mx[,1])%*%lp[1,]
round(DATA-rank1,3)
 
# same but only for 2nd latent feature
rank2<-as.matrix(mx[,2])%*%lp[2,]
round(DATA-rank2,3)
 
# additive representation 
round(rank1+rank2,3)
round(DATA-rank1-rank2,3)


The extensive comments in this R code reduce the need for additional explanation, except to emphasize that you should copy and run the code in R (after installing NMF). I did not set a seed, so the order of the two latent features may be switched when you run it. The exercise is intended to imprint the building block imagery. In addition, you might wish to think about how NMF deals with differences in respondent and activity intensity. For example, the first three respondents all engage in the first two activities but with decreasing frequency. Moreover, the same latent feature is responsible for the first two activities, yet the first activity is more frequent than the second.

I would suggest that the answer can be found in the following section of output from the above code. You must, of course, remember your matrix multiplication. The first cell in our data matrix contains a "6" formed by multiplying the first row of mx by the first column of lp, or 0.500 x 12 + 0.000 x 0 = 6. Now, it is easy to see that the 0.500, 0.333 and 0.167 in mx reveal the decreasing intensity of the first latent feature. Examining the rest of mx suggests that the last respondent should have higher scores than the previous two, and that is what we discover.

> round(lp,3)
      [,1] [,2] [,3] [,4]
[1,]   12    6    0    0
[2,]    0    0    4    8

> round(mx,3)
      [,1] [,2]
[1,] 0.500 0.00
[2,] 0.333 0.00
[3,] 0.167 0.00
[4,] 0.000 0.25
[5,] 0.000 0.25
[6,] 0.000 0.50
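
As a quick check of that arithmetic (assuming the toy example above is still in your workspace; the exact factor values may differ across runs, but their product still reproduces the data cell):

# first row of mixing weights times first column of coefficients
mx[1,]%*%lp[,1]   # 0.500 x 12 + 0.000 x 0 = 6, the first cell of DATA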

Parting Comments

When you see diagrams, such as the following from Wikipedia, you should take them literally. 


The data matrix V is reproduced approximately by a reduced rank matrix of mixing weights W multiplied by a reduced rank matrix of latent features H. These interpretations of W and H depend on V being a respondents-by-variables data matrix. One needs to be careful because many applications of NMF reverse the rows and columns, changing the meaning of W and H.

The number of columns in W and the number of rows in H can be much smaller than the number of observed variables, which is what is meant by data reduction. The same latent features are responsible for the clustering of respondents and variables. This process of co- or bi-clustering has redefined similarity by computing distances within the building blocks instead of across all the rows and columns. Something had to be done if we wish to include a complete inventory of activities. As the number of activities increases, the data become increasingly sparse and distances become more uniform (see Section 3 The Curse of Dimensionality).
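
A quick way to see this data reduction with the toy example above (assuming DATA, mx and lp are still in your workspace) is to compare the dimensions of the three matrices:

dim(DATA)   # 6 x 4 data matrix (V)
dim(mx)     # 6 x 2 respondent mixing weights (W)
dim(lp)     # 2 x 4 latent feature coefficients (H)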

The building block imagery seems to work in this example because different people engage in different activities. The data matrix is sparse due to such joint separation of rows and columns. Those building blocks, the latent features, provide a localized additive representation from which we can reproduce the data matrix by stacking the blocks, or stated more accurately, by a convex combination of the latent features.

Wednesday, November 5, 2014

Net Promoter Mixture Modeling: Can a Single Likelihood Rating Reveal Customer Segments?

Net Promoter believes that customers come in one of three forms: promoters (happy yellows), passives (neutral grays), or detractors (angry reds). Cluster identification is relatively easy, for all you need to do is ask the "ultimate question" concerning likelihood to recommend. As the figure indicates, the top two boxes (9-10) are promoters, the bottom seven boxes (0-6) are detractors, and the net promoter score (NPS) is the difference in those aggregate percentages. Although there is considerable controversy surrounding this scoring system, I wish to focus on the density estimation question of whether I can look at the distribution of ratings across this 11-point scale and determine whether or not there are clusters (e.g., a bimodal distribution might suggest the mixing of two components).

The mclust R package (Section 6 Density Estimation) illustrates this type of analysis using eruption data from the Old Faithful geyser. The histogram on the left reflects our situation using only the waiting time between eruptions. The distribution appears to be bimodal suggesting that there might be two different sources generating the eruptions. The two-dimensional plot on the right, displaying both waiting times between eruptions and length of the eruptions, was included to show how easily we can generalize to higher dimensional spaces and how an additional dimension can increase separation. It appears that some eruptions occur more quickly and last for a shorter time than those eruptions with greater latencies and longer durations. You can find a more detailed introduction to mclust and all the R code needed to produce such plots in an earlier post.
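
If you want to try something similar yourself, here is a minimal sketch using the faithful dataset that ships with R (a stand-in for the figure above; plotting arguments may vary with your mclust version):

# density estimation for the Old Faithful data
library(mclust)

# one dimension: waiting time between eruptions
waiting<-faithful$waiting
dens<-densityMclust(waiting)
summary(dens, parameters=TRUE)
plot(dens, what="density", data=waiting, breaks=20)

# two dimensions: waiting time plus eruption length
mod<-Mclust(faithful)
summary(mod)
plot(mod, what="classification")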


Evidence for Clusters in the Distribution of Recommendation Scores?

The observed distribution for likelihood to recommend tends to follow a familiar pattern with the entire curve moved up or down the scale depending on the provider's overall performance. High recommendation ratings are not unusual with the best providers having as many as half of their customers giving the highest possible marks. At the other end, we often find a much smaller group who are quite unhappy. Another bump in the distribution appears in the middle with a larger than otherwise expected proportion using the midpoint. In general, the highest peak is toward the upper end with progressively smaller summits as one moves down the scale.

First, we will look at the density plot for a group of more boutique providers with higher overall recommendation scores. Although the data are from but one proprietary study, one consistently finds similar curves for smaller brands appealing to niche segments.  Each of the 11 bars in the histogram represents a different rating from 0 to 10. Over 40% of the customers tell us that they are extremely likely to recommend. We also see a slight "bump" at 5 and another at the very bottom for a score of 0. Net Promoter claims that this distribution of recommendation ratings was generated by three different customer clusters as shown in the first figure and defined by the intervals 9-10, 7-8 and 0-6.

The mclust R package allows us to test that hypothesis, returning the best BIC for 6 components with equal variances spread out at almost two-point intervals. That is, when asked what mixture of normal distributions might have generated this distribution of recommendation likelihood, mclust identified six components spaced at approximately equal intervals. We can specify the number of clusters by inserting G=3 into the densityMclust function. The three normal density curves representing the three clusters from mclust have been added to the histogram: 78% with mean 9.3, 15% in the middle concentrated about 5.8, and 7% averaging 1.4.
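
A sketch of that test, assuming recommend holds the 0-10 ratings (the proprietary data are not included here): let mclust search over the number of components using BIC and then force the three-cluster Net Promoter solution for comparison.

library(mclust)
# BIC selects the number of components (default search is G=1:9)
free<-densityMclust(recommend)
summary(free)
# impose the three clusters claimed by Net Promoter
forced<-densityMclust(recommend, G=3)
summary(forced, parameters=TRUE)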

Next, we will repeat the previous analysis with a different set of providers and their customers. As shown below, customers from larger mass market providers followed a similar pattern but with more spread at the top and more use of the middle and bottom boxes. Once again mclust returns 6 components, and when forced with G=3, reproduces the three normal density curves: 61% centered at 8.7, 25% close to the midpoint at 5.2, and 14% at the bottom with a mean of 0.9. The pattern seems to be the same but with lower ratings. At no time do we see any empirical support for the Net Promoter cutpoints.


Finally, we can compare the distributions for recommendation and overall satisfaction for our mass market. A correlation of 0.89 between the two suggests that we are measuring the same construct, except that satisfaction gets higher ratings because customers are more reluctant to recommend than they are to express satisfaction. Generally, recommendation is a more difficult achievement for the provider than satisfaction, or said differently, satisfaction is necessary but not sufficient for recommendation (e.g., must be better than competitors to recommend or must be good for everyone and not just you).


Measuring a Single Avoidance-Approach Evaluative Dimension

Customers probably do come from a mixture of clusters. The hostage is looking for an escape route, the advocate enjoys singing praises, and the habitual is oblivious. If we wished to identify such loyalty segments, we would need to ask about actual behaviors in real situations, for this is what differentiates the clusters (e.g., a battery of who, where and when did you recommend rather than some abstract propensity to recommend measured out of any context).

We capture none of the underlying segmentation with rating scales that measure a single evaluative dimension. Recommendation is not the ultimate question but just another indicator of one's orientation toward the brand, specifically an index of avoidance-approach or thumbs-up and thumbs-down. What separates satisfaction, retention and recommendation is their difficulty level, for all three measure the same latent variable, which is why they are so highly correlated.

The peaks and valleys in the satisfaction and recommendation distribution are not indicators of customer type but scale usage. Customers make their point when they use the extremes of a rating scale. The dissatisfied let you know by uniformly giving the lowest possible score. The pleased also want to tell you to keep up the good work. Those who are not sure or don't care overuse the midpoint.

R code from mclust to create these plots:
# recommend is the vector of 0-10 likelihood-to-recommend ratings
# (proprietary data, so not included here)
library(mclust)
# force the three-component solution claimed by Net Promoter
promoters <- densityMclust(recommend, G=3)
# mixing proportions, means and variances for the three components
summary(promoters, parameters = TRUE)
# overlay the component density curves on the histogram of ratings
plot(promoters,recommend,breaks=-1:10)


Monday, November 3, 2014

Let's Do Some MORE Hierarchical Bayes Choice Modeling in R!

R enables us to "test drive" statistical models by running simulations and analyzing actual data. We can look at the output and get some sense of how the whole thing works. The goal is to provide a realistic application that is not too difficult to follow: here is the data, next reformat that data into a form that the function can handle, now for the code to run the analysis and get the output, and finally a little interpretation.

This was my motivation for the original post "Let's Do Some Hierarchical Bayes Choice Modeling in R!" (without the "more"). In that example, I varied pricing for two brands in a 3x3 complete factorial design and included a "Buy None of These" checkbox in order to measure demand. As long as consumers can leave the product category and buy nothing, we need to keep that option in the choice set.
In addition, I allowed for alternative-specific effects by estimating the impact of price separately for the two brands. I was trying to mimic the differential effects of price changes for national and generic products (e.g., when buying olive oil becomes a choice between a well-known Italian and a store brand). At least one could argue that the types of price variations seen in this experiment are not unlike what one sees in the store with side-by-side comparisons and periodic price discounts for both national and store brands. As with any experimental study, choice modeling can be obtrusive. We need to be careful not to create price sensitivity that is not present in the market by introducing artificial variation and focusing the consumer's attention.

This mimicking of the purchase process where consumers decide among several products with varying descriptions may well be the most common implementation of choice modeling in marketing research. Yet, choice modeling can also be used to trade off features within a realistic context without including alternative-specific effects. For example, Huber and Train analyze data from choice sets where households selected the company from which they wanted to buy electricity. Assuming that every household wants electricity, customers were presented offers from four different suppliers and asked to select one.

You can find the dataset in the R package mlogit and a complete description of the analysis, including R code, in Section 3 Mixed Logit Model. Instead of repeating the details of the mlogit analysis, I will focus on how to transform the dataset into the format that bayesm wants and run a hierarchical Bayes (HB), which according to Huber and Train produces similar results.

The R Code to Format the Data for bayesm and Run the HB
library("mlogit")
data("Electricity", package = "mlogit")
Electr <- mlogit.data(Electricity, id="id", choice="choice",
  varying=3:26, shape="wide", sep="")
Elec.mxl <- mlogit(choice~pf+cl+loc+wk+tod+seas|0, Electr,
  rpar=c(pf='n', cl='n', loc='n', wk='n', tod='n', seas='n'),
  R=100, halton=NA, print.level=0, panel=TRUE)
summary(Elec.mxl)
 
library(bayesm)
id=levels(as.factor(Electricity$id))
nresp<-length(unique(id))
lgtdata=NULL
 
for (i in 1:nresp)
{
  respdata=Electricity[Electricity$id==id[i],]
  ty<-NULL
  tdesign<-NULL
  ty=respdata$choice
  nobs=length(ty)
  for (j in 1:nobs) {
    design1<-as.matrix(respdata[j,c(3,7,11,15,19,23)])
    design2<-as.matrix(respdata[j,c(4,8,12,16,20,24)])
    design3<-as.matrix(respdata[j,c(5,9,13,17,21,25)])
    design4<-as.matrix(respdata[j,c(6,10,14,18,22,26)])
    tdesign<-rbind(tdesign,design1,design2,design3,design4)
  }
  lgtdata[[i]]=list(y=ty,X=as.matrix(tdesign))
}
 
mcmc=list(R=2000,keep=10)
out=rhierMnlRwMixture(Data=list(p=4,lgtdata=lgtdata),
                      Prior=list(ncomp=1),Mcmc=mcmc)
 
plot(out$loglike,type="l")
trace<-t(apply(out$betadraw,c(2,3),mean))
matplot(trace, type="l")
 
beta.51_200<-apply(out$betadraw[,,51:200],2,mean)
beta.101_200<-apply(out$betadraw[,,101:200],2,mean)
beta.151_200<-apply(out$betadraw[,,151:200],2,mean)
cbind(beta.51_200,beta.101_200,beta.151_200)
 
estimate<-apply(out$betadraw[,,101:200],c(1,2),mean)
estimate2<-cbind(matrix(id),estimate)
write.csv(estimate2, file="estimate.csv")

Once you understand that mlogit wants the data in a long format and that bayesm is looking for a list structure, it will all become a little easier. There is always nesting in a hierarchical design, and in this case, choice sets are nested within respondent id. Things get a little complicated because not all respondents complete the same number of choice sets. We have a solution because we know the respondent id, which is what each row of Electricity provides. In addition, each row tells us which of the four offers was selected (choice) and the features for each offer: fixed price (pf1, pf2, pf3, pf4), contract length (cl1, cl2, cl3, cl4), local company (loc1, loc2, loc3, loc4), well-known (wk1, wk2, wk3, wk4), time-of-day (tod1, tod2, tod3, tod4), and seasonal rate (seas1, seas2, seas3, seas4). Now, we have all that is necessary to run the analysis using maximum likelihood (mlogit) or hierarchical Bayes (bayesm).

The mlogit package transforms the wide form with all the choices on a single row into a long form with each alternative on its own row using the mlogit.data( ) function. You should print out the data for the first respondent in both the wide Electricity (12 rows) and the long Electr (4x12=48 rows). In the long format, choice is not a number but a TRUE or FALSE for each alternative (alt). All of this will seem familiar if you have used the reshape function in R or the reshape R package.
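
A quick way to see the difference, assuming the code above has been run and that the first household has id equal to 1:

# wide format: one row per choice set (12 rows for this household)
Electricity[Electricity$id==1,]
# long format: one row per alternative (4 x 12 = 48 rows) with choice as TRUE/FALSE
Electr[Electr$id==1,]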

The bayesm package adopts a different formatting approach by placing each respondent into a list structure containing choice as the dependent variable and the features gathered together as a design matrix. The listing below shows the first respondent, who is the first element of the list. If you look back at the first 12 rows of Electricity, you will see that the choices are listed in $y and that the values of the six features for each of the 12 choice sets are stacked in blocks of four rows in $X. Thus, the first element of $y is 4 (option 4 was picked), and the feature profile for option 4 in the first choice set was (0, 5, 0, 1, 1, 0). Had there been a "none of these" option, it would have been given a "5" in $y, and a fifth row with all zeros would have been inserted at the end of each choice set in $X, so that $X would have 60 rows instead of 48.

> lgtdata[[1]]
$y
 [1] 4 3 4 4 1 4 1 3 1 2 3 4

$X
   pf1 cl1 loc1 wk1 tod1 seas1
1    7   5    0   1    0     0
1    9   1    1   0    0     0
1    0   0    0   0    0     1
1    0   5    0   1    1     0
2    7   0    0   1    0     0
2    9   5    0   1    0     0
2    0   1    1   0    1     0
2    0   5    0   0    0     1
3    9   5    0   0    0     0
3    7   1    0   1    0     0
3    0   0    0   1    1     0
3    0   0    1   0    0     1
4    0   1    0   1    0     1
4    9   1    0   0    0     0
4    7   0    1   0    0     0
4    0   5    0   1    1     0
5    0   0    1   0    1     0
5    9   1    0   1    0     0
5    0   0    0   0    0     1
5    7   5    0   1    0     0
6    0   0    0   0    0     1
6    9   0    0   0    0     0
6    0   1    1   0    1     0
6    7   5    0   1    0     0
7    0   5    0   1    0     1
7    9   0    0   0    0     0
7    0   1    0   0    1     0
7    7   5    1   0    0     0
8    9   0    0   1    0     0
8    7   1    0   0    0     0
8    0   1    1   0    0     1
8    0   5    0   0    1     0
9    0   5    0   1    0     1
9    0   5    0   0    1     0
9    9   0    0   1    0     0
9    7   1    0   0    0     0
10   0   1    0   1    0     1
10   0   5    1   0    1     0
10   7   0    0   0    0     0
10   9   0    0   1    0     0
11   9   5    0   1    0     0
11   0   0    0   1    1     0
11   0   5    1   0    0     1
11   7   1    0   0    0     0
12   0   1    0   1    1     0
12   7   5    0   0    0     0
12   9   5    1   0    0     0
12   0   0    0   1    0     1

How did I get the data in this format? I used a loop: for (i in 1:nresp). First I pulled out only the data for that one respondent and placed it in respdata. The choice column went into ty. The length of ty told me how many choice sets the respondent completed, and my loop continued until I reached that number. Although there are more direct ways to do this, I simply created separate design matrices for each alternative using the correct column number for this particular dataset. Then, rbind stacked these choice set design matrices and placed them in tdesign. Finally, ty and tdesign were added as an element of the list structure called lgtdata. This entire process is repeated until each respondent has their own element in the list (e.g., the last respondent is lgtdata[[361]]).

Interpreting the bayesm Output

The heavy lifting is over, for we have the data formatted for bayesm. HB estimation requires some values for the MCMC, in this case the number of simulations (R=2000) and how many to keep (keep=10). My earlier post with the almost identical title covers this in more detail. Although I do not recommend buying or using their software, Sawtooth is a good source for introductory papers and videos. The review by Allenby and Rossi should be helpful. You may wish to watch the video analyzing choice data.

The function that does the work in bayesm is a hierarchical multinomial logit with a mixture of normal distributions. I have set ncomp=1 to indicate that all the respondents come from the same multivariate normal density. Specifically, hierarchical Bayes "borrows" data from other respondents. It is not unlike borrowing clothes from your friends. It helps if they are a similar shape and size. Here, I am claiming that all the respondents come from the same distribution of weights. Different respondents can have different weights, but the weights for all the individuals are normally distributed with a common mean and variance. Consequently, I cannot go back after the fact and segment respondents, for I have already claimed that there is one segment in order to borrow data from others.

I limited the number of simulations so that bayesm runs in only a few minutes. You should copy and run the code after installing mlogit and bayesm. The plots after the HB will display the changes in the fit (loglike) and the coefficients over the 200 runs that were kept (R=2000 but only every 10th was saved). Increasing the number of runs will take more time and help one become more confident that convergence has been achieved. Personally, as shown in the R code, I like to compare the coefficients after discarding the first 25%, 50% and 75% of the runs. If the results do not fluctuate, I am more comfortable.

It should be noted that the coefficients from mlogit and bayesm will not be identical. There are always scaling factors when the goal is to predict choice (e.g., doubling all the coefficients returns the same result, for the highest-scoring alternative remains the highest scoring). Huber and Train introduce such a scaling factor to adjust their Bayesian estimates. In addition, you should remember that only some of the features are binary effects. For example, length of the contract (cl) has levels of 0, 1 and 5 years. How important is length of the contract? It depends on the variation in length. How important is price? It depends on the price range, implying that I can make price more important in the choice process by increasing its range. Whenever we speak of the importance of some attribute, there is always an unspecified range of values assumed by the speaker, which is why stated importance is typically of such little value.

Finally, I have given the R code to write out a file that Excel can read. I ask that you output and look at these coefficients. Moreover, now that you have the individual estimates as an R object, why not examine the descriptive statistics and plots to determine if the normality assumptions are reasonable? Section 3 on the mixed logit model from the mlogit vignette covering Kenneth Train's exercises is especially valuable because it provides a window into how econometricians approach choice modeling. The term "mixed" is used as it is in ANOVA to indicate that some of the parameters can be fixed and others can be random. Do we use the same levels (fixed) or sample new levels (random) when the study is replicated? Respondents are random because the study is replicated with different households. The six features are fixed since the same choice sets are used again without alteration.
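
Here is a sketch of that check, assuming the estimate matrix created by the code above (respondents in rows, the six features in columns, ordered as in the design matrix):

# label the columns with the feature names
colnames(estimate)<-c("pf","cl","loc","wk","tod","seas")
summary(estimate)
# histograms of the individual-level posterior means for each feature
par(mfrow=c(2,3))
for (k in 1:6) hist(estimate[,k], main=colnames(estimate)[k], xlab="posterior mean")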

I have tried to make it easier to get started with choice modeling by explaining the R code needed to obtain individual-level hierarchical Bayes estimates. My experience is that once you are confident that you can analyze data, you will be more willing to spend the time and effort to learn the statistics and the psychology underlying these models. In this particular case, one learns a great deal by comparing different estimation procedures from two R packages, bayesm and mlogit. As a bonus, we are introduced to the different viewpoints from marketing and econometrics. Marketing sees heterogeneity as an opportunity, while econometrics tends to view it more as a source of variation that must be acknowledged and dealt with (almost as a nuisance).

Sunday, October 26, 2014

Combating Multicollinearity by Asking the Right Questions and Uncovering Latent Features

Overview. When responding to questions about brand perceptions or product feature satisfaction, consumers construct a rating by relying on their overall satisfaction with the brand or product plus some general category knowledge of how difficult it is to deliver each feature. In order to get past such halo effects, we need to ask questions that require respondents to relive their prior experiences and access memories of actual occurrences. Then, we must find a statistical model to analyze the high-dimensional and sparse data matrix produced when so many detailed probes return no, never, or none responses. The R package NMF (nonnegative matrix factorization) provides a convenient entry into the analysis of latent features for those familiar with factor analysis and mixture models.

Revisiting Stated versus Derived Importance

It has become common practice in survey research to ask participants to act as self-informants. After all, who knows more about the reasons for your behavior than yourself? So why not simply ask why, or some variation of that question, and be done with it? For example, exit polling wants to know how you voted and then the reasons for your vote. The same motivation drives consumer researchers who are not satisfied with purchase intent alone, so they drill down into the causes with direct questions, either open-ended or lists of possible reasons.

All is well as long as the respondent is able and willing to provide a response that can be used to improve the marketing of products or candidates. Unfortunately, "know thyself" is no easier for us than it was for the ancient Greeks. The introspection illusion has been well documented. Simply put, we feel that those reasons that are easiest to provide when asked why must be the motivations for our behavior. The response to the exit poll may be nothing more than a playback of something previously heard or read. Yet, it is so easy to repeat that it must be the true reason. The questions almost write themselves, and the responses are tabulated and tracked over time without much effort at all. You have seen the headlines: "Top 10 Reasons for This or That" or "More Doing This for That Reason." So, what is the alternative?

Marketing research faced a similar situation with the debate over stated versus derived importance. Stated importance, as you might have inferred from the name, is a self-report by the respondent concerning the contribution of a feature or benefit to a purchase decision. The wording is typically nonspecific, such as "how important is price" without any actual pricing information. Respondents supply their own contexts, presumably derived from variations in price that they commonly experience, so that in the end we have no idea of the price range they are considering. Regrettably, findings do not generalize well, for what is not important in the abstract becomes very important in the marketplace. The devil is in the details, and actual buying and selling is filled with details.

Derived importance, on the other hand, is the result of a statistical analysis. The experimental version is conjoint analysis or choice modeling. By systematically varying the product description, one estimates the impact of manipulating each attribute or feature. With observational data, one must rely on natural variation and perform a regression analysis predicting purchase intent for a specific brand from other ratings of the same brand.

In both cases we are looking for leverage, specifically, a regression coefficient derived from regressing purchase interest on feature levels or feature ratings. If the goal is predicting how consumers will respond to changing product features, then conjoint seems to be the winner, once you are satisfied that the entire process is not so intrusive that the results cannot be generalized to the market. Yet, varying attributes in an experiment focuses the consumer's attention on aspects that would not be noticed in the marketplace. In the end, the need for multiple ratings or choices from each respondent can create rather than measure demand.

On the other hand, causal inferences are not possible from observation alone. All we know from the regression analysis are comparisons of the perceptual rating patterns of consumers with different purchase intent. We do not know the directionality or if we have a feedback loop. Do we change the features to impact the perceptions in order to increase purchase intent? Or, do we encourage purchase by discounting price or adding incentives so that product trial will alter perceptions? Both of these approaches might be successful if the associative relationship between perception and intent results from a homeostatic process of mutual feedback and reinforcement.

Generalized Perceptions Contaminated with Overall Satisfaction

Many believe that "good value for the money" is a product perception and not another measure of purchase intent. Those that see value as a feature interpret its high correlation with likelihood to buy as an indication of its derived importance. Although it is possible to think of situations where one is forced to repurchase a product that is not a good value for the money, in general, both items are measuring the same underlying positive affect toward the product. Specifically, the memories that are retrieved to answer the purchase question are the same memories that are retrieved to respond to the value question. Most of what we call "perceptions" are not concrete features or services asked about within a specific usage context that would tap different memories. Consequently, we tend to see predictors in the regression equation with substantial multicollinearity from halo effects because we only ask our respondents to recall the "gist" of their interactions and not the details.

Our goal is to collect informative survey data that measures more than a single approach-avoidance evaluative dimension (semantic memory). The multicollinearity among our predictors that continually plagues our regression analyses stems from the lack of specificity in our rating items. Questions that probe episodic memories of features or services used or provided will reduce the halo effect. Unfortunately, specificity creates its own set of problems trying to analyze high-dimensional and sparse data. Different needs generate diverse usage experiences resulting in substantial consumer heterogeneity. Moreover, the infrequently occurring event or the seldom used feature can have a major impact and must be included in order for remedial action to be taken. Some type of regularization is one approach (e.g., the R package glmnet), but I prefer an alternative that attempts to reduce the large number of questions to a smaller set of interpretable latent features.
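
To make the regularization option concrete, here is a minimal sketch with simulated data (everything below is hypothetical and only illustrates the glmnet workflow): a mostly-zero feature-usage matrix predicting an overall rating, with the lasso shrinking most coefficients to exactly zero.

library(glmnet)
set.seed(1234)
n<-200
p<-50
# sparse usage counts for 50 detailed feature probes
X<-matrix(rbinom(n*p, size=5, prob=0.1), n, p)
# overall rating driven by only two of the features plus noise
y<-X[,1]-X[,2]+rnorm(n)
# lasso with cross-validated penalty
cv<-cv.glmnet(X, y, alpha=1)
# most coefficients are exactly zero at lambda.1se
coef(cv, s="lambda.1se")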

An Example to Make the Discussion Less Abstract

If we were hired by a cable provider to assess customer satisfaction, we might start with recognizing that not everyone subscribes to all the services offered (e.g., TV, internet, phone and security). Moreover, usage is also likely to make a difference in their satisfaction judgments, varying by the ages and interests of household members. This is what is meant by consumers residing in separate subspaces: parents who use their security system to monitor their children while they are at work have very different experiences from a retired couple without internet access. Do I need to mention teens in the family? Now, I will ask you to list all the good and bad experiences that a customer might have using all possible services provided by the cable company. It is a long list, but probably not any longer than a comprehensive medical inventory. The space formed by all these items is high-dimensional and sparse.

This is a small section from our data matrix with every customer surveyed as a row and experiences that can be probed and reliably remembered as the columns. The numbers are measures of intensity, such as counts or ratings. The last two respondents did not have any interaction with the six features represented by these columns. The entire data matrix is just more of the same, with large patches of zeros indicating that individuals with limited interactions will repeatedly respond no, never, or none.


In practice, we tend to compromise since we are seeking only actionable experiences that are frequent or important enough to make a difference and that can be remediated. Yet, even given such restrictions, we are still tapping episodic or autobiographical memories that are relatively free of excessive halo effects because the respondent must "relive" the experience in order to provide a response.

Our data matrix is not random but reflects underlying dynamics that create blocks of related rows and columns. In order to simplify this discussion we can restrict ourselves to feature usage. For example, sports fans must watch live events in high definition. One's fanaticism is measured by the breadth and frequency of events watched. It is easy to imagine a block in our data matrix with sports fans as the rows, sporting events as the columns and frequency as the cell entries. Kids in the household along with children's programming generate another block, and so on. To be clear, we co-cluster or bicluster the rows and columns simultaneously, for it is their interaction that creates clusters.

The underlying dynamics responsible for the co-clustering of the rows and the columns can be called a latent feature. It is latent because it is not directly observed, and as in factor analysis, we will name the latent construct using coefficients or loadings reflecting its relationships to the observed columns. "Feature" was chosen due to the sparsity of the coefficients, with only a few sizeable values and the rest close to zero. As a result, we tend to speak of co-clustering rows and columns so that "latent feature" seems more appropriate than latent variable.

You can find an example analysis of a feature usage inventory using the R package NMF in a previous post. In addition, all the R code needed to run such an analysis can be found in a separate post. In fact, much of my writing over the last several months has focused on NMF, so you may wish to browse. There are other alternatives for biclustering in R, but nonnegative matrix factorization is such an easy transition from principal component analysis and mixture modeling that most should have little trouble performing and interpreting the analysis.

Tuesday, October 21, 2014

Modeling Plenitude and Speciation by Jointly Segmenting Consumers and their Preferences

In 1993, when music was sold in retail stores, it may have been informative to ask about preference across a handful of music genre. Today, now that the consumer has seized control and the music industry has responded, the market has exploded into more than a thousand different fragmented pairings of artists and their audiences. Grant McCracken, the cultural anthropologist, refers to such proliferation as speciation and the resulting commotion as plenitude. As with movies, genre become microgenre forcing recommender systems to deal with more choices and narrower segments.

This mapping from the website Every Noise at Once is constantly changing. As the website explains, there is a generating algorithm with some additional adjustments in order to make it all readable, and it all seems to work as an enjoyable learning interface. One clicks on the label to play a music sample. Then, you can continue to a list of artists associated with the category and hear additional samples from each artist. Although the map seems to have interpretable dimensions and reflects similarity among the microgenre, it does not appear to be a statistical model in its present form.

At any given point in time, we are stepping into a dynamic process of artists searching for differentiation and social media seeking to create new communities who share at least some common preferences. Word of mouth is most effective when consumers expect new entries and when spreading the word is its own reward. It is no longer enough for a brand to have a good story if customers do not enjoy telling that story to others. Clearly, this process is common to all product categories even if they span a much smaller scale. Thus, we are looking for a scalable statistical model that captures the dynamics through which buyers and sellers come to a common understanding. 

Borrowing a form of matrix factorization from recommender systems, I have argued in previous posts for implementing this kind of joint clustering of the rows and columns of a data matrix as a replacement for traditional forms of market segmentation. We can try it with a music preference dataset from the R package prefmod. Since I intend to compare my findings with another analysis of the same 1993 music preference data using the new R package RCA and reported in the American Journal of Sociology, we will begin by duplicating the few data modifications that were made in that paper (see the R code at the end of this post).

In previous attempts to account for music preferences, psychologists have focused on the individual and turned to personality theory for an explanation. For the sociologist, there is always the social network. As marketing researchers, we will add the invisible hand of the market. What is available? How do consumers learn about the product category and obtain recommendations? Where is it purchased? When and where is it consumed? Are others involved (public vs private consumption)?

The Internet opens new purchase pathways, encourages new entities, increases choice and transfers control to the consumer. The resulting postmodern market with its plenitude of products, services, and features cannot be contained within a handful of segments. Speciation and micro-segmentation demand a model that reflects the joint evolution where new products and features are introduced to meet the needs of specific audiences and consumers organize their attention around those microgenre. Nonnegative matrix factorization (NMF) represents this process with a single set of latent variables describing both the rows and the columns at the same time.

After attaching the music dataset, NMF will produce a cluster heatmap summarizing the "loadings" of the 17 music genre (columns below) on the five latent features (rows below): Blues/Jazz, Heavy Metal/Rap, Country/Bluegrass, Opera/Classical, and Rock. The dendrogram at the top displays the results of a hierarchical clustering. Although there are five latent features, we could use the dendrogram to extract more than five music genre clusters. For example, Big Band and Folk music seem to be grouped together, possibly as a link from classical to country. In addition, Gospel may play a unique role linking country and Blues/Jazz. Whatever we observe in the columns will need to be verified by examining the rows. That is, one might expect to find a segment drawn to country and jazz/blues who also like gospel.


We would have seen more of the lighter colors with coefficients closer to zero had we found greater separation. Yet, this is not unexpected given the coarseness of music genre. As we get more specific, the columns become increasingly separated by consumers who only listen to or are aware of a subset of the available alternatives. These finer distinctions define today's market for just about everything. In addition, the use of a liking scale forces us to recode missing values to a neutral liking. We would have preferred an intensity scale with missing values coded as zeros because they indicate no interaction with the genre. Recoding missing to zero is not an issue when zero is the value given to "never heard of" or unaware.

Now, a joint segmentation means that listeners in the rows can be profiled using the same latent features accounting for covariation among the columns. Based on the above coefficient map, we expect those who like opera to also like classical music so that we do not require two separate scores for opera and classical but only one latent feature score. At least this is what we found with this data matrix. A second heatmap enables us to take a closer look at over 1500 respondents at the same time.


We already know how to interpret this heatmap because we have had practice with the coefficients. These colors indicate the values of the mixing weights for each respondent. Thus, in the middle of the heatmap you can find a dark red rectangle for latent feature #3, which we have already determined to represent country/bluegrass. These individuals give the lowest possible rating to everything except for the genre loading on this latent feature. We do not observe that much yellow or lighter colors in this heatmap because less than 13% of the responses fell into the lowest box labeled "dislike very much." However, most of the lighter regions are where you might expect them to be, for example, heavy metal/rap (#2), although we do uncover a heavy metal segment at the bottom of the figure.

Measuring Attraction and Ignoring Repulsion

We often think of liking as a bipolar scale, although what determines attraction can be different from what drives repulsion. Music is one of those product categories where satisfiers and dissatisfiers tend to be different. Negative responses can become extreme so that preference is defined by what one dislikes rather than what one likes. In fact, it is being forced to listen to music that we do not like that may be responsible for the lowest scores (e.g., being dragged to the opera or loud music from a nearby car). So, what would we find if we collapsed the bottom three categories and measured only attraction on a 3-point scale with 0=neutral, dislike or dislike very much, 1=like, and 2=like very much?

NMF thrives on sparsity, so increasing the number of zeros in the data matrix does not stress the computational algorithm. Indeed, the latent features become more separated, as we can see in the coefficient heatmap. Gospel stands alone as its own latent feature. Country and bluegrass remain, as do opera/classical, blues/jazz, and rock. When we "remove" dislike for heavy metal and rap, heavy metal moves into rock and rap floats with reggae between jazz and rock. The same is true for folk and easy mood music, only now both are attractive to country and classical listeners.

More importantly, we can now interpret the mixture weights for individual respondents as additive attractors, so that the first few rows are those with interest in all the musical genre. In addition, we can easily identify listeners with specific interests. As we continue to work our way down the heatmap, we find jazz/blues (#4), followed by rock (#5) and a combination of jazz and rock. Continuing, we see country (#2) plus rock and country alone, after which is a variety of gospel (#1) plus some other genre. We end with opera and classical music, by itself and in combination with jazz.

Comparison with the Cultural Omnivore Hypothesis

As mentioned earlier, we can compare our findings to a published study testing whether inclusiveness rules tastes in music (the eclectic omnivore) or whether cultural distinctions between highbrow and lowbrow still dominate. Interestingly, the cluster analysis is approached as a graph-partitioning problem where the affinity matrix is defined as similarity in the score pattern regardless of mean level. Not everyone agrees with this calculation, and we have a pair of dueling R packages using different definitions of similarity (the RCA vs. the CCA).

None of this is news for those of us who perform cluster analysis using the affinity propagation R package apcluster, which enables several different similarity measures including correlations (signed and unsigned). If you wish to learn more, I would suggest starting with the Orange County R User webinar for apcluster. The quality and breadth of the documentation will flatten your learning curve.
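
For those who want to experiment, here is a minimal sketch using the prefer matrix built by the code at the end of this post (the correlation similarity is chosen only for illustration):

library(apcluster)
# correlation-based similarity between respondents' liking profiles
sim<-corSimMat(as.matrix(prefer))
ap<-apcluster(sim)
ap   # prints the number of clusters and their exemplars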

Both of the dueling R packages argue that preference similarity ought to be defined by the highs and lows in the score profiles ignoring the mean ratings for different individuals. This is a problem for marketing since consumers who do not like anything ought to be treated differently from consumers who like everything. One is a prime target and the other is probably not much of a user at all.

Actually, if I were interested in testing the cultural omnivore hypothesis, I would be better served by collecting familiarity data on a broader range of more specific music genre, perhaps not as detailed as the above map but more revealing than the current broad categories. The earliest signs of preference can be seen in what draws our attention. Recognition tends to be a less obtrusive measure than preference, and we can learn a great deal knowing who visits each region in the music genre map and how long they stayed.

NMF identifies a sizable audience who are familiar with the same subset of music genre. These are the latent features, the building blocks as we have seen in the coefficient heatmaps. The lowbrow and the highbrow each confine themselves to separate latent features, residing in gated communities within the music genre map and knowing little of the other's world. The omnivore travels freely across these borders. Such class distinctions may be even more established in the cosmetics product category (e.g., women's makeup). Replacing genre with brand, you can read how this was handled in a prior post using NMF to analyze brand involvement.

R code to perform all the analyses reported in this post
library(prefmod)
data(music)
 
# keep only the 17 genre used
# in the AJS paper (see post)
prefer<-music[,c(1:11,13:18)]
 
# calculate number of missing values for each
# respondent and keep only those with no more
# than 6 missing values
miss<-apply(prefer,1,function(x) sum(is.na(x)))
prefer<-prefer[miss<7,]
 
# run frequency tables for all the variables
apply(prefer,2,function(x) table(x,useNA="always"))
# recode missing to the middle of the 5-point scale
prefer[is.na(prefer)]<-3
# reverse the scale so that larger values are
# associated with more liking and zero is
# the lowest value
prefer<-5-prefer
 
# longer names are easier to interpret
names(prefer)<-c("BigBand",
"Bluegrass",
"Country",
"Blues",
"Musicals",
"Classical",
"Folk",
"Gospel",
"Jazz",
"Latin",
"MoodEasy",
"Opera",
"Rap",
"Reggae",
"ConRock",
"OldRock",
"HvyMetal")
 
library(NMF)
fit<-nmf(prefer, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)
 
# recode bottom three boxes to zero
# and rerun NMF
prefer2<-prefer-2
prefer2[prefer2<0]<-0
# need to remove respondents with all zeros
total<-apply(prefer2,1,sum)
table(total)
prefer2<-prefer2[total>0,]
 
fit<-nmf(prefer2, 5, "lee", nrun=30)
coefmap(fit, tracks=NA)
basismap(fit, tracks=NA)