**Bradley Efron and Indirect Evidence**

In *The Future of Indirect Evidence*, Bradley Efron encourages us to look at prediction from a different perspective. He reproduces a scatterplot showing the relationship between age and kidney health for 157 healthy volunteers. A regression line is fit, and Efron asks us to predict kidney health for a new individual aged 55. The direct evidence comes from the one volunteer out of the 157 who is 55 years old. The indirect evidence is provided by the regression line, which "borrows strength" from the other 156 in the sample.

When we make a prediction for a specific individual of a certain age, we do not restrict ourselves to only those respondents who are that age (the red dot in the above plot). Instead, we rely on a statistical model of the relationship between age and kidney health. Our prediction comes from the regression line estimated using data from everyone in the sample. Efron summarizes what we have achieved: "Regression models provide an officially sanctioned frequentist mechanism for incorporating the experience of others." The term "frequentist" emphasizes Efron's point that one does not need to be Bayesian in order to "learn from the experience of others."
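To make the contrast concrete, here is a minimal simulation sketch in Python. It does not use Efron's kidney data: the intercept, slope, noise level, and age range are all made-up values chosen only for illustration. Each simulated sample contains 157 people, exactly one of whom is 55, and we compare how far the lone 55-year-old's score and the regression prediction each fall from the true mean at age 55.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def compare_direct_and_indirect(reps=200, n=157, target_age=55, seed=0):
    """Mean squared error of two predictions of the true mean at target_age."""
    rng = random.Random(seed)
    # Hypothetical "true" relationship; these are NOT Efron's estimates.
    intercept, slope, noise_sd = 2.0, -0.08, 1.8
    truth = intercept + slope * target_age
    other_ages = [a for a in range(18, 89) if a != target_age]
    sq_err_direct, sq_err_line = [], []
    for _ in range(reps):
        # Exactly one person in each sample is target_age years old.
        ages = [target_age] + [rng.choice(other_ages) for _ in range(n - 1)]
        scores = [intercept + slope * a + rng.gauss(0, noise_sd) for a in ages]
        direct = scores[0]                      # direct evidence: one person
        a_hat, b_hat = fit_line(ages, scores)   # line fit to all n people
        indirect = a_hat + b_hat * target_age   # borrows strength from everyone
        sq_err_direct.append((direct - truth) ** 2)
        sq_err_line.append((indirect - truth) ** 2)
    return statistics.fmean(sq_err_direct), statistics.fmean(sq_err_line)


mse_direct, mse_line = compare_direct_and_indirect()
print(f"MSE using only the 55-year-old: {mse_direct:.3f}")
print(f"MSE using the regression line:  {mse_line:.3f}")
```

Under these assumptions the regression prediction should land much closer to the truth on average, because its error is averaged over the whole sample rather than resting on one noisy observation. The advantage depends on the linear model being a reasonable description of the data.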

Recommender systems are another example of using indirect evidence. Direct evidence is limited since a single customer rates only a very small proportion of all available alternatives. Indirect evidence, on the other hand, is substantial given a large number of customers. Although different systems rely on different models and algorithms, recommendations for an individual customer are not possible without borrowing data from other "similar" customers.

Generalizing from this approach, I showed in a previous post how the R package softImpute could be used to estimate missing values when each respondent was shown a random subset of all the items. If you think of a data matrix with respondents as rows and items as columns, missing data are simply empty cells, and missing value estimation can be seen as matrix completion. Indirect evidence fills in the empty cells. Matrix completion works because there is a good deal of underlying structure, which makes complete information somewhat redundant. To the extent that one can discover an underlying structure when consumers fill in marketing research questionnaires, that structure can be exploited to reduce respondent burden.
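The idea of filling empty cells from shared structure can be sketched in a few lines of Python. This is not softImpute's algorithm, which is not reproduced here; it is a much simpler stand-in that fits a rank-1 factorization by alternating ridge regressions over the observed cells only. The function names, the simulated "taste" and "appeal" values, and the penalty `lam` are all hypothetical.

```python
import random


def complete_rank1(X, n_iter=100, lam=1e-3):
    """Fill None cells of X with a rank-1 fit u[i] * v[j].

    u and v are updated in turn by ridge regression on observed cells only.
    """
    n, m = len(X), len(X[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        for i in range(n):
            obs = [j for j in range(m) if X[i][j] is not None]
            num = sum(X[i][j] * v[j] for j in obs)
            u[i] = num / (lam + sum(v[j] ** 2 for j in obs))
        for j in range(m):
            obs = [i for i in range(n) if X[i][j] is not None]
            num = sum(X[i][j] * u[i] for i in obs)
            v[j] = num / (lam + sum(u[i] ** 2 for i in obs))
    # Keep observed cells as they are; estimate only the empty ones.
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j]
             for j in range(m)] for i in range(n)]


def make_survey(n_resp=40, n_items=12, n_shown=6, seed=1):
    """Simulated ratings with rank-1 structure; each respondent sees a random subset."""
    rng = random.Random(seed)
    taste = [rng.uniform(1.0, 3.0) for _ in range(n_resp)]    # row effect
    appeal = [rng.uniform(0.5, 2.0) for _ in range(n_items)]  # column effect
    full = [[t * a for a in appeal] for t in taste]
    seen = []
    for row in full:
        shown = set(rng.sample(range(n_items), n_shown))
        seen.append([x if j in shown else None for j, x in enumerate(row)])
    return full, seen


full, seen = make_survey()
filled = complete_rank1(seen)
worst = max(abs(filled[i][j] - full[i][j])
            for i in range(len(full)) for j in range(len(full[0])))
print(f"largest error on any cell: {worst:.4f}")
```

The simulated data are noise-free and exactly rank 1, so recovery here is nearly perfect even though half the cells are empty. Real questionnaire data would have noise and higher rank, and the quality of the fill-in would depend on how much structure is actually present.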

**Hierarchical Bayes Choice Modeling**

R offers many alternatives for estimating the parameters of hierarchical models. Believing that it is helpful to be able to run the R code yourself and study the resulting output in depth, I posted some code using the R package bayesm.

Bayesian models excel at combining direct and indirect evidence through hierarchical structures. Marketing researchers often believe that consumers share a common set of values for evaluating the worth of different product features. Although different individuals may have their own personal weighting schemes for combining the features that are varied in a choice modeling design, no one is entirely unique. Specifically, if we were able to show each respondent a sufficient number of choice sets, we would be able to obtain separate estimates for every consumer using direct evidence only. What would the set of all individual parameters look like? Hierarchical Bayes assumes that these individual parameters follow a multivariate normal distribution, with all the respondents sharing a common mean vector. If that seems too restrictive, bayesm permits a mixture of normals that can approximate distributions with different shapes, or even a finite mixture of different segments with more than one average weighting scheme.

However, even if we had sufficient individual-level choice data to estimate parameters separately for each respondent using only direct evidence, we might still opt for the hierarchical Bayes model as a more cautious approach that avoids overfitting individual data. My concern stems from looking at individual-level choice data. It is a humbling experience to debrief respondents and ask them to explain their choices. Why can't respondents be more attentive to the feature levels in the product description and never mistakenly click on an alternative that they did not intend to select? Limited data and noise at the individual level are excellent reasons for pooling individual and aggregate estimates of feature impact on choice. If done correctly, we are not "making up" data but using the data of others to obtain better estimates of what we would have uncovered had we gathered more data from each individual, with less error in those data.
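The pooling of individual and aggregate estimates can be illustrated with a one-parameter analog in Python. This is not the multinomial logit model or the MCMC sampler that bayesm fits; it is a simple empirical-Bayes shrinkage for a normal-normal model, with the within-person noise variance treated as known. All parameter values are hypothetical.

```python
import random
import statistics


def shrink_toward_mean(ind_means, n_obs, sigma2):
    """Partial pooling: pull each individual mean toward the grand mean.

    The variance of individual means is roughly tau^2 + sigma^2 / n_obs,
    so tau^2 (the spread of true individual values) is estimated by subtraction.
    """
    grand = statistics.fmean(ind_means)
    noise = sigma2 / n_obs
    tau2 = max(statistics.variance(ind_means) - noise, 0.0)
    weight = tau2 / (tau2 + noise)  # weight placed on the direct evidence
    pooled = [grand + weight * (m - grand) for m in ind_means]
    return pooled, weight


def compare_pooling(n_resp=300, n_obs=4, mu=1.0, tau=1.0, sigma=2.0, seed=2):
    """Simulate respondents and compare unpooled with partially pooled estimates."""
    rng = random.Random(seed)
    truth = [rng.gauss(mu, tau) for _ in range(n_resp)]
    # Direct evidence: each respondent's own mean over a few noisy responses.
    ind_means = [statistics.fmean([rng.gauss(t, sigma) for _ in range(n_obs)])
                 for t in truth]
    pooled, weight = shrink_toward_mean(ind_means, n_obs, sigma ** 2)

    def mse(est):
        return statistics.fmean([(e - t) ** 2 for e, t in zip(est, truth)])

    return mse(ind_means), mse(pooled), weight


mse_direct, mse_pooled, weight = compare_pooling()
print(f"weight on direct evidence: {weight:.2f}")
print(f"MSE, direct evidence only: {mse_direct:.3f}")
print(f"MSE, partially pooled:     {mse_pooled:.3f}")
```

With only four noisy responses per respondent, the weight on direct evidence falls well below one, and the pooled estimates should be closer to the simulated truth on average. Collecting more responses per person (a larger `n_obs`) moves the weight toward one, which mirrors the point that pooling matters most when individual data are limited and noisy.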

**Affinity Augmentation: Getting a little help from your "friends"**

So far we have identified two R packages that ask little of an individual respondent and yet yield a good deal of information about that individual. Matrix completion, as implemented in the R package softImpute, fills in missing data by learning about the tastes of others. The R package bayesm accomplishes a similar task by pooling direct evidence from the individual and indirect evidence from others belonging to the same population. In both cases we are augmenting the data from a single individual by borrowing evidence from other respondents who are similar in that they belong to the same population (hierarchical Bayes) or fall into the same neighborhood in the space created by matrix factorization (recommender systems).

I have used the term "affinity augmentation" in order to draw attention to how estimates for a single individual depend on getting a little help from their friends. The two words have been borrowed from data augmentation and network analysis to emphasize that what is added is not unrelated to what is directly observed. Hopefully, something has been gained by placing diverse statistical models under a common heading.

Although I have focused on hierarchical Bayes, all multilevel modeling can be seen as a form of affinity augmentation. Similarly, recommender systems can be viewed as a special case of the duality diagram or of any attempt to model simultaneously the rows (individuals) and columns (variables) of a data matrix. R offers an abundance of packages for such analyses, including ade4 and others for biplots, multiple correspondence analysis, item response theory, and much more.

Shifting perspective can reveal new possibilities. Overfitting reminds us that the responses of a single individual to questions on a marketing survey are not objective truth but a sample from a much larger domain. Moreover, they contain noise. At best, those responses are only indicators of the unobserved latent variables that are our primary interest. Why not incorporate the responses of others? Yet we must be careful with the affinity augmentation process, because the aim is to maximize the return on the data we have collected, not to "make up" data.
