Tuesday, July 21, 2015

"Models, Models Everywhere!" Brought to You by R

Statistical software packages sell solutions. Go to the SAS home page and you are told upfront that the company sells products and solutions; the two are linked under the first tab, just below "The Power to Know" mantra. SPSS gives product and solution separate tabs, but places them side by side on its home page as the first and second clicks. Obviously, both companies are in the solutions business: you have a problem, they have a solution. It is good positioning for attracting customers who are overworked and in over their heads. To be clear, no one is questioning the analytics. SPSS and SAS are not selling snake oil, but they are selling something designed to appeal to potential customers with more money than time to spend.

R, on the other hand, appeals to the analyst looking outside the traditional box filled with a limited set of statistical models that keep us collecting the same data year after year and running the same analysis each time. My example comes from marketing research where we are repeatedly asked to do "something multivariate" with ratings of idealized features (e.g., cost without price points, quality lacking any specifications, and customer service stripped of context). Before you propose to replace the rating with some ranking task (e.g., MaxDiff), let me remind you that the problem is not the rating but the abstract feature without referent.

The solution is to get concrete, if only our analytic tools did not lag behind our data collection capabilities. With decontextualized features we could pretend that we were all on the same page and speaking of the same thing. The details, however, reveal the heterogeneity of product usage and experience. The global space defined by price, quality and service becomes parallel worlds with concentrations of customers paying different amounts for product versions of varying quality with diverging expectations and needs for service. Once the questions get specific, I have many more variables and even more missing data. More importantly, I have non-overlapping customer-feature blocks accompanying each community held together by common usage occasions.
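To make "non-overlapping customer-feature blocks" concrete, here is a minimal sketch in Python using only the standard library. The customers, features and ratings are invented for illustration, and the method is a deliberate simplification: it treats the data as a graph linking each customer to the features they rated and pulls out the connected pieces. Real data rarely separates this cleanly, so take this as a picture of the data structure rather than a substitute for a proper simultaneous clustering model.

```python
# Hypothetical toy data: which features each customer actually rated.
# Unrated cells are simply absent, i.e. missing.
ratings = {
    "cust_a": {"price_basic": 4, "setup_help": 5},
    "cust_b": {"setup_help": 3, "price_basic": 2},
    "cust_c": {"price_pro": 5, "api_uptime": 4},
    "cust_d": {"api_uptime": 2, "priority_support": 5},
    "cust_e": {"price_pro": 3, "priority_support": 4},
}


def find_blocks(ratings):
    """Group customers and features into blocks that share no rated cells.

    Treats the data as a bipartite graph (customer -- feature whenever a
    rating exists) and returns its connected components.
    """
    # Build adjacency in both directions; tag nodes to keep the two sides apart.
    neighbors = {}
    for customer, rated in ratings.items():
        for feature in rated:
            neighbors.setdefault(("row", customer), set()).add(("col", feature))
            neighbors.setdefault(("col", feature), set()).add(("row", customer))

    seen = set()
    blocks = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from `start`.
        stack, rows, cols = [start], set(), set()
        seen.add(start)
        while stack:
            side, name = stack.pop()
            (rows if side == "row" else cols).add(name)
            for nxt in neighbors[(side, name)]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


blocks = find_blocks(ratings)
for customers, features in blocks:
    print(customers, "x", features)
```

In this toy example the two basic-tier customers and the three pro-tier customers never rate the same feature, so the ratings matrix splits into two blocks with nothing but missing cells between them. As soon as a single customer rates a feature from each side, the two blocks merge, which is why data with partial overlap calls for the softer, model-based approaches.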

This characterization of the data as local places within a global space came, not from marketing research, but from matrix factorization techniques for recommender systems. Modeling preferences for movies and songs has altered the way we look at all consumption. Everything has become more complex. The traditional clustering models started with feature selection and one set of variables for everyone. Similarly, although factorial invariance across distinct populations might require some preliminary examination, we believed that ultimately we would be able to identify a common group of respondents with which we could perform all dimension reduction. After Netflix and Spotify, all we can see are niche-genre pairings of customers and product features.


Of course, all of this is brought to you by R. SAS and SPSS need a business model before they incorporate the latest procedures. R, on the other hand, provides a platform for innovation by others, academics and entrepreneurs, willing to share and promote their best work. The result is a continuous stream of new ways of seeing and thinking embedded in a diverse collection of models and algorithms, which we call R packages. You can find a listing of all the innovative approaches for jointly blocking the rows and columns of a data matrix under the heading "Simultaneous Clustering in R" in my post The Ecology of Data Matrices.

Models are everywhere and from everywhere. R provides the interface enabling us to lift our heads out of our box and peek into the box down the road in someone else's field.
