Wednesday, July 29, 2015

But I Don't Want to Be a Statistician!

"For a long time I have thought I was a statistician.... But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.... All in all, I have come to feel that my central interest is in data analysis...."

Opening paragraph from John Tukey's "The Future of Data Analysis" (1962)

To begin, we must acknowledge that these labels are largely administrative, determined by who signs your paycheck. Still, I prefer the name "data analysis" with its active connotation. I understand the desire to rebrand data analysis as "data science" given the availability of so much digital information. As data has become big, it has become the star and the center of attention.

We can borrow from Breiman's two cultures of statistical modeling to clarify the changing focus. If our data collection is directed by a generative model, we are members of an established data modeling community and might call ourselves statisticians. The algorithmic modeler, on the other hand (originally considered a deviant, though now rich and sexy), took whatever data was available and made black-box predictions. If you need a guide to applied predictive modeling in R, Max Kuhn's work might be a good place to start.
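Breiman's distinction can be made concrete with a toy sketch. Everything below is hypothetical and for illustration only, and Python with numpy stands in for the R tooling the post has in mind: the data modeler posits a generative equation and interprets its estimated parameters, while the algorithmic modeler fits a black box and asks only how well it predicts held-out cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one predictor, a noisy linear response.
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Data-modeling culture: posit y = b0 + b1*x + noise, then estimate
# and interpret the parameters themselves.
X = np.column_stack([np.ones_like(x_train), x_train])
b0, b1 = np.linalg.lstsq(X, y_train, rcond=None)[0]

# Algorithmic culture: a k-nearest-neighbor black box, judged only by
# its prediction error on data it has not seen.
def knn_predict(x_new, k=5):
    idx = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[idx].mean()

lm_mse = np.mean((y_test - (b0 + b1 * x_test)) ** 2)
knn_mse = np.mean((y_test - np.array([knn_predict(v) for v in x_test])) ** 2)
```

The first culture reports b0 and b1 as claims about how the world works; the second reports nothing but the two held-out error rates and lets the better predictor win.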

Nevertheless, causation keeps sneaking in through the back door in the form of causal networks. For example, choice modeling can be justified as "as if" predictive modeling, but then it cannot be used for product design or pricing. As Judea Pearl notes, most data analysis is "not associational but causal in nature."

Does an inductive bias or schema predispose us to see the world as divided into causes and effects, with features creating preference and preference impacting choice? Technically, the hierarchical Bayes choice model does not require the experimental manipulation of feature levels, for example, reporting the likelihood of bus ridership for individuals with differing demographics. Even here, it is difficult not to see causation at work, with demographics becoming stereotypes. We want to be able to turn the dial, or at least select different individuals, and watch choices change. Are such cognitive tendencies part of statistics?

Moreover, data visualization has always been an integral component of the R statistical programming language. Is data visualization statistics? And what of presentations like Hans Rosling's Let My Dataset Change Your Mindset? Does statistics include argumentation and persuasion?

Hadley Wickham and the Cognitive Interpretation of Data Analysis

You have seen all of his data manipulation packages in R, but you may have missed the theoretical foundations in the paper "A Cognitive Interpretation of Data Analysis" by Grolemund and Wickham. Sensemaking is offered as an organizing force with data analysis as an external tool to aid understanding. We can make sensemaking less vague with an illustration.

Perceptual maps are graphical displays of a data matrix, such as the one below from an earlier post showing the association between 14 European car models and 27 attributes. Our familiarity with Euclidean spaces aids in the interpretation of the 14 x 27 association table. The map summarizes the data in a picture and enables us to speak of repositioning car models. The joint plot can be seen as the competitive landscape, and soon the language of marketing warfare brings this simple 14 x 27 table to life. Where is the high ground or an opening for a new entry? How can we guard against an attack from below? This is sensemaking, but is it statistics?
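The map itself comes from a standard computation. The post's 14 x 27 table was presumably analyzed in R, where correspondence analysis is the usual tool, but the arithmetic is simple enough to sketch directly. The small brand-by-attribute count matrix below is hypothetical, a stand-in for the real data, and Python with numpy stands in for the R packages.

```python
import numpy as np

# Hypothetical 4 x 3 brand-by-attribute count matrix (a stand-in for
# the post's 14 x 27 association table).
N = np.array([[20,  5, 10],
              [ 4, 18,  6],
              [ 9,  7, 15],
              [12, 11,  8]], dtype=float)

P = N / N.sum()                     # correspondence matrix
r = P.sum(axis=1)                   # row masses (brands)
c = P.sum(axis=0)                   # column masses (attributes)

# Standardized residuals: departures from independence, scaled.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, d, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates place rows (brands) and columns (attributes)
# in the same low-dimensional space.
rows = (U * d) / np.sqrt(r)[:, None]
cols = (Vt.T * d) / np.sqrt(c)[:, None]
```

Plotting the first two columns of `rows` and `cols` together yields the joint map; the competitive landscape the post describes is a reading of exactly this kind of display.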

I consider myself to be a marketing researcher, though with a PhD, I get more work calling myself a marketing scientist. I am a data analyst and not a statistician, yet in casual conversation I might say that I am a statistician in the hope that the label provides some information. It seldom does.

I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing. My audience has no training in research or mathematics. Statistics plays a role and R helps, but I never wanted to be a statistician. Not that there is anything wrong with that.


  1. I'm saddened by the current "belief" that there is a difference between a statistician and a data analyst. There is none. The difference we should focus on is the difference between researchers who use data to test hypotheses and those who use data to generate hypotheses.

    This difference is a lot more important than the definition of a statistician, or the parametric versus non-parametric approach (because that is what the Breiman division essentially boils down to). The main reason this difference is so important is that far too many researchers believe they are testing hypotheses when they are actually generating them. And this is, in my opinion, the main reason why only a few scientific conclusions based on whatever type of modeling can withstand the reproducibility test, despite all the bootstrapping and cross-validation.

    You don't need a whole lot of math to look at data. What you do need is a good understanding of how any kind of algorithm reacts to numbers, before you draw conclusions that shouldn't be drawn.

  2. -- I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing.

    Whether one uses data or intuition or thought experiments, the point of marketing, in commerce, is to do at least as well as Jobs's Reality Distortion Field, which motivates consumers to pay well above the bill of materials (BoM) of the widget, thus generating outsize gross margins.

    Whether manipulating potential consumers to do so is a righteous use of our collective brains has been at the heart of much social quant over the decades.

    1. Although this comment is somewhat off topic, the debate is an interesting one. The Boston Review published a forum discussing The New Politics of Consumption. Juliet Schor supports your point, but I suggest that you read the responses by Douglas Holt and Craig Thompson.

  3. A very interesting article. My understanding is that "statistical predictive models" are based on a 'Particular-General-Particular' (PGP) principle. Data is collected in the 'particular' (P principle) from a few sample points; the model is generated in the 'general' (G principle), i.e., from the summarized collected data; the resulting model is then applied to a "future" sample point as a 'particular' (P principle). If ALL the good prognostic factors are included, the resulting predictive model will be efficient. But most predictive models are inconclusive because they use only demographic variables, and the other important factors are usually unknown. Yet it is the statistical analysis (or the statistician) that will be questioned, and not the selection of factors or the quality of the data!