Tuesday, June 2, 2015
Statistical Models with a Point of View: First vs. Third Person
Marketing data can be collected in the first or third person, and we require different statistical models for each point of view.
Netflix encourages you to adopt a third-person perspective when it surveys your taste preferences by asking how often you watch different genres (e.g., action and adventure, comedies, dramas, horror, thrillers and more). Third-person remembering taps those regions of the brain responsible for semantic memory. We respond as we would in a conversation providing general information about ourselves. Is the Twilight Saga horror or romance? It does not matter since we answer about the genre without retrieving specific movies. Neither do we ponder the definition of the response categories: never, sometimes and often. I never watch horror films because they scare me and I avoid them, which I should not know because I never watch them, except for the horror movies that I do see and call thrillers. Such preference ratings are positioning statements about who we think we are and how we wish to present ourselves.
On the other hand, first-person recollection is needed to rate individual movies that we have seen. We answer by reliving the viewing experience, which is impacted by context (where, when, who we were with and what else we were doing while watching). We call this episodic memory, which is different from semantic memory, with its own region of the brain and its own retrieval process. Someone asks if you liked a particular movie and you cannot remember seeing it until they tell you the actors and describe aspects of the plot. Two examples of episodic memory are at work here. First, the movie that was a blur suddenly becomes clear after some detail is mentioned and memories flood your mind. Second, as you read this, you recall in the first person that you have had such an experience in the past.
We analyze data obtained from third-person remembering using the statistical methods that are most familiar to those with a social science background (e.g., regression, factor analysis and structural equation modeling). Semantic data is, well, semantic, and it has a factor structure that reflects the meaning of words. If you say that you like action films, then any genre associated with action will also be liked, where association is found in the way words are used by marketers, film critics and in everyday conversation. If I mention Jurassic Park, you can bring to mind one or more scenes from the film and perhaps even recall some details about your first viewing. You cannot do the same for the "science fiction" category, assuming that Jurassic Park is science fiction and not fantasy/thriller.
My point is that any question about the genre or category will be answered by retrieving general semantic knowledge, including the way we have learned to talk about those genres. Thus, if I ask about usage, satisfaction or importance, I will be tapping the same semantic knowledge structures with all the relationships that you have learned over time by telling others what you think and feel and listening to others tell you what they think and feel. It will not matter whether my measurement is a rating or some type of tradeoff or ranking (e.g., MaxDiff). I am not denying that such information is useful to marketing, only that it is at least one step removed from remembered experiences.
This is not the case with first-person recollection, which forces the respondent to relive the episode. You recently watched a specific movie and now you give it a rating by remembering how you felt and how much you liked it. Over time you can rate many movies, but only a very small fraction of all that are available. Your movie rating data is high-dimensional and sparse. Moreover, this tends to be the case for episodic data in general when the episodes are occasion-based combinations of many factors (e.g., who uses what for this or that purpose at a particular time in a specific place with or without others present).
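To make "high-dimensional and sparse" concrete, here is a minimal Python sketch that counts how much of a viewer-by-movie rating table is actually filled in. The viewers, the ratings and every catalog title other than the two films named in this post are made up for illustration.

```python
# Hypothetical ratings: each viewer rates only the movies they have seen.
ratings = {
    "viewer_a": {"Jurassic Park": 5, "Twilight": 2},
    "viewer_b": {"Jurassic Park": 4},
    "viewer_c": {"Twilight": 4, "Movie C": 3},
}

# Hypothetical catalog: every movie that could have been rated.
catalog = ["Jurassic Park", "Twilight", "Movie C", "Movie D",
           "Movie E", "Movie F", "Movie G", "Movie H"]

# Count the filled cells of the viewer-by-movie table.
observed = sum(len(seen) for seen in ratings.values())
possible = len(ratings) * len(catalog)
density = observed / possible

print(f"{observed} of {possible} cells observed (density {density:.2f})")
```

With a real catalog of many thousands of titles, the number of columns grows while each viewer's row still holds only a handful of ratings, so the density falls toward zero.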
In the third person, we can ask everyone the same question and let them fill in the details. "How important was product quality when you made your purchase?" deliberately leaves product quality open to interpretation. But in the first person we ask about a series of specific events: knowing warranty protection and return policy, reviewing user comments, reading expert evaluations, trial usage in a store or through another user, familiarity with the brand, and so on. Of course, product quality is but one of many purchase criteria, so the list of specific events gets quite long and increasingly sparse since potential customers tend to focus their attention on a subset of all the items in our checklist.
As sparse data become more common, R adds more ways to handle them, with packages for both supervised (glmnet) and unsupervised (sparcl) learning. The new book by Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity, brings together all this work along with matrix decomposition and compressed sensing (which is where one would place nonnegative matrix factorization). High dimensionality ceases to be a curse and turns into a blessing when the additional data reveals an underlying structure that we could not observe until we began to ask in the first person.
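As a rough illustration of how nonnegative matrix factorization can expose structure in a sparse checklist, the sketch below factors a small respondent-by-item table of 0/1 responses using the standard multiplicative updates for squared error. It is written in plain Python only to stay dependency-free; it is not one of the R packages mentioned above, and the checklist data, the function name `nmf` and its parameters are all assumptions made for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, w, h):
    """Sum of squared differences between v and the product w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate v (n x m) by w (n x rank) times h (rank x m), all >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random starting values.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [squared_error(v, w, h)]
    for _ in range(iterations):
        # Update h: h * (w^T v) / (w^T w h), element by element.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # Update w: w * (v h^T) / (w h h^T), element by element.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors

# Hypothetical checklist: rows are respondents, columns are specific events
# (1 = the respondent reported that event, 0 = did not).
checklist = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1],
]

w, h, errors = nmf(checklist, rank=2)
print("error before and after:", round(errors[0], 3), round(errors[-1], 3))
for row in h:
    print([round(x, 2) for x in row])
```

Each row of `h` weights the checklist items that tend to be reported together, and each row of `w` says how strongly a respondent draws on each of those groupings. The rank of 2 is simply chosen to suit this toy table; with real data it would have to be selected.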