Sunday, June 28, 2015

Generalizing from Marketing Research: The Right Question and the Correct Analysis

In every study the marketing researcher asks some version of the same question: "Tell me what you want." The rest is a summary of the notes taken during the ensuing conversation.



Steve Jobs' quote suggests that we might do better by getting a reaction to an actual product. You tell me that price is not particularly important to you, yet this one here costs too much. You claim that design is not an issue, except you love the look of the product shown. In casual discussion color does not matter, but that shade of aqua is ugly and you will not buy it.

Although Steve Jobs was speaking of product design using focus groups, we are free to apply his rule to all decontextualized research. "Show it to them" provides the context for product design when we embed the showing within a usage occasion. On the other hand, if you seek incremental improvements to current products and services, you ask about problems experienced or extensions desired in concrete situations because that is the context within which these needs arise. Of course, we end up with a lot more variables in our datasets as soon as we start asking about the details of feature preference or product usage.

For example, instead of rating the importance of color in your next purchase of a car, suppose that you are shown a color array with numerous alternatives, to which many of your responses are likely to be "no" or "not applicable" because some colors are associated with options you are not buying. Yet, this is the context within which cars are purchased, and the manufacturer must be careful not to lose a customer when no color option is acceptable. In order to respond to the rating question, the car buyer searches memory for instances of "color problems" in the past. The manufacturer, on the other hand, is concerned about "color problems" in the future when only a handful of specific color combinations are available. Importance is simply the wrong question given the strategic issues.

Because the resulting data are high dimensional and sparse, they will be difficult to analyze with traditional multivariate techniques. This is where R makes its contribution by offering tools from machine and statistical learning designed for the sparse and high-dimensional data that are produced whenever we provide a context.

We find such data in fragmented product categories, where diverse consumer segments shop within distinct distribution channels for non-overlapping products and features (e.g., music purchases by young teens and older retirees). We can turn to R packages for nonnegative matrix factorization (NMF) and matrix completion (softImpute) to exploit such fragmentation and explain the observed high-dimensional and sparse data in terms of a much smaller set of inferred benefits.
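
As a rough illustration of the matrix completion route, here is a minimal sketch using softImpute on a toy ratings-style matrix; the dimensions, sparsity level and tuning values are all made up for illustration, not taken from any study.

# hedged sketch: a sparse matrix with mostly missing entries (NA),
# completed with a low-rank factorization via softImpute
library(softImpute)

set.seed(123)
ratings <- matrix(NA, nrow = 50, ncol = 20)
observed <- sample(length(ratings), 200)              # only 20% of cells observed
ratings[observed] <- sample(1:5, 200, replace = TRUE)

fit <- softImpute(ratings, rank.max = 3, lambda = 1, type = "als")
completed <- complete(ratings, fit)                   # fill in the missing cells
round(completed[1:5, 1:5], 1)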

What does your car color say about you? It's a topic discussed in the media and among friends. It is a type of collaboration among purchasers who may have never met yet find themselves in similar situations and satisfy their needs in much the same manner. A particular pattern of color preferences has meaning only because it is shared by some community. Matrix factorization reveals that hidden structure by identifying the latent benefits responsible for the observed color choices.

I may be mistaken, but I imagine that Steve Jobs might find all of this helpful.

Monday, June 22, 2015

Looking for Preference in All the Wrong Places: Neuroscience Suggests Choice Model Misspecification

At its core, choice modeling is a utility estimating machine. Everything has a value reflected in the price that we are willing to pay in order to obtain it. Here is a collection of Smart Watches from a search of Google Shopping. You are free to click on any one, look for more, or opt out altogether and buy nothing.


Where is the utility? It is in the brand name, the price, the user ratings and any other feature that gets noticed. If you pick only the Apple Smartwatch at its lowest price, I conclude that brand and price have high utility. It is a somewhat circular definition: I know that you value the Apple brand because you choose it, and you pick the Apple watch because the brand has value. We seem to be willing to live with such circularity as long as utility measured in one setting can be generalized over occasions and conditions. However, context matters when modeling human judgment and choice, making generalization a difficult endeavor. Utility theory is upended when higher prices alter perceptions so that the same food tastes better when it costs more.

What does any of this have to do with neuroscience? Utility theory was never about brain functioning. Glimcher and Fehr make this point in their brief history of neuroeconomics. Choice modeling is an "as if" theory claiming only that decision makers behave as if they assigned values to features and selected the option with the optimal feature mix.

When the choice task has been reduced to a set of feature comparisons, as is common practice in most choice modeling, the process seems to work, at least in the laboratory (i.e., care must be taken to mimic the purchase process, and adjustment may be needed when making predictions about real-world market shares). Yet, does this describe what one does when looking at the above product display from Google Shopping? I might try to compare the ingredients listed on the back of two packages while shopping in the supermarket. However, most of us find that this task quickly becomes too difficult as the number of features exceeds our short-term memory limits (paying attention is costly).

Incorporating Preference Construction

Neuroeconomics suggests how value is constructed on-the-fly in real world choice tasks. Specifically, reinforcement learning is supported by multiple systems within the brain: "dual controllers" for both the association of features and rewards (model-free utilities) and the active evaluation of possible futures (model-based search). Daw, Niv and Dayan identified the two corresponding regions of the brain and summarized the supporting evidence back in 2005.

Features can become directly tied to value so that the reward is inferred immediately from the presence of the feature. Moreover, if we think of choice modeling only as the final stages when we are deciding among a small set of alternatives in a competitive consideration set, we might mistakenly conclude that utility maximization describes decision making. As in the movies, we may wish to "flashback" to the beginning of the purchase process to discover the path that ended at the choice point where features seem to dominate the representation.

Perception, action and utility are all entangled in the wild, as shown by the work of Gershman and Daw. Attention focuses on the most or least desirable features in the context of the goals we wish to achieve. We simulate the essentials of the consumption experience and ignore the rest. Retrospection is remembering the past, and prospection is experiencing the future. The steak garners greater utility sizzling than raw because it is easier to imagine the joy of eating it.

While the cognitive scientist wants to model the details of this process, the marketer will be satisfied learning enough to make the sale and keep the customer happy. In particular, marketing tries to learn what attracts attention, engages interest and consideration, generates desire and perceived need, and drives purchase while retaining customers (i.e., the AIDA model). These are the building blocks from which value is constructed.

Choice modeling, unfortunately, can identify the impact of features only within the confines of a single study and encounters difficulties when attempting to generalize any effects beyond the data collected. Many of us are troubled that even relatively minor changes can alter the framing of the choice task or direct attention toward a previously unnoticed aspect (Attention and Reference Dependence).

The issue is not one of data collection or statistical analysis. The R package support.BWS will assist with the experimental design, and other R packages such as bayesm, RSGHB and ChoiceModelR will estimate the parameters of a hierarchical Bayes model. No, the difficulty stems from needing to present each respondent with multiple choice scenarios. Even if we limit the number of choice sets that any one individual evaluates, we are still forced to simplify the task in order to show all the features for all the alternatives in the same choice set. In addition, multiple choice sets impose demands for consistency, so respondents prefer a choice strategy that can be reused over and over without much effort just to get through the questionnaire. On the other hand, costly information search is eliminated, and there is no anticipatory regret or rethinking of one's purchase since no money actually changes hands. In the end, our choice model is misspecified for two reasons: it does not include the variables that drive purchase in real markets, and the reactive effects of the experimental arrangements create confounds that do not occur outside of the study.
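
For completeness, here is a minimal sketch of the kind of hierarchical Bayes estimation those packages provide, using bayesm's rhierMnlRwMixture on simulated choice data. The design, sample sizes, part-worths and MCMC settings are invented for illustration, and fitting the model says nothing about the misspecification problem discussed above.

# hedged sketch: hierarchical Bayes multinomial logit on simulated choice sets
library(bayesm)

set.seed(456)
nresp <- 100; ntask <- 8; nalt <- 3; nvar <- 4           # illustrative sizes only
lgtdata <- vector("list", nresp)
for (i in 1:nresp) {
  X <- matrix(rnorm(ntask * nalt * nvar), ncol = nvar)   # attribute levels
  beta <- c(1, -1, 0.5, 0) + rnorm(nvar, sd = 0.5)       # individual part-worths
  u <- matrix(X %*% beta, nrow = ntask, byrow = TRUE)    # utilities by task
  eps <- matrix(-log(-log(runif(ntask * nalt))), nrow = ntask, byrow = TRUE)
  y <- apply(u + eps, 1, which.max)                      # chosen alternative
  lgtdata[[i]] <- list(y = y, X = X)
}

out <- rhierMnlRwMixture(Data = list(p = nalt, lgtdata = lgtdata),
                         Prior = list(ncomp = 1),
                         Mcmc = list(R = 2000, keep = 10))
beta_hat <- apply(out$betadraw, c(1, 2), mean)           # posterior mean part-worths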

Measuring the Forces Shaping Preference

Consumption is not random but structured by situational need and usage occasion. "How do you intend to use your Smartwatch?" is a good question to ask when you begin your shopping, although we will need to be specific because small differences in usage can make a large difference in what is purchased. To be clear, we are not looking for well-formed preferences, for instance, feature importance or contribution to purchase. Instead, we focus on attention, awareness and familiarity that might be precursors or early phases of preference formation. If you own an iPhone, you might never learn about Android Wear. What, if anything, can we learn from the apps on your Smartphone?

I have shown how the R package NMF for nonnegative matrix factorization can uncover these building blocks. We might wish to think of NMF as a form of collaborative filtering, not unlike a recommender system that partitions users into cliques or communities and products into genres or types (e.g., sci-fi enthusiasts and the fantasy/thrillers they watch). An individual pattern of awareness and familiarity is not very helpful unless it is shared by a larger community with similar needs arising from common situations. Product markets evolve over time by appealing to segments willing to pay more for differentiated offerings. In turn, the new offerings solidify customer differences. This separation of products and users into segregated communities forms the scaffolding used to construct preferences, and this is where we should begin our research and statistical analysis.
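
As a sketch of the kind of decomposition being described, a tiny simulated awareness matrix stands in for real app or familiarity data; the block structure and sizes are invented.

# hedged sketch: NMF recovers two simulated "communities" from 0/1 awareness data
library(NMF)

set.seed(789)
block1 <- matrix(rbinom(50 * 10, 1, 0.7), 50, 10)      # items this community knows
block2 <- matrix(rbinom(50 * 10, 1, 0.1), 50, 10)      # items it rarely encounters
awareness <- rbind(cbind(block1, block2), cbind(block2, block1))

fit <- nmf(awareness, rank = 2, method = "lee", nrun = 10)
round(coef(fit), 2)       # latent profiles over the 20 items (the "genres")
basis(fit)[c(1, 100), ]   # membership weights for one member of each community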

Tuesday, June 2, 2015

Statistical Models with a Point of View: First vs. Third Person



Marketing data can be collected in the first or third person, and we require different statistical models for each point of view.




Netflix encourages you to adopt a third-person perspective when it surveys your taste preferences by asking how often you watch different genres (e.g., action and adventures, comedies, dramas, horror, thrillers and more). Third-person remembering taps those regions of the brain responsible for semantic memory. We respond as we would in a conversation, providing general information about ourselves. Is the Twilight Saga horror or romance? It does not matter since we answer about the genre without retrieving specific movies. Neither do we ponder the definition of the response categories: never, sometimes and often. I never watch horror films because they scare me and I avoid them (though how would I know that if I never watch them?), except for the horror movies that I do see and call thrillers. Such preference ratings are positioning statements about who we think we are and how we wish to present ourselves.

On the other hand, first-person recollection is needed to rate individual movies that we have seen. We answer by reliving the viewing experience, which is impacted by context (where, when, who we were with and what else we were doing while watching). We call this episodic memory, which is different from semantic memory, with its own region of the brain and its own retrieval process. Someone asks if you liked a particular movie and you cannot remember seeing it until they tell you the actors and describe aspects of the plot. Both of these are examples of episodic memory. First, the movie that was a blur suddenly becomes clear after some detail is mentioned and memories flood your mind. The second example is your first-person recollection that you have had such an episodic memory experience in the past.

We analyze data obtained from third-person remembering using the statistical methods that are most familiar to those with a social science background (e.g., regression, factor analysis and structural equation modeling). Semantic data is, well, semantic, and it has a factor structure that reflects the meaning of words. If you say that you like action films, then any genre associated with action will also be liked, where association is found in the way words are used by marketers, film critics and in everyday conversation. If I mention Jurassic Park, you can bring to mind one or more scenes from the film and perhaps even recall some details about your first viewing. You cannot do the same for the "science fiction" category, assuming that Jurassic Park is science fiction and not fantasy/thriller.

My point is that any questions about the genre or category will be answered by retrieving general semantic knowledge, including the way we have learned to talk about those genres. Thus, if I ask about usage, satisfaction or importance, I will be tapping the same semantic knowledge structures with all the relationships that you have learned over time by telling others what you think and feel and listening to others tell you what they think and feel. It will not matter whether my measurement is a rating or some type of tradeoff or ranking (e.g., MaxDiff). I am not denying that such information is useful to marketing, only that it is at least one step removed from remembered experiences.

This is not the case with first-person recollection, which forces the respondent to relive the episode. You recently watched a specific movie and now you give it a rating by remembering how you felt and how much you liked it. Over time you can rate many movies, but only a very small fraction of all that is available. Your movie rating data is high-dimensional and sparse. Moreover, this tends to be the case for episodic data in general when the episodes are occasion-based combinations of many factors (e.g., who uses what for this or that purpose at a particular time in a specific place with or without others present).

In the third person, we can ask everyone the same question and let them fill in the details. "How important was product quality when you made your purchase?" deliberately leaves product quality open to interpretation. But in the first person we ask about a series of specific events: knowing the warranty protection and return policy, reviewing user comments, reading expert evaluations, trial usage in a store or through another user, familiarity with the brand, and so on. Of course, product quality is but one of many purchase criteria, so the list of specific events gets quite long and increasingly sparse since potential customers tend to focus their attention on a subset of all the items in our checklist.

As sparse data become more common, R adds more ways to handle them with both supervised (glmnet) and unsupervised (sparcl) packages. The new book by Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity, brings together all this work along with matrix decomposition and compressed sensing (which is where one would place nonnegative matrix factorization). High dimensionality ceases to be a curse and turns into a blessing when the additional data reveal an underlying structure that we could not observe until we began to ask in the first person.
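
A toy illustration of the supervised side: the lasso in glmnet picking a handful of active columns out of a wide, sparse matrix of first-person event indicators. Everything below is simulated; the variable counts and coefficients are arbitrary.

# hedged sketch: lasso regression on a wide, sparse 0/1 design matrix
library(glmnet)

set.seed(101)
n <- 200; p <- 500
x <- matrix(rbinom(n * p, 1, 0.05), n, p)       # sparse event indicators
beta <- c(rep(2, 5), rep(0, p - 5))             # only the first 5 events matter
y <- as.vector(x %*% beta + rnorm(n))

cv_fit <- cv.glmnet(x, y)                       # cross-validated lasso
coef(cv_fit, s = "lambda.min")[1:10, ]          # nonzero rows flag the active events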

Saturday, May 30, 2015

Top of the Heap: How Much Can We Learn from Partial Rankings?

The recommendation system gives you a long list of alternatives, but the consumer clicks on only a handful: most appealing first, then the second best, and so on until they stop, with all the remaining alternatives receiving the same implicit rating: not interesting enough to learn more about. As a result, we know the preference order for only the most preferred. Survey research may duplicate this process by providing many choices and asking only for your top three selections - your first, second and third picks. This post will demonstrate that by identifying clusters with different ranking schemes (mixtures of rankings), we can learn a good deal about consumer preferences across all the alternatives from observing only a limited ordering of the most desired (partial top-K rankings).

However, we need to remain open to the possibility that our sample is not homogeneous but contains mixtures of varying ranking schemes. To be clear, the reason for focusing on the top-K rankings is that we have included so many alternatives that no one person will be able to respond to all of them. For example, the individual is shown a screen filled with movies, songs, products or attributes and asked to pick the best of the list in order of preference. Awareness and familiarity will focus attention on some subset, but not the same subset for everyone. We should recall that, with N alternatives and only the top K ranked, the remaining N-K options will not be selected and will thus be given zeros. Consequently, with individuals in rows and alternatives as columns, no one should be surprised to discover that the data matrix has a blockcluster appearance (as in the R package with the same name).
To see how all this works in practice, we begin by generating complete ranking data using the simulISR( ) function from the R package Rankcluster. The above graphic, borrowed from Wikipedia, illustrates the Insertion Sort Ranking (ISR) process that Rankcluster employs to simulate rankings. We start with eight objects in random order and sort them one at a time in a series of paired comparisons. However, the simulation function from Rankcluster allows us to introduce heterogeneity by setting a dispersion parameter called pi. That is, we can generate a sample of individuals sharing a common ranking scheme, yet with somewhat different observed rankings from the addition of an error component.

As an example, everyone intends to move #7 to be between #6 and #8, but some proportion of the sample may make "mistakes," with that proportion controlled by pi. Of course, the error could represent an overlap in the values associated with #6 and #7 or #7 and #8 so that sometimes one looks better and other times it seems the reverse (sensory discrimination). Regardless, we do not generate a set of duplicate rankings. Instead, we have a group of ranks distributed about a true rank. The details can be found in their technical paper.

You will need to install the Rankcluster and NMF packages in order to run the following R code.

# Rankcluster needed to simulate rankings
library(Rankcluster)
 
# 100 respondents with pi=0.90
# who rank 20 objects from 1 to 20
rank1<-simulISR(100, 0.90, 1:20)
 
# 100 respondents with pi=0.90
# who rank 20 object in reverse order
rank2<-simulISR(100, 0.90, 20:1)
 
# check the mean rankings
apply(rank1, 2, mean)
apply(rank2, 2, mean)
 
# row bind the two ranking schemes
rank<-rbind(rank1,rank2)
 
# set ranks 6 to 20 to be 0s
top_rank<-rank
top_rank[rank>5]<-0
 
# reverse score so that the
# scores now represent intensity
focus<-6-top_rank
focus[focus==6]<-0
 
# use R package NMF to uncover
# mixtures with different rankings
library(NMF)
fit<-nmf(focus, 2, method="lee", nrun=20)
 
# the columns of h transposed
# represent the ranking schemes
h<-coef(fit)
round(t(h))
 
# w contains the membership weights
w<-basis(fit)
 
# hard clustering
type<-max.col(w)
 
# cross-tab recovered type against the true scheme
# to validate the mixture recovery
table(type,c(rep(1,100),rep(2,100)))


We begin with the simulISR( ) function simulating two sets of 100 rankings each. The function takes three arguments: the number of rankings to be generated, the value of pi, and the reference ranking of the objects. The sequence 1:20 in the first ranking scheme indicates that there will be 20 objects ordered from first to last. Similarly, the sequence 20:1 in the second ranking scheme inputs 20 objects ranked in reverse from last to first. We concatenate the data produced by the two ranking schemes and set three-quarters of the rankings to 0 as if only the top-5 rankings were provided. Finally, the scale is reversed so that the nonnegative values suggest greater intensity with five as the highest score.

The R package NMF performs the nonnegative matrix factorization with the number of latent features set to two, the number of ranking schemes generating the data. I ask that you read an earlier post for the specifics of how to use the R package NMF to factor partial top-K rankings. More generally though, we are inputting a sparse data matrix with zeros filling 75% of the space. We are trying to reproduce that data matrix (call it V) by multiplying two matrices. One has a row for every respondent (w in the R code), and the other has a column for every object that was ranked (h in the R code). What links these two matrices is the number of latent features, which in this case happens also to be two because we simulated and concatenated two ranking schemes.
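
Continuing the code above, the reconstruction can be checked directly from the fitted object (a follow-up to the earlier block, not a new analysis):

# V is approximated by the product of the basis (w) and coefficient (h) matrices
V_hat <- basis(fit) %*% coef(fit)
round(V_hat[1:5, 1:5], 1)     # compare with focus[1:5, 1:5]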



Let us say that we placed 20 bottles of wine along a shelf so that the cheapest was in the first position on the left and the most expensive was last on the shelf at the far right. These are actual wines, so most would agree that the higher-priced bottles tend to be of higher quality. Then, our two ranking schemes could be called "price sensitivity" and "demanding palate" (feel free to substitute less positive labels if you prefer). If one could only be Price Sensitive or Demanding Palate and nothing in between, then you would expect precisely 1 to 20 and 20 to 1 rankings for everyone in each segment, respectively, assuming perfect knowledge and execution. In practice, however, some of our drinkers may be unaware that #16 received a higher rating than #17 or may simply give it the wrong rank. This is encoded in our pi parameter (pi=0.90 in this simulation). Still, if I knew your group membership and the bottle's position, I could predict your ranking with a degree of accuracy that varies with pi.

Nonnegative matrix factorization (NMF) seeks to recover the latent features separating the wines and the latent feature membership for each drinker from the data matrix, which you recall does not contain complete rankings but only the partial top-K. Since I did not set the seed, your results will be similar, but not identical, to the following decomposition.

h (transposed): latent feature coefficients for the 20 wines

         Demanding Palate   Price Sensitivity
C1                0                368
C2                0                258
C3                0                145
C4                4                111
C5               18                 68
C6               49                 80
C7               33                 59
C8               49                 61
C9               45                 50
C10             112                 31
C11              81                 30
C12              63                  9
C13              79                 25
C14              67                 18
C15              65                 28
C16              79                 28
C17              85                 14
C18              93                  5
C19             215                  0
C20             376                  0

w: membership weights for the 200 respondents (R10 through R192 omitted)

         Demanding Palate   Price Sensitivity
R1          0.00000             0.01317
R2          0.00100             0.00881
R3          0.00040             0.00980
R4          0.00105             0.00541
R5          0.00000             0.01322
R6          0.00000             0.01207
R7          0.00291             0.00541
R8          0.00361             0.00416
R9          0.00242             0.01001
...
R193        0.01256             0.00000
R194        0.00366             0.00205
R195        0.01001             0.00030
R196        0.00980             0.00000
R197        0.00711             0.00028
R198        0.00928             0.00065
R199        0.01087             0.00000
R200        0.01043             0.00000

The 20 columns from transposed h are presented first, and then the first few rows followed by the last rows from w. These coefficients will reproduce the data matrix, which contains numbers from 0 to 5. For instance, the reproduced score for the first respondent for the first object is 0*0.00000 + 368*0.01317 = 4.84656 or almost 5, suggesting that they most prefer the cheapest wine. In a similar fashion, the last row, R200, gives greater weight to the first latent feature, and that feature assigns its largest coefficients to the higher end of the wine continuum.

Clearly, there are some discrepancies toward the middle of the wine rankings, yet the ends are anchored. This makes sense given that we have data only on the top-5 rankings. Our knowledge of the ten objects in the middle comes solely from the occasional misordering of pairwise comparisons controlled by pi=0.90. In the aggregate we seem to be able to see some differentiation even though we did not gather any individual data after the Kth position. Hopefully, C1 represents wine in a box and C20 is a famous vintage from an old village with a long wine history, making our interpretation of the latent features easier.

When I run this type of analysis with actual marketing data, I typically uncover many more latent features and find respondents with sizable membership weightings spread across several of those latent features. Preference for wine is based on more than a price-quality tradeoff, so we would expect to see other latent features accounting for the top-5 rankings (e.g., the reds versus the whites). The likelihood that an object makes it into the top-5 selection is a decreasing function of its rank order across the entire range of options, so we might anticipate some differentiation even when the measurement is as coarse as a partial ranking. NMF will discover that decomposition and reproduce the original rankings as I have shown with the above example. It seems that there is much we can learn from partial rankings.

Tuesday, May 26, 2015

Respecting Real-World Decision Making and Rejecting Models That Do Not: No MaxDiff or Best-Worst Scaling




Utility has been reified, and we have committed the fallacy of misplaced concreteness.





As this link illustrates, Sawtooth's MaxDiff provides an instructive example of reification in marketing research. What is the contribution of "clean bathrooms" when selecting a fast food restaurant? When using the drive-thru window, the cleanliness of the bathrooms is never considered, yet that is not how we answer that self-report question, either on a rating scale or in a best-worst choice exercise. Actual usage never enters the equation. Instead, the wording of the question invites us to enter a Platonic world of ideals inhabited by abstract concepts of "clean bathrooms" and "reasonable prices" where everything can be aligned on a stable and context-free utility scale. We "trade off" the semantic meanings of these terms, with the format of the question shaping our response; such is the nature of self-reports (see especially the Self-Reports paper from 1999).

On the other hand, in the real world sometimes clean bathrooms don't matter (drive-thru) and sometimes they are the determining factor (stopping along the highway during a long drive). Of course, we are assuming that we all agree on what constitutes a clean bathroom and that the perception of cleanliness does not depend on the comparison set (e.g., a public facility without running water). Similarly, "reasonable prices" has no clear referent, with each respondent applying their own range each time they see the item in a different context.

It is just all so easy for a respondent to accept the rules of the game and play without much effort. The R package support.BWS (best-worst scaling) will generate the questionnaire with only a few lines of code. You can see two of the seven choice sets below. When the choice sets have been created using a balanced incomplete block design, a rank ordering of the seven fruits can be derived by subtracting the number of worst selections from the number of best picks (see the short counting sketch after the example sets below). It is called "best-worst scaling" because you pick the best and worst from each set. Since the best-worst choice also identifies the pair that is most separated, some use the term MaxDiff rather than best-worst.

Q1
  Best   Items     Worst
  [ ]    Apple      [ ]
  [ ]    Banana     [ ]
  [ ]    Melon      [ ]
  [ ]    Pear       [ ]

Q2
  Best   Items     Worst
  [ ]    Orange     [ ]
  [ ]    Grapes     [ ]
  [ ]    Banana     [ ]
  [ ]    Melon      [ ]
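
A minimal sketch of the best-minus-worst counting described above, using made-up tallies rather than output from support.BWS; the seventh fruit in the design is not named in the example sets, so it appears here only as a placeholder.

# hedged sketch: aggregate best-worst scores from hypothetical counts
best  <- c(Apple = 18, Banana = 30, Melon = 12, Pear = 10,
           Orange = 22, Grapes = 25, Fruit7 = 23)     # Fruit7 is a placeholder
worst <- c(Apple = 20, Banana =  8, Melon = 25, Pear = 28,
           Orange = 15, Grapes = 10, Fruit7 = 34)
bw_score <- best - worst                  # best minus worst for each item
sort(bw_score, decreasing = TRUE)         # implied rank ordering of the seven fruits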

The terms of play require that we decontextualize in order to make a selection. Otherwise, we could not answer. I love apples, but not for breakfast, and they can be noisy and messy to eat in public. Grapes are good to share, and bananas are easy to take with you in a purse, a bag or a coat pocket. Now, if I am baking a pie or making a salad, it is an entirely different story. Importantly, this is where we find utility, not in the object itself, but in its usage. It is why we buy, and therefore, usage should be marketing's focus.

"Hiring Milkshakes"

Why would 40% of the milkshakes be sold in the early morning? The above link will explain the refreshment demands of the AM commute to work. It will also remind you of the wisdom from Theodore Levitt that one "hires" the quarter inch drill bit in order to produce the quarter inch hole. Utility resides not in the drill bit but in the value of what can be accomplished with the resulting hole. Of course, one buys the power tool in order to do much more than make holes, which brings us to the analysis of usage data.

In an earlier post on taking inventory, I outlined an approach for analyzing usage data when the most frequent response was no, never, none or not applicable. Inquiries about usage access episodic memory, so the probes must be specific. The occasion needs to be mentioned for special purchases that would not be recalled without it. The result is a high-dimensional and sparse data matrix. Thus, while the produce market is filled with different varieties of fruit that can be purchased for various consumption occasions, the individual buyer samples only a small subset of this vast array. Fortunately, R provides a number of approaches, including the nonnegative matrix factorization (NMF) outlined in my taking inventory post. We should be careful not to forget that context matters when modeling human judgment and choice.

Note: I believe that the R package support.BWS was added to CRAN about the time that I posted "Why doesn't R have a MaxDiff package?". As its name implies, the package supports the design, administration and analysis of best-worst scaling data. However, support.BWS does not attempt to replicate the hierarchical Bayes estimation implemented in Sawtooth's MaxDiff, which is what I meant by saying that R does not have a MaxDiff package.

Wednesday, May 20, 2015

Clusters Powerful Enough to Generate Their Own Subspaces

Clusters are groupings that have no external label. We start with entities described by a set of measurements but no rule for sorting them by type. Mixture modeling makes this point explicit with its equation showing how each measurement is an independent draw from one of K possible distributions.
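
Written out (my notation; the original post displayed the equation as an image), the Gaussian mixture density is

p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1.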


Each row of our data matrix contains the measurements for a different object, represented by the vector x in the above equation. If all the rows came from a single normal distribution, then we would not need the subscript k. However, we have a mixture of populations, so each measurement vector comes from one of the K groups with probability given by the Greek letter pi. If we knew which of the K groups generated an observation, then we would know the mean mu and covariance matrix sigma describing the Gaussian distribution that produced it.


The above graphical model attempts to illustrate the entire process using plate notation. That is, the K and the N in the lower right corner of the two boxes indicate that we have chosen not to show all of the K or N different boxes, one for each group and one for each observation, respectively. The arrows represent directed effects, so group membership in the box with [K] sits outside the measurement process. Once the group is known, the corresponding mean and covariance act as input to generate one of the i = 1,...,N observations.

This graphical model describes a production process that may be responsible for our data matrix. We must decide on a value for K (the number of clusters) and learn the probabilities for each of the K groups (pi is a K-valued vector). But we are not done estimating parameters. Each of the K groups has a mean vector and a variance-covariance matrix that must be estimated, and both depend on the number of columns (p) in the data matrix: (1) Kp means and (2) Kp(p+1)/2 variances and covariances. With K = 5 clusters and p = 20 variables, for example, that is already 100 means plus 1,050 variances and covariances. Perhaps we should be concerned that the number of parameters increases so rapidly with the number of variables p.

A commonly used example will help us understand the equation and the graphical model. The Old Faithful dataset included with the R package mclust illustrates that eruptions from the geyser can come from one of two sources: the brief eruptions in red with shorter waiting times and the extended eruptions in blue with longer waiting periods. There are two possible sources (K=2), and each source generates a bivariate normal distribution of eruption duration and waiting times (N=number of combined red squares and blue dots). Finally, our value of pi can be calculated by comparing the number of red and blue points in the figure.
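
A minimal version of this example in R; the faithful data come with base R, and the code below simply refits the two-component mixture described above.

# hedged sketch: two-component Gaussian mixture for the Old Faithful data
library(mclust)

data(faithful)                        # eruption duration and waiting time
fit <- Mclust(faithful, G = 2)        # K = 2 bivariate normal components
summary(fit)                          # mixing proportions estimate pi
fit$parameters$mean                   # the two mean vectors mu
plot(fit, what = "classification")    # the analogue of the red/blue grouping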



Scalability Issues in High Dimensions

The red and the blue eruptions reside in the same two-dimensional space; every observation is measured on the same two variables, and the clusters differ only in where they fall along duration and waiting time. This would not be the case with topic modeling, for example, where each topic might be defined by a specific set of anchor words that separate it from the rest. Similarly, if we were to cluster by music preference, we would discover segments with very specific awareness and knowledge of various artists. Again, the music preference groupings would be localized within different subspaces anchored by the more popular artists within each genre. Market baskets appear much the same, with each filled with the staples and then those few items that differentiate among segments (e.g., who buys adult diapers?). In each of these cases, as with feature usage and product familiarity, we are forced to collect information across a wide range of measures because each cluster requires its own set of variables to distinguish itself from the others.

These clusters have been created by powerful forces that are stable over time: major events (e.g., moving out on your own, getting married, buying a house, having a child or retiring) and not so major events (e.g., clothes for work, devices to connect to the internet, or what to buy for dinner). Situational needs and social constraints focus one's attention so that any single individual can become familiar with only a small subset of all that we need to measure in order to construct a complete partition. Your fellow cluster members are others who find themselves in similar circumstances and resolve their conflict in much the same way.

As a result, the data matrix becomes high dimensional with many columns, but the rows are sparse with only a few columns of any intensity for any particular individual. We can try to extend the mixture model so that we can maintain model-based clustering with high-dimensional data (e.g., subspace clustering using the R package HDclassif). The key is to concentrate on the smaller intrinsic dimensionality responsible for specific cluster differences.
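
A rough sketch of that subspace-clustering route, assuming the hddc() interface from the HDclassif package; the two simulated segments below each live in their own low-dimensional subspace, and all sizes and settings are illustrative only.

# hedged sketch: model-based subspace clustering with HDclassif
library(HDclassif)

set.seed(202)
# two made-up segments, each driven by 3 latent dimensions in a different
# block of 30 observed variables, plus low-level noise elsewhere
seg1 <- cbind(matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 30), 3, 30),
              matrix(rnorm(100 * 30, sd = 0.1), 100, 30))
seg2 <- cbind(matrix(rnorm(100 * 30, sd = 0.1), 100, 30),
              matrix(rnorm(100 * 3), 100, 3) %*% matrix(rnorm(3 * 30), 3, 30))
x <- rbind(seg1, seg2)                          # 200 rows, 60 columns

fit <- hddc(x, K = 2)                           # clusters plus their subspaces
table(fit$class, rep(1:2, each = 100))          # recovery of the two segments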

Yet, I would argue that nonnegative matrix factorization (NMF) might offer a more productive approach. This blog is filled with posts demonstrating how well NMF works with marketing data, which is reassuring. More importantly, the NMF decomposition corresponds closely with how products are represented in human cognition and memory and how product information is shared through social interactions and marketing communications.

Human decision making adapts to fit the demands of the problem task. In particular, what works with side-by-side comparisons across a handful of attributes for two or three alternatives in a consideration set will not fill our market baskets or help us select a meal from a long list of menu items. This was Herbert Simon's insight. Consumer segments are formed as individuals come to share a common understanding of what is available and what should be preferred. In order to make a choice, we are required to focus our attention on a subspace of all that is available. NMF mimics this simplification process, yielding interpretable building blocks as we attempt to learn the why of consumption.

Wednesday, May 13, 2015

What is Data Science? Can Topic Modeling Help?

Predictive analytics often serves as an introduction to data science, but it may not be the best exemplar given its long history and origins in statistics. David Blei, on the other hand, struggles to define data science through his work on topic modeling and latent Dirichlet allocation. In Episode 10 of Talking Machines, Blei discusses his attempt to design a curriculum for the Data Science Institute at Columbia University. The interview starts at 9:20. If you do not wish to learn about David's career, you can enter the conversation at 13:10. However, you might want to listen all the way to the end because we learn a great deal about data science by hearing how topic modeling is applied across disciplines. Over time, data science will be defined as individuals calling themselves "data scientists" change our current practices.

The R Project for Statistical Computing assists by providing access to a diverse collection of applications across fields with differing goals and perspectives. Programming forces us into the details so that we cannot simply talk in generalities. Thus, topic modeling certainly allows us to analyze text documents, such as newspapers or open-ended survey comments. What about ingredients in food recipes? Or, how does topic modeling help us understand matrix factorization? The ability to "compare and contrast" marks a higher level of learning in Bloom's taxonomy.
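
To make the comparison concrete, here is a minimal topic-model sketch using the topicmodels package and the AssociatedPress document-term matrix that ships with it; the number of topics, the document slice and the seed are arbitrary choices for illustration.

# hedged sketch: LDA on a small slice of the AssociatedPress document-term matrix
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:200, ], k = 5, control = list(seed = 303))
terms(fit, 8)        # top words anchoring each of the five topics
topics(fit)[1:10]    # most likely topic for the first ten documents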

While visiting Talking Machines, you might also want to download the MP3 files for some of the other episodes. The only way to keep up with the increasing number of R packages is to understand how they fit together into some type of organizational structure, which is what a curriculum provides.

You can hear Geoffrey Hinton, Yoshua Bengio and Yann LeCun discuss the history of deep learning in Episodes 5 and 6. If nothing else, the conversation will help you keep up as R adds packages for deep neural networks and representation learning. In addition, we might reconsider our old favorites, like predictive analytics, with a new understanding. For example, what may be predictive in choice modeling might not be the individual features as given in the product description but the holistic representation as perceived by a consumer with a history of similar purchases in similar situations. We would not discover that by estimating separate coefficients for each feature as we do with our current hierarchical Bayesian models. Happily, we can look elsewhere in R for models that can learn such a product representation.