Friday, January 8, 2016

A Data Science Solution to the Question "What is Data Science?"

As this flowchart from Wikipedia illustrates, data science is about collecting, cleaning, analyzing and reporting data. But is it data science or just a "sexed up term" for Statistics (see embedded quote by Nate Silver)? It's difficult to separate the two at this level of generality, so perhaps we need to define our terms.


We begin by making a list of all the stuff that a data scientist might do or know. We are playing a game where the answer is "data scientist" and the questions are "Do they do this?" and "Do they know that?". However, the "this" and the "that" are very specific. For example, "Data is Processed" can range from simple downloading to the complex representation of visual or speech input. What precisely does a data scientist do when they process data that a programmer or a statistician does not do?

To be clear, I am constructing a very long questionnaire that I intend to distribute to individuals calling themselves data scientists along with everyone else claiming that they too do data science, although by another name. A checklist will work in our game of Twenty Questions as long as the list is detailed and exhaustive. You are welcome to add suggestions as comments to this post, but we can start by expanding on each of the boxes in the above data science flowchart.

Since I am a marketing researcher, I am inclined to analyze the resulting data matrix as if it were a shopping cart filled with items purchased from a grocery store or an inventory of downloads from a video or music provider. The rows are respondents, and the columns are all the questions that might be asked to distinguish among all the various players. Let's not include sexy as a column.

You may have guessed that I am headed toward some type of matrix factorization. Can we recognize patterns in the columns that reflect different configurations of study and behavior? Are there communities composed of rows clustered together with similar practices and experiences? R provides most of us who have some experience running factor and cluster analyses with a "doable" introduction to non-negative matrix factorization (NMF). You can think of it as simultaneous clustering of the rows and columns in a data matrix. My blog is filled with examples, none of which are easy, but none of which are incomprehensible or beyond your ability to adapt to your own datasets.

What are we likely to find? Will we discover something like anchor words from topic modeling? For instance, is it necessary to work with multiple datasets from different disciplines to be a data scientist? Would I stop calling myself a marketing scientist if I started working with political polling data? Some argue that one becomes a statistician when they begin consulting with others from divergent fields of study.

What about teaching to students with varied backgrounds in universities or industry? Do we call it data science if one writes and distributes software that others can apply with data across diverse domains? Does proving theorems make one a statistician? How many languages must one know before they are a programmer? What role does computation play when making such discriminations?

What will we learn from dissecting the "corpus" (the detailed body of what we do and know summarized by the boxes in the above data science process)? Extending this analogy, I am recommending "physician, heal thyself": apply data science methodology to answer the "What is Data Science?" question.

Hopefully, we can avoid the hype and the caricature from the popular press (sexiest job of the 21st century). Moreover, I suggest that we resist the tendency to think metaphorically in terms of contrasting ideals. The simple act of comparing statisticians and data scientists shapes our perceptions and leads us to see the two as more dissimilar than suggested by their training and behavior. The distinction may be more nuance than substance, reflecting what excites and motivates rather than what is known or done. The basis for separation may reside in how much personal satisfaction is derived from the subject matter or the programming rather than the computational algorithm or the generative model.

Wednesday, December 23, 2015

Modeling How Consumers Simplify the Purchase Process by Copying Others

A Flower That Fits the Bill
Marketing borrows the biological notion of coevolution to explain the progressive "fit" between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.

The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. Is it the smartphone or the lack of fun that's killing the camera?

The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, "cutting the cord" can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. The Diary of a Cord-Cutter in 2015 offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).

So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see Sociocognitive Dynamics in a Product Market for a comprehensive overview). User-centered product design takes an additional step and creates fictional customers or personas in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.

Identifying the Likely Cord-Cutter

We can ask about your subscriptions (cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might get some frequency of usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kids' shows, HBO or other channel subscriptions, and continue until just before respondents become likely to terminate on-line surveys). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVDs, DVRs, media streamers and other stuff.

A form of reverse engineering guides our data collection. Qualitative research and personal experience give us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and à la carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there is always free broadcast TV captured with an antenna and DVDs that you rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage, so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about Sling TV.

Traditional segmentation will not be able to represent this degree of complexity.

Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model is chosen must be open to the possibility that every non-contradictory arrangement is possible. Yet, every combination will not survive for some will be dominated by others and never achieve a sustainable audience.

We could display this attraction between consumers and offerings as a bipartite graph (Figure 2.9 from Barabasi's Network Science).


Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing B or #5 cuts the graph into independent parts.
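
Both projections fall out of the incidence matrix with a single cross-product each. Here is a minimal sketch in base R. The links are invented to mimic the pattern just described (they are not copied from Barabasi's figure), and the object names B, proj_U and proj_V are my own.

```r
# Hypothetical consumer-by-offering incidence matrix (1 = might purchase).
# The links are made up to mimic the pattern in the text, not taken from the figure.
B <- matrix(0, nrow = 7, ncol = 4,
            dimnames = list(paste0("U", 1:7), c("A", "B", "C", "D")))
links <- list(U1 = "A", U2 = c("A", "B"), U3 = "A", U4 = "B",
              U5 = c("B", "C", "D"), U6 = "C", U7 = "D")
for (u in names(links)) B[u, links[[u]]] <- 1

# Projection V: two offerings are tied when at least one consumer links to both.
proj_V <- crossprod(B)       # t(B) %*% B, counts of shared consumers
diag(proj_V) <- 0

# Projection U: two consumers are tied when they share at least one offering.
proj_U <- tcrossprod(B)      # B %*% t(B), counts of shared offerings
diag(proj_U) <- 0

# Drop consumer U5 and the ties among offerings B, C and D disappear.
proj_V_no_U5 <- crossprod(B[rownames(B) != "U5", ])
diag(proj_V_no_U5) <- 0

print(proj_V)
print(proj_V_no_U5)
```

In this toy version, U5 is the only consumer holding B, C and D together, which is the sense in which a single row can decide whether two groups of offerings are connected.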

Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs - simply do as others do (as in observational learning). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.

In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., The Tipping Point). Regardless, over time, the U's and V's in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.

"I'll Have What She's Having"

It has taken some time to outline how consumers are able to simplify a complex purchase process by modeling the behavior of others. It is such a common experience, although rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you are seated. You refuse a menu and simply ask for the same dish. "I'll Have What She's Having" works as a decision strategy only when I can identify the "she" and the "what" simultaneously.

If we intend to analyze the data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a latent clustering of a bipartite network in the latentnet package and a nonnegative matrix factorization in the NMF package. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.

Stephen Borgatti is a good place to begin with his two-mode social network analysis of the Davis data. The rows are 18 women, the columns are 14 events, and the cells are zero or one depending on whether or not each woman attended each event. The nature of the events has not been specified, but since I am in marketing, I prefer to think of the events as if they were movies seen or concerts attended (i.e., events requiring the purchase of tickets). You will find a latentnet tutorial covering the analysis of this same data as a bipartite network (section 6.3). Finally, a paper by Michael Brusco called "Analysis of two-mode network data using nonnegative matrix factorization" provides a detailed treatment of the NMF approach.

We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E's are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.

The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower dimensional representation that will reproduce the original data table with 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1 with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, most of the women load on one basis component or the other with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.
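
If "reproduce the original data table" sounds abstract, a bare-bones version may help. The sketch below is not the NMF package's code. It is a base R stand-in that uses multiplicative updates for squared-error loss in the style of Lee and Seung, run on a small made-up 0/1 table rather than the Davis data. The names V, W, H, k and eps are my own.

```r
set.seed(2015)

# Toy 0/1 attendance matrix (hypothetical, not the Davis data):
# rows 1-4 attend events 1-3, rows 5-8 attend events 4-6, row 9 attends all.
V <- rbind(matrix(rep(c(1, 1, 1, 0, 0, 0), 4), nrow = 4, byrow = TRUE),
           matrix(rep(c(0, 0, 0, 1, 1, 1), 4), nrow = 4, byrow = TRUE),
           rep(1, 6))

k   <- 2       # number of basis components
eps <- 1e-9    # guards against division by zero
W <- matrix(runif(nrow(V) * k), nrow(V), k)    # basis: rows by components
H <- matrix(runif(k * ncol(V)), k, ncol(V))    # coefficients: components by columns

# Frobenius distance between the data and its reconstruction W %*% H.
frob <- function(W, H) sqrt(sum((V - W %*% H)^2))
err_start <- frob(W, H)

# Multiplicative updates keep W and H nonnegative at every step.
for (iter in 1:500) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
}
err_final <- frob(W, H)

print(round(W %*% H, 2))   # reconstruction of the 0/1 table
print(round(W, 2))         # which rows load on which component
```

The rows of W play the role of the basis heatmap and the columns of H the role of the coefficient heatmap. A row that attends everything, like the last one here, is the toy counterpart of a woman who loads on both components.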


It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at the bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called "What Can We Learn from the Apps on Your Smartphone?". Those wishing to learn more might want to review the link listed at the end of that post in a note.

Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.
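
"Go along with your friends" is about the simplest recommender one can write down. The following is only a sketch under made-up assumptions: the names, the attendance values and the recommend() helper are all hypothetical and have nothing to do with the actual Davis data.

```r
# Toy attendance matrix (hypothetical names and values).
attend <- rbind(
  Ann  = c(1, 1, 1, 0, 0, 0),
  Beth = c(1, 1, 0, 0, 0, 0),
  Cara = c(0, 0, 0, 1, 1, 1),
  Dana = c(0, 0, 1, 1, 1, 1),
  Erin = c(0, 0, 0, 0, 1, 1)
)
colnames(attend) <- paste0("E", 1:6)

# Score each event by the share of the newcomer's friends who attended it
# and return the highest-scoring events.
recommend <- function(attend, friends, top = 3) {
  share <- colMeans(attend[friends, , drop = FALSE])
  head(sort(share, decreasing = TRUE), top)
}

print(recommend(attend, friends = c("Cara", "Dana", "Erin")))
```

A factorization-based recommender would replace the raw friend list with the newcomer's mixture over basis components, but the logic of borrowing choices from similar others is the same.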

Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, an NMF can get very specific and include as many columns as needed (e.g., thousands of movies and even more music recordings). Furthermore, sparsity is not a problem and most of the rows can be empty.

There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, NMF forms building blocks by jointly clustering many rows and columns. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.


R Code to Produce Plots
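# Latent position model of the bipartite Davis network
# (two-dimensional bilinear term plus sociality random effects)
library(latentnet)
data(davis)
davis.fit<-ergmm(davis~bilinear(d=2)+rsociality)
plot(davis.fit,pie=TRUE,rand.eff="sociality",labels=TRUE)

# NMF of the same women-by-events matrix with two basis components
library(NMF)
data_matrix<-as.matrix.network(davis)
fit<-nmf(data_matrix, 2, "lee", nrun=20)
# Heatmaps of the basis (women) and the mixture coefficients (events)
par(mfrow = c(1, 2))
basismap(fit)
coefmap(fit)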

Friday, December 18, 2015

BayesiaLab-Like Network Graphs for Free with R

My screen has been filled with ads from BayesiaLab since I downloaded their free book. Just as I began to have regrets, I received an email invitation to try out their demo datasets. I was especially interested in their perfume ratings data. In this monadic product test, each of 1,321 French women was presented with only one of 11 perfumes and asked to evaluate on a 10-point scale a series of fragrance-related adjectives along with a few user-imagery descriptors. I have added the 6-point purchase intent item to the analysis in order to assess its position in this network.

Can we start by looking at the partial correlation network? I will refer you to my post on Driver Analysis vs. Partial Correlation Analysis and will not repeat that more detailed overview.

Each of the nodes is a variable (e.g., purchase intent is located on the far right). An edge drawn between any two nodes shows the partial correlation between those two nodes after controlling for all the other variables in the network. The color indicates the sign of the partial correlation with green for positive and red for negative. The size of the partial correlation is indicated by the thickness of the edge.
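
For readers who have not computed a partial correlation by hand, here is a small base R sketch using simulated data (the variables x, y and z are made up). It takes the unpenalized route through the inverse of the correlation matrix. EBICglasso adds a lasso penalty on top of this idea so that small partial correlations are shrunk to zero and their edges drop out of the graph.

```r
set.seed(42)
n <- 2000

# Simulated chain x -> y -> z: x and z are related only through y.
x <- rnorm(n)
y <- x + rnorm(n)
z <- y + rnorm(n)
ratings <- cbind(x = x, y = y, z = z)

# Partial correlations from the inverse of the correlation matrix:
# pcor[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j])
P <- solve(cor(ratings))
pcor <- -P / sqrt(outer(diag(P), diag(P)))
diag(pcor) <- 1

# The same quantity from regression residuals: correlate what is left
# of x and z after removing y from each.
pcor_xz_resid <- cor(resid(lm(x ~ y)), resid(lm(z ~ y)))

print(round(cor(ratings), 2))   # x and z are clearly correlated
print(round(pcor, 2))           # but not after controlling for y
```

In a network drawn from these three variables, x and z would be joined to y but not to each other, even though their ordinary correlation is substantial.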

Simply scanning the map reveals the underlying structure of global connections among even more strongly joined regions:

  • Northwest - In Love / Romantic / Passionate / Radiant,
  • Southwest - Bold / Active / Character / Fulfilled / Trust / Free, 
  • Mid-South - Classical / Tenacious / Quality / Timeless / High End, 
  • Mid-North - Wooded / Spiced, 
  • Center - Chic / Elegant / Rich / Modern, 
  • Northeast - Sweet / Fruity / Flowery / Fresh, and
  • Southeast - Easy to Wear / Please Others / Pleasure. 

Unlike the Probabilistic Structural Equation Model (PSEM) in Chapter 8 of BayesiaLab's book, my network is undirected because I can find no justification for assigning causality. Yet, the structure appears to be much the same for the two analyses; for example, compare this partial correlation network with BayesiaLab's Figure 8.2.3.

All this looks very familiar to those of us who have analyzed consumer rating scales. First, we expect negative skew and high collinearity because consumers tend to give ratings in the upper end of the scale and their responses often are highly intercorrelated. In fact, the first principal component accounted for 64% of the total variation, and it would have been higher had Wooded and Spiced been excluded from the battery.
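
The share of variation attributed to the first principal component comes straight from the eigenvalues of the correlation matrix. The sketch below uses simulated ratings with a strong common component. The data, the loadings and the object names are invented for illustration and are not the perfume ratings.

```r
set.seed(1)
n <- 500
p <- 10

# Simulated ratings with a strong common ("halo") component; values are made up.
halo <- rnorm(n)
ratings <- sapply(1:p, function(j) 0.8 * halo + 0.6 * rnorm(n))

# Eigenvalues of a correlation matrix sum to the number of variables,
# so each eigenvalue divided by p is a share of the total variation.
ev <- eigen(cor(ratings), symmetric = TRUE, only.values = TRUE)$values
share_first <- ev[1] / p

print(round(share_first, 2))
```

With the real data, the same two lines applied to the rating columns would give the share for the first component.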

A more cautious researcher might stop with extracting a single dimension and simply concluding that the women either liked or disliked the perfumes they tested and rated everything either uniformly higher or lower. They would speak of halo effects and question whether any more than an overall score could be extracted from the data. Nevertheless, as we see from the above partial correlation network, there is an interpretable local structure even when all the variables are highly interrelated.

I have discussed this issue before in a post about separating global from specific factors. The bifactor model outlined in that post provides another view into the structure of the perfume rating data. What if there were a global factor explaining what we might call the "halo effect" (i.e., uniformly high correlations among all the variables) and then additional specific factors accounting for the extra correlation among different subsets of variables (e.g., the regions in the above partial correlation network map)?

The bifactor diagram shown below may not be pretty with so many variables to be arrayed. However, you can see the high factor loadings radiating out from the global factor g and how the specific factors F1* through F6* provide a secondary level of local structure corresponding to the regions identified in the above network.


I will end with a technical note. The 1321 observations were nested within the 11 perfumes with each respondent seeing only one perfume. Although we would not expect the specific perfume rated to alter the correlations (factorial invariance), mean-level differences between the perfumes could inflate the correlations calculated over the entire sample. In order to test this, I reran the analysis with deviation scores by subtracting the corresponding mean perfume score from each respondent's original ratings. The results were essentially the same.
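
The deviation scores take one line per variable with ave(). This is a sketch with a made-up stand-in for the data file. The column names product, sweet and fresh are hypothetical, since I am not reproducing the layout of the actual CSV here.

```r
set.seed(7)

# Hypothetical stand-in for the perfume file: a product id plus two rating columns.
toy <- data.frame(product = rep(1:3, each = 50),
                  sweet   = rnorm(150, mean = rep(c(5, 6, 7), each = 50)),
                  fresh   = rnorm(150, mean = rep(c(6, 6, 8), each = 50)))
rating_cols <- c("sweet", "fresh")

# Subtract each product's mean from the ratings given to that product.
deviations <- toy
deviations[rating_cols] <- lapply(toy[rating_cols],
                                  function(v) v - ave(v, toy$product))

# Correlations before and after removing mean-level product differences.
print(round(cor(toy[rating_cols]), 2))
print(round(cor(deviations[rating_cols]), 2))
```

In this toy example the two columns are correlated only because the product means move together, so the correlation shrinks after centering. That is the inflation the rerun was checking for.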



R Code Needed to Import CSV File and Produce Plots

# Set working directory and import data file
setwd("C:/directory where file located")
perfume<-read.csv("Perfume.csv", sep=";")
apply(perfume, 2, function(x) table(x,useNA="always"))
 
# Calculates Sparse Partial Correlation Matrix
library("qgraph")
sparse_matrix<-EBICglasso(cor(perfume[,2:48]), n=1321)
qgraph(sparse_matrix, layout="spring", 
       label.scale=FALSE, labels=names(perfume)[2:48],
       label.cex=1, node.width=.5)
 
library(psych)
# Scree plot and bifactor (omega) solution with six specific factors
# Purchase Intent (column 2) Not Included
scree(perfume[,3:48])
omega(perfume[,3:48], nfactors=6)

Thursday, December 10, 2015

Attitudes Modeled as Networks


In case you missed it, Jonas Dalege and his colleagues at the PsychoSystems research group have recently published an article in Psychological Review detailing how attitudes can be represented as network graphs. It is all done using R and a dataset that can be downloaded by registering at the ANES data center. You will find the R code under Scripts and Code in a file called ANES 1984 Analyses. With very minor changes to the size of some labeling, I was able to reproduce the above undirected graph with two R packages: IsingFit and qgraph. As usual when downloading others' files, most of the R code is data munging and deals with assigning labels and transforming ratings into dichotomies.

The above graph represents the conditional independence relationships among node pairings. Specifically, edges are drawn between pairs of nodes only if they are still related after controlling for all the other nodes not in that pair. The center nodes in red are assessments of Ronald Reagan's ability, decency and caring. The groupings of the red nodes seem reasonable; for example, the thicker green edges connect knowledgeable, hard-working, decent and moral. Similarly, in touch, understands and cares are also drawn together by stronger relationships. These evaluative judgments are joined by positive green edges to the respondents' feelings of pride and hope (blue nodes). Moreover, they are pushed away by negative red pathways from darker emotional reactions such as fear, anger and disgust (green nodes).

One should not be surprised to learn that it makes a difference whether the attitudes are scored dichotomously (e.g., yes/no, agree/disagree or present/absent) or using some ordinal rating scale. If it helps, you can think of this as you might the distinction between regression (continuous) and classification (discrete) in statistical learning theory. Thus, when I analyzed a set of mobile phone ratings gathered with 10-point scales, I borrowed a graphical lasso model called EBICglasso from the qgraph R package (see Undirected Graphs When the Causality is Mutual). On the other hand, the Ising model from the IsingFit R package was needed when the data came from yes/no checklists (see The Network Underlying Consumer Perceptions of the European Car Market).


Friday, December 4, 2015

The Topology Underlying the Brand Logo Naming Game: Unidimensional or Local Neighborhoods?

You can find the app on iTunes and Google Play. It's a game of trivial pursuits - here's the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously for there is a Facebook page with cheat sheets for improving one's total score.

Psychometrics Sees Everything as a Test

What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one's level of consumerism ("a preoccupation with and an inclination toward buying consumer goods")? Everyone knows the most popular brands, but only the most involved are familiar with logos of the less publicized products. The question for psychometrics is whether they are able to explain the logos that you can identify correctly by knowing only your level of consumption.

For example, if you were a car enthusiast, then you would be able to name all the car logos in the above table. However, if you did not drive a car or watch commercial television or read car ads in print media, you might be familiar with only the most "popular" logos (i.e., the ones that cannot be avoided because their signage is everywhere you look). We make the assumption that everyone falls somewhere between these two extremes along a consumption continuum and assess whether we can reproduce every individual pattern of answers based solely on their location on this single dimension. Shopping intensity or consumerism is the path, and logo identifications are the sensors along that path.

Specifically, if some number of N respondents played this game, it would not be difficult to rank order the 36 logos in the above table along a line stretching from 0% to 100% correct identification. Next, we examine each respondent, starting by sorting the players from those with the fewest correct identifications to those getting the most right. As shown in an earlier post, a heatmap will reveal the relationship between the ease of identifying each logo and the overall logo knowledge for each individual, as measured by their total score over all the brand logos. [The R code required to simulate the data and produce the heatmap can be found at the end of this post.]

You can begin by noting that blue is correct and red is not. Thus, the least knowledgeable players are in the top rows filled with the most red and the least blue. The logos along the x-axis are sorted by difficulty with the hardest to name on the left and the easiest on the right. In general, better players tend to know the harder logos. This is shown by the formation of a blue triangle as one scans towards the lower, right-hand corner. We call this a Guttman scale, and it suggests that both variation among the logos and the players can be described by a single dimension, which we might call logo familiarity or brand presence. However, one must be wary of suggestive names like "brand presence" for over time we forget that we are only measuring logo familiarity and not something more impactful.
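
The sorting that produces such a triangle is easy to sketch. What follows is a hypothetical simulation in base R with one latent trait per player and one difficulty per logo. It is meant only to show the row and column ordering and is not the code behind the heatmap in this post. All object names are my own.

```r
set.seed(36)
n_players <- 200
n_logos   <- 36

# Hypothetical simulation: one latent trait per player, one difficulty per logo.
trait      <- rnorm(n_players)
difficulty <- seq(-2.5, 2.5, length.out = n_logos)
p_correct  <- plogis(outer(trait, difficulty, "-"))
answers    <- matrix(rbinom(length(p_correct), 1, p_correct),
                     n_players, n_logos,
                     dimnames = list(NULL, paste0("V", 1:n_logos)))

# Rows: fewest correct first (top of the heatmap).
# Columns: lowest percent correct first (hardest logo on the left).
sorted <- answers[order(rowSums(answers)), order(colMeans(answers))]

# Blue = correct (1), red = incorrect (0); drawn only in an interactive session.
if (interactive()) {
  image(t(sorted[nrow(sorted):1, ]), col = c("red", "blue"), axes = FALSE,
        xlab = "logos (hard to easy)", ylab = "players (top = lowest scores)")
}
```

Because the simulated answers depend on a single trait, the blue region fills in toward the lower right, with noise blurring the edge of the triangle.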

Our psychometrician might have analyzed this same data using the R package ltm for latent trait modeling. A hopefully intuitive introduction to item response modeling was posted earlier on this blog. Those results could be summarized with a series of item characteristic curves displaying the relationship between the probability of answering correctly and the underlying trait, labeled ability by default.

As you see in the above plot, the items are arranged from the easiest (V1) to the hardest (V36) with the likelihood of naming the logo increasing as a logistic function of the unobserved consumerism measured as z-scores and called ability because item response theory (IRT) originated in achievement testing. These curves are simple to read and understand. A player with low consumption (e.g., a z-score near -2) has a better than even chance of identifying the most popular logos, but almost zero probability of naming any of the least familiar logos. All those probabilities move up their respective S-curves together as consumers become more involved.
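
The S-curves are nothing more than the logistic function applied to the gap between the player's position and the logo's difficulty. Here is a small sketch with made-up difficulty values (the icc() helper and the three difficulties are hypothetical, not estimates from any data).

```r
# Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty))).
icc <- function(ability, difficulty) plogis(ability - difficulty)

# Hypothetical difficulties for an easy, a middling and a hard logo.
difficulty <- c(easy = -3, middle = 0, hard = 3)

# A low-consumption player (z = -2) and a high-consumption player (z = 2).
low  <- icc(-2, difficulty)
high <- icc( 2, difficulty)
print(round(rbind(low, high), 3))

# Two of the curves, drawn only in an interactive session.
if (interactive()) {
  curve(icc(x, difficulty["easy"]), from = -4, to = 4, ylim = c(0, 1),
        xlab = "ability (z-score)", ylab = "probability correct")
  curve(icc(x, difficulty["hard"]), add = TRUE, lty = 2)
}
```

Under these assumed difficulties the low-consumption player is more likely than not to name the easy logo and very unlikely to name the hard one, and every probability rises as the player moves up the scale.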

In this example the functional form has been specified, for I have plotted the item characteristic curves from the one-parameter Rasch model. However, a specific functional form is not required, and we could have used the R package KernSmoothIRT to fit a nonparametric model. The topology remains a unidimensional manifold, something similar to Hastie's principal curve in the R package princurve. Because the term has multiple meanings, I should note that I am using "topology" in a limited sense in order to refer to the shape of the data and not as in topological data analysis.

To be clear, there must be powerful forces at work to constrain logo naming to a one-dimensional continuum. Sequential skills that build on earlier achievements can often be described by a low-dimensional manifold (e.g., learning descriptive statistics before attempting inference since the latter assumes knowledge of the former). We would have needed a different model had our brands been local so that higher shopping intensity would have produced greater familiarity only for those logos available in a given locality (e.g., country-specific brands without an international presence).


The Meaning of Brand Familiarity Depends on Brand Presence in Local Markets

Now, it gets interesting. We started with players differentiated by a single parameter indicating how far they had traveled along a common consumption path. The path markers or sensors are the logos arrayed in decreasing popularity. Everyone shares a common environment with similar exposures to the same brand logos. Most have seen the McDonald's double-arcing M or the Nike swoosh because both brands have spent a considerable amount of money to buy market presence. On the other hand, Hilton's "blue H in the swirl" with less market presence would be recognized less often (fourth row and first column in the above brand logo table).

But what if market presence and thus logo popularity depended on your local neighborhood? Even international companies have differential presence in different countries, as well as varying concentration within the same country. Spending and distribution patterns by national, regional and local brands create clusters of differential market presence. Everyone does not share a common logo exposure so that each cluster requires its own brand list. That is, consumers reside in localities with varying degrees of brand presence so that two individuals with identical levels of consumption intensity or consumerism would not be familiar with the same brand logos. Consequently, we need to add a second parameter to each individual's position along a path specific to their neighborhood. The psychometrician calls this differential item functioning (DIF), and R provides a number of ways of handling the additional mixture parameter.


Overlapping Audiences in the Marketplace of Attention

You may have anticipated the next step as the topology becomes more complex. We began with one pathway marked with brand logos as our sensors. Then, we argued for a mixture model with groups of individuals living in different neighborhoods with different ordering of the brand logos. Finally, we will end by allowing consumers to belong to more than one neighborhood with whatever degree of belonging they desire. We are describing the kind of fragmentation that occurs when consumers seize control and there is more available to them than they can attend to or consider. James Webster outlines this process of audience formation in his book The Marketplace of Attention.

The topology has changed again. There are just too many brand logos, and unless it becomes a competitive game, consumers will derive diminishing returns from continuing search and they typically will stop sooner rather than later. It helps that the market comes preorganized by providers trying to make the sale. Expert reviews and word of mouth guide the search. Yet, it is the consumer who decides what to select from the seemingly endless buffet. In the process, an individual will see and remember only a subset of all possible brand logos. We need a new model - one that simultaneously sorts both rows and columns by grouping together consumers and the brand logos that they are likely to recognize.

A heatmap may help to explain what can be accomplished when we search for joint clusterings of the rows and columns (also known as biclustering). Using an R package for nonnegative matrix factorization (NMF), I will simulate a data set with such a structure and show you the heatmap. Actually, I will display two heatmaps, one without noise so that you can see the pattern and a second with the same pattern but with added noise. Hopefully, the heatmap without noise will enable you to see the same pattern in the second heatmap with additional distortions.



I kept the number of columns at 36 for comparison with the first one-dimensional heatmap that you saw toward the beginning of this post. As before, blue is one, and red is zero. We discover enclaves or silos in the first heatmap without noise (polarization). The boundaries become fuzzier with random variation (fragmentation). I should note that you can see the biclusters in both heatmaps without reordering the rows and columns only because this is how the simulator generates the data. If you wish to see how this can be done with actual data, I have provided a set of links with the code needed to run an NMF in R at the end of my post on Brand and Product Category Representation.

Finally, although we speak of NMF as a form of simultaneous clustering, cluster memberships are graded rather than all-or-none (soft vs. hard clustering). This yields a very flexible and expressive topology, which becomes clear when we review the three alternative representations presented in this post. First, we saw how some highly structured data matrices can be reproduced using a single dimension with rows and columns both located on the same continuum (IRT). Next, we asked if there might be discrete groups of rows with each row cluster having its own unique ordering of the columns (mixed IRT). Lastly, we sought a model of audience formation with rows and columns jointly collected together into blocks with graded membership for both the rows and the columns (NMF).
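To see what graded membership looks like in numbers, here is a minimal sketch in base R. It is not the algorithm used by the NMF package; it is a bare-bones version of the well-known Lee and Seung multiplicative updates, applied to a hypothetical block-structured binary matrix that I simulate for illustration (three consumer groups, three logo groups of 10, 10 and 16 columns, and recognition probabilities of 0.9 inside a block and 0.1 outside, all assumed values).

```r
set.seed(42)

# Hypothetical toy data: 3 consumer groups by 3 logo groups
row_group   <- rep(1:3, each = 50)
col_group   <- rep(1:3, times = c(10, 10, 16))
p_recognize <- ifelse(outer(row_group, col_group, "=="), 0.9, 0.1)
V <- matrix(rbinom(length(p_recognize), 1, p_recognize),
            nrow = length(row_group))

# Factor V (rows x columns) into W (rows x k) and H (k x columns),
# keeping every entry nonnegative
k   <- 3
W   <- matrix(runif(nrow(V) * k), nrow(V), k)
H   <- matrix(runif(k * ncol(V)), k, ncol(V))
eps <- 1e-9   # guards against division by zero

for (iter in 1:500) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
}

# Remove the arbitrary scale: make each row of H sum to one and
# move that scale into the columns of W (W %*% H is unchanged)
h_scale <- rowSums(H)
H <- H / h_scale
W <- sweep(W, 2, h_scale, "*")

# Graded (soft) membership: each consumer's share in each block
row_membership <- W / rowSums(W)
round(head(row_membership), 2)

# Hard clustering would keep only the largest share
hard_cluster <- max.col(row_membership)
table(row_group, hard_cluster)
```

Each row of row_membership sums to one, so a consumer can belong mostly to one block while still holding a partial share in another. The hard assignment throws that information away.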

Knowledge is organized as a single dimension when learning is formalized within a curriculum (e.g., a course at an educational institution) or is cumulative (e.g., one needs to know addition before learning multiplication). However, coevolving networks of customers and products cannot be described by any one dimension or even a finite mixture of different dimensions. The Internet creates both microgenres and fragmented audiences that require their own topology.

R Code to Produce Figures in this Post

# use psych package to simulate latent trait data
library(psych)
logos<-sim.irt(nvar=36, n=500, mod="logistic")
 
# Sort data by both item mean
# and person total score
item<-apply(logos$items,2,mean)
person<-apply(logos$items,1,sum)
logos$itemsOrd<-logos$items[order(person),order(item)]
 
# create heatmap
# may need to increase size of plots window in R studio
library(gplots)
heatmap.2(logos$itemsOrd, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=TRUE, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)
 
library(ltm)
# two-parameter logistic model
fit<-ltm(logos$items ~ z1)
summary(fit)
 
# item characteristic curves
plot(fit)
 
# constrains slopes to be equal
fit2<-rasch(logos$items)
plot(fit2)
summary(fit2)      
 
library(NMF)
# generate a synthetic dataset with 
# 500 rows and three groupings of
# columns (1-10, 11-20, and 21-36)
n <- 500
counts <- c(10, 10, 16)
 
# no noise
V1 <- syntheticNMF(n, counts, noise=FALSE)
V1[V1>0]<-1
 
# with noise
V2 <- syntheticNMF(n, counts)
V2[V2>0]<-1
 
# produce heatmap with and without noise
heatmap.2(V1, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=TRUE, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)
heatmap.2(V2, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=TRUE, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)

Wednesday, December 2, 2015

The Statistician as Data Action Hero

"If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
Leo Breiman

Breiman ends the opening abstract of his "manifesto" to the statistical community with the above call to action. As a statistics professor at Berkeley, he was not an outsider.

Unfortunately, the warning was not heeded, and now statistics laments, with the President of the American Statistical Association asking "Aren't We Data Science?" Interestingly, one of the comments objects to a suggestion in this editorial that statisticians ought to learn R.

Clearly, someone has missed the point. It is not what you call yourself, who signs your paycheck, or the size of your data sets. It is your openness to disruptive innovation and your desire to make a difference. R provides the interface that makes this possible.

Breiman states it well at the end of an interview from 2001,
So I think if I were advising a young person today, I would have some reservations about advising him or her to go into statistics, but probably, in the end, I would say, “Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.”

Tuesday, November 24, 2015

Statistical Models That Support Design Thinking: Driver Analysis vs. Partial Correlation Networks

We have been talking about design thinking in marketing since Tim Brown's Harvard Business Review article in 2008. It might be easy for the data scientist to dismiss the approach as merely a type of brainstorming for new products or services. Yet, design issues do arise in data visualization where we are concerned with communicating our findings. However, my interest is model selection: Should the analyst select one statistical model over another because the user might find it more helpful in planning interventions or designing new products and services?

For example, the marketing manager who wants to retain current customers seeks guidance from customer satisfaction questionnaires filled with performance ratings and intentions to recommend or purchase again. Motivated by the desire to keep it simple, common practice tends to focus attention on only the most important "causes" of customer retention. As I noted in my first post, Network Visualization of Key Driver Analysis, a more complete picture can be revealed by a correlation graph displaying all the interconnections among all the ratings. The edges or links are colored green or red so that we know if the relationship is positive or negative. The thickness of the path indicates the strength of the correlation. But correlations measure total effects, both those that are direct and those obtained through associations with other ratings.

The designer of intervention strategies aimed at preventing churn could acquire additional insights from the partial correlation graph depicting the effects between all pairs of ratings controlling for all the other ratings in the model. While the correlation map reveals total effects, the partial correlation map removes all but the direct effects. The graph below was created using the R code from my first post to simulate a data set that mimics what is often found when airline passengers complete satisfaction surveys. Once the data were generated, the procedures outlined in my post Undirected Graphs When the Causality is Mutual were followed. The R code is listed at the end of this discussion.
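The difference between total and direct effects can be checked with a toy example in base R. This is only a sketch under assumptions of my own: three hypothetical ratings arranged in a chain (service drives satisfaction, which in turn drives fly_again), with made-up coefficients of 0.7. The partial correlations are obtained from the inverse of the correlation matrix, without the sparsity penalty that EBICglasso adds.

```r
set.seed(2015)
n <- 5000

# Hypothetical chain: service -> satisfaction -> fly_again
service      <- rnorm(n)
satisfaction <- 0.7 * service + rnorm(n)
fly_again    <- 0.7 * satisfaction + rnorm(n)
ratings_toy  <- data.frame(service, satisfaction, fly_again)

# Correlations carry total effects (direct plus indirect)
R <- cor(ratings_toy)

# Partial correlations: rescale the inverse correlation matrix
# and flip the sign of the off-diagonal entries
P    <- solve(R)
pcor <- -cov2cor(P)
diag(pcor) <- 1

round(R, 2)
round(pcor, 2)
```

In the correlation matrix, service and fly_again are clearly related. In the partial correlation matrix that entry should shrink toward zero because the only path between them runs through satisfaction.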


We can pick any node, such as the one labeled "Satisfaction" in the middle of the right-hand side of the figure. A simple way of interpreting this graph is to think of Satisfaction as the dependent variable and the lines radiating from Satisfaction as the weights obtained from the regression of this node on the other 14 ratings. Clearly, overall satisfaction serves as an inclusive summary measure with so many pathways from so many other nodes. Each of the four customer service ratings (below Satisfaction and in light pink) adds its own unique contribution with the greatest impact indicated by the thickest green edge from the Service node. Moreover, Easy Reservation and Ticket Price plus Clean Aircraft with room for people and baggage make incremental improvements in Satisfaction.
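The regression reading of the edges can be verified numerically. The sketch below uses base R and hypothetical data (five correlated ratings built from a shared factor, with names and loadings that I made up). It relies on a standard identity for unpenalized partial correlations: the partial correlation between two variables equals the signed square root of the product of the two regression weights, one from regressing the first variable on all the others and one from regressing the second on all the others. The graph in this post comes from a penalized estimate, so its edges are shrunken versions of these quantities.

```r
set.seed(99)
n <- 2000
p <- 5

# Hypothetical ratings: a shared factor plus unique noise
common <- rnorm(n)
X <- sapply(1:p, function(j) 0.6 * common + rnorm(n))
colnames(X) <- paste0("rating", 1:p)
X <- as.data.frame(scale(X))

# Partial correlations from the inverse correlation matrix
pcor <- -cov2cor(solve(cor(X)))

# Weights from regressing rating1 on the other ratings
b_1_on_j <- coef(lm(rating1 ~ ., data = X))[-1]

# Weight on rating1 when each other rating is the dependent variable
b_j_on_1 <- sapply(2:p, function(j) {
  f <- reformulate(names(X)[-j], response = names(X)[j])
  coef(lm(f, data = X))["rating1"]
})

# Signed geometric mean of the two weights recovers the partial correlation
recovered <- sign(b_1_on_j) * sqrt(b_1_on_j * b_j_on_1)
round(cbind(regression = unname(recovered), partial = unname(pcor[1, -1])), 3)
```

The two columns should agree, which is why it is reasonable to think of the edges leaving a node as the weights from regressing that node on all the others.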

The same process can be repeated for any node. Instead of a driver analysis that narrows our thinking to a single dependent variable and its highest regression weights, the partial correlation map opens us to the possibilities. If the goal were customer retention, then the focus would be on the Fly Again node. Recommend seems to have the strongest link to Fly Again. Can the airline induce repeat purchase by encouraging recommendation? What if frequent flyer miles were offered when others entered your name as a recommender? Such a proposal may not be practical in its current form, but the graph supports this type of design thinking.

Because there are no direct paths from the four service nodes to Fly Again, a driver analysis would miss the indirect connection through Satisfaction. And what of this link between Courtesy and Easy Reservation? Do customers infer a "friendly" personality trait that links their perceptions of the way they are treated when they buy a ticket and when they board the plane? Design thinkers would entertain such a possibility and test the hypothesis. Such "cascaded inferences" fill the graph for those willing to look. Perhaps many small and less costly improvements might combine to have a greater impact than concentrating on a single aspect? Encouraging passengers to check their bags would create more overhead storage without reconfiguring the airplane. Let the design thinking begin!

The discussion ends with the identification of "the most important" in a driver analysis. The network, on the other hand, invites creative thought. Isn't this the point of data science? What can we learn from the data? The answer is a good deal more than can be revealed by the largest coefficient in a single regression equation.

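# EBICglasso() and qgraph() both come from the qgraph package
library(qgraph)
 
# ratings is the simulated airline data set created with
# the R code from the first post (not repeated here)
 
# Calculates Sparse Partial Correlation Matrix
# (n is the sample size passed to EBICglasso)
sparse_matrix<-EBICglasso(cor(ratings), n=1000)
round(sparse_matrix,2)
 
# Plots results with the 15 ratings colored by group
gr<-list(1:4,5:8,9:12,13:15)
node_color<-c("lightgoldenrod","lightgreen","lightpink","cyan")
qgraph(sparse_matrix, fade = FALSE, layout="spring", groups=gr, 
       color=node_color,labels=names(ratings), label.scale=FALSE, 
       label.cex=1, node.width=.5, edge.width=.25, minimum=.05)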