Wednesday, December 23, 2015

Modeling How Consumers Simplify the Purchase Process by Copying Others

A Flower That Fits the Bill
Marketing borrows the biological notion of coevolution to explain the progressive "fit" between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.

The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. Is it the smartphone or the lack of fun that's killing the camera?

The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, "cutting the cord" can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. The Diary of a Cord-Cutter in 2015 offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).

So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see Sociocognitive Dynamics in a Product Market for a comprehensive overview). User-centered product design takes an additional step and creates fictional customers or personas in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.

Identifying the Likely Cord-Cutter

We can ask about your subscriptions (cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might collect some frequency-of-usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kids' shows, HBO or other channel subscriptions, and continue until just before respondents become likely to terminate an online survey). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVD players, DVRs, media streamers and other stuff.

A form of reverse engineering guides our data collection. Qualitative research and personal experience give us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and à la carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there is always free broadcast TV captured with an antenna, along with DVDs that you rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage, so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about Sling TV.

Traditional segmentation will not be able to represent this degree of complexity.

Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model we choose must be open to the possibility that every non-contradictory arrangement can occur. Yet not every combination will survive, for some will be dominated by others and never achieve a sustainable audience.

We could display this attraction between consumers and offerings as a bipartite graph (Figure 2.9 from Barabasi's Network Science).

Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing offering B or consumer U_5 cuts the graph into independent parts.
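For readers who want to see the projections concretely, they fall out of simple matrix algebra: with an incidence matrix B (consumers in rows, offerings in columns), B %*% t(B) links consumers who share offerings (Projection U) and t(B) %*% B links offerings that share consumers (Projection V). A base R sketch with made-up links rather than Barabasi's exact figure:

```r
# Incidence matrix B: 7 consumers (rows) by 4 offerings A-D (columns).
# The links are illustrative, not Barabasi's exact Figure 2.9.
B <- matrix(0, nrow = 7, ncol = 4,
            dimnames = list(paste0("U", 1:7), LETTERS[1:4]))
B["U1", "A"] <- 1
B["U2", "A"] <- 1
B["U3", "A"] <- 1
B["U5", c("A", "B")] <- 1   # U5 bridges the two groups of offerings
B["U4", "B"] <- 1
B["U6", c("B", "C")] <- 1
B["U7", c("C", "D")] <- 1

proj_U <- B %*% t(B)   # consumers linked by shared offerings
proj_V <- t(B) %*% B   # offerings linked by shared consumers

# Off-diagonal cells are edge weights; the diagonal holds node degrees.
proj_V["A", "B"]   # A and B are joined only through consumer U5
```

Delete U5's row from B and recompute: the offering projection splits into disconnected pieces, just as removing a bridging node cuts the bipartite graph above.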

Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs - simply do as others do (as in observational learning). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.

In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., The Tipping Point). Regardless, over time, the U's and V's in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.

"I'll Have What She's Having"

It has taken some time to outline how consumers are able to simplify a complex purchase process by modeling the behavior of others. It is such a common experience, although rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you will be seated. You refuse a menu and simply ask for the same dish. "I'll Have What She's Having" works as a decision strategy only when I can identify the "she" and the "what" simultaneously.

If we intend to analyze the data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a latent clustering of a bipartite network in the latentnet package and a nonnegative matrix factorization in the NMF package. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.

Stephen Borgatti is a good place to begin with his two-mode social network analysis of the Davis data. The rows are 18 women, the columns are 14 events, and the cells are zero or one depending on whether or not each woman attended each event. The nature of the events has not been specified, but since I am in marketing, I prefer to think of the events as if they were movies seen or concerts attended (i.e., events requiring the purchase of tickets). You will find a latentnet tutorial covering the analysis of this same data as a bipartite network (section 6.3). Finally, a paper by Michael Brusco called "Analysis of two-mode network data using nonnegative matrix factorization" provides a detailed treatment of the NMF approach.

We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E's are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.

The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower-dimensional representation that will reproduce the original data table of 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1, with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component, with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, most of the women load on one basis component or the other, with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.
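For those curious about what is happening under the hood, Lee and Seung's multiplicative updates can be sketched in a few lines of base R. This is a bare-bones illustration on toy data, not a substitute for the NMF package, which adds multiple restarts, stopping rules and the heatmap methods shown above:

```r
set.seed(42)
# Toy 0/1 table with two blocks, like the women-by-events data
X <- matrix(0, 10, 8)
X[1:5, 1:4] <- 1
X[6:10, 5:8] <- 1
X <- pmin(X + matrix(rbinom(80, 1, 0.05), 10, 8), 1)  # sprinkle noise

r <- 2                                   # two basis components
W <- matrix(runif(10 * r), 10, r)        # basis matrix (rows)
H <- matrix(runif(r * 8), r, 8)          # mixture coefficients (columns)

for (i in 1:200) {                       # Lee-Seung multiplicative updates
  H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}

# Assign each row and column to its dominant component
row_cluster <- apply(W, 1, which.max)
col_cluster <- apply(H, 2, which.max)
```

The multiplicative form keeps W and H nonnegative throughout, which is what makes the components readable as additive building blocks rather than bipolar factors.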

It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at the bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called "What Can We Learn from the Apps on Your Smartphone?". Those wishing to learn more might want to review the link listed at the end of that post in a note.

Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.
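This "do as your friends do" shortcut is easy to make concrete: score each event by how many of your friends attended it and recommend the winners. A toy sketch of neighborhood-style collaborative filtering (the attendance values are invented, not the Davis data):

```r
# Toy attendance matrix: rows are women, columns are events
attend <- rbind(
  Helen     = c(1, 1, 0, 1),
  Nora      = c(1, 1, 1, 0),
  Sylvia    = c(1, 0, 1, 1),
  Katherine = c(0, 1, 1, 1)
)
colnames(attend) <- paste0("E", 9:12)

# Score events by how many of the newcomer's friends attended each one
recommend <- function(friends, attendance) {
  scores <- colSums(attendance[friends, , drop = FALSE])
  names(sort(scores, decreasing = TRUE))
}

recommend(c("Helen", "Nora"), attend)
```

Real recommender systems weight neighbors by similarity and handle missing data, but the decision shortcut is the same: borrow the choices of people like you.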

Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, an NMF can get very specific and include as many columns as needed (e.g., thousands of movies and even more music recordings). Furthermore, sparsity is not a problem, and most of the cells can be empty.

There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, NMF forms building blocks by jointly clustering many rows and columns. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.

R Code to Produce Plots
# data_matrix holds the 0/1 attendance table (rows = women, columns = events)
library(NMF)
fit <- nmf(data_matrix, 2, "lee", nrun=20)
par(mfrow = c(1, 2))

Friday, December 18, 2015

BayesiaLab-Like Network Graphs for Free with R

My screen has been filled with ads from BayesiaLab since I downloaded their free book. Just as I began to have regrets, I received an email invitation to try out their demo datasets. I was especially interested in their perfume ratings data. In this monadic product test, each of 1,321 French women was presented with only one of 11 perfumes and asked to evaluate on a 10-point scale a series of fragrance-related adjectives along with a few user-imagery descriptors. I have added the 6-point purchase intent item to the analysis in order to assess its position in this network.

Can we start by looking at the partial correlation network? I will refer you to my post on Driver Analysis vs. Partial Correlation Analysis and will not repeat that more detailed overview.

Each of the nodes is a variable (e.g., purchase intent is located on the far right). An edge drawn between any two nodes shows the partial correlation between those two nodes after controlling for all the other variables in the network. The color indicates the sign of the partial correlation, with green for positive and red for negative. The size of the partial correlation is indicated by the thickness of the edge.
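For readers who want the arithmetic, all pairwise partial correlations can be read off the inverse of the correlation matrix: if P = R^-1, the partial correlation between variables i and j is -P[i, j] / sqrt(P[i, i] * P[j, j]). A base R check on simulated data (EBICglasso, used later in the code, adds regularization that shrinks small edges to exactly zero):

```r
set.seed(7)
n <- 500
z <- rnorm(n)             # a shared influence
x <- 0.6 * z + rnorm(n)
y <- 0.6 * z + rnorm(n)
R <- cor(cbind(x, y, z))

P <- solve(R)                             # precision (inverse correlation) matrix
pcor <- -P / sqrt(diag(P) %o% diag(P))    # rescale to partial correlations
diag(pcor) <- 1

# Agrees with the textbook pairwise formula for x and y controlling z
pxy.z <- (R["x", "y"] - R["x", "z"] * R["y", "z"]) /
  sqrt((1 - R["x", "z"]^2) * (1 - R["y", "z"]^2))
```

Because x and y are related only through z, their partial correlation is close to zero even though their simple correlation is not.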

Simply scanning the map reveals the underlying structure: global connections among regions whose members are even more strongly joined to one another:

  • Northwest - In Love / Romantic / Passionate / Radiant,
  • Southwest - Bold / Active / Character / Fulfilled / Trust / Free, 
  • Mid-South - Classical / Tenacious / Quality / Timeless / High End, 
  • Mid-North - Wooded / Spiced, 
  • Center - Chic / Elegant / Rich / Modern, 
  • Northeast - Sweet / Fruity / Flowery / Fresh, and
  • Southeast - Easy to Wear / Please Others / Pleasure. 

Unlike the Probabilistic Structural Equation Model (PSEM) in Chapter 8 of BayesiaLab's book, my network is undirected because I can find no justification for assigning causality. Yet, the structure appears to be much the same for the two analyses, for example, compare this partial correlation network with BayesiaLab's Figure 8.2.3.

All this looks very familiar to those of us who have analyzed consumer rating scales. First, we expect negative skew and high collinearity because consumers tend to give ratings in the upper end of the scale and their responses often are highly intercorrelated. In fact, the first principal component accounted for 64% of the total variation, and it would have been higher had Wooded and Spiced been excluded from the battery.
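The 64% figure is simply the first eigenvalue of the correlation matrix divided by the number of variables, which prcomp reports as the proportion of variance. A sketch on simulated halo-effect ratings, not the perfume data itself:

```r
set.seed(3)
n <- 300
k <- 10
halo <- rnorm(n)   # overall liking shared by every rating
ratings <- sapply(1:k, function(j) 0.8 * halo + 0.6 * rnorm(n))

pca <- prcomp(ratings, scale. = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var[1], 2)   # share of total variation on the first component
```

With a single strong "liking" factor behind every item, the first component soaks up most of the variance, exactly the collinearity pattern described above.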

A more cautious researcher might stop after extracting a single dimension and simply conclude that the women either liked or disliked the perfumes they tested, rating everything either uniformly higher or lower. They would speak of halo effects and question whether anything more than an overall score could be extracted from the data. Nevertheless, as we see from the above partial correlation network, there is an interpretable local structure even when all the variables are highly interrelated.

I have discussed this issue before in a post about separating global from specific factors. The bifactor model outlined in that post provides another view into the structure of the perfume rating data. What if there were a global factor explaining what we might call the "halo effect" (i.e., uniformly high correlations among all the variables) and then additional specific factors accounting for the extra correlation among different subsets of variables (e.g., the regions in the above partial correlation network map)?

The bifactor diagram shown below may not be pretty with so many variables to be arrayed. However, you can see the high factor loadings radiating out from the global factor g and how the specific factors F1* through F6* provide a secondary level of local structure corresponding to the regions identified in the above network.

I will end with a technical note. The 1321 observations were nested within the 11 perfumes with each respondent seeing only one perfume. Although we would not expect the specific perfume rated to alter the correlations (factorial invariance), mean-level differences between the perfumes could inflate the correlations calculated over the entire sample. In order to test this, I reran the analysis with deviation scores by subtracting the corresponding mean perfume score from each respondent's original ratings. The results were essentially the same.
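That deviation-score check amounts to group-mean centering, a one-liner with ave(). Sketched on an invented miniature of the monadic design:

```r
# Toy monadic design: each respondent rates the single perfume she saw
ratings <- data.frame(
  perfume = rep(c("P1", "P2"), each = 4),
  sweet   = c(8, 9, 7, 8, 3, 4, 2, 3)
)

# Deviation scores: subtract the mean of the perfume actually rated
ratings$sweet_dev <- ratings$sweet - ave(ratings$sweet, ratings$perfume)

# Every perfume's mean deviation is now zero, so mean-level differences
# between perfumes can no longer inflate the pooled correlations
tapply(ratings$sweet_dev, ratings$perfume, mean)
```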

R Code Needed to Import CSV File and Produce Plots

# Set working directory and import data file
library(qgraph)   # EBICglasso and qgraph
library(psych)    # omega
setwd("C:/directory where file located")
perfume<-read.csv("Perfume.csv", sep=";")
apply(perfume, 2, function(x) table(x,useNA="always"))
# Calculates sparse partial correlation matrix
sparse_matrix<-EBICglasso(cor(perfume[,2:48]), n=1321)
qgraph(sparse_matrix, layout="spring", 
       label.scale=FALSE, labels=names(perfume)[2:48],
       label.cex=1, node.width=.5)
# Purchase intent not included in the bifactor model
omega(perfume[,3:48], nfactors=6)

Thursday, December 10, 2015

Attitudes Modeled as Networks

In case you missed it, Jonas Dalege and his colleagues at the PsychoSystems research group have recently published an article in Psychological Review detailing how attitudes can be represented as network graphs. It is all done using R and a dataset that can be downloaded by registering at the ANES data center. You will find the R code under Scripts and Code in a file called ANES 1984 Analyses. With very minor changes to the size of some labeling, I was able to reproduce the above undirected graph with two R packages: IsingFit and qgraph. As usual when downloading others' files, most of the R code is data munging and deals with assigning labels and transforming ratings into dichotomies.

The above graph represents the conditional independence relationships among node pairings. Specifically, edges are drawn between pairs of nodes only if they are still related after controlling for all the other nodes not in that pair. The center nodes in red are assessments of Ronald Reagan's ability, decency and caring. The groupings of the red nodes seem reasonable, for example, the thicker green edges connected knowledgeable, hard-working, decent and moral. Similarly, in touch, understands and cares are also drawn together by stronger relationships. These evaluative judgments are joined by positive green edges to the respondents' feelings of pride and hope (blue nodes). Moreover, they are pushed away by negative red pathways from darker emotional reactions such as fear, anger and disgust (green nodes).

One should not be surprised to learn that it makes a difference whether the attitudes are scored dichotomously (e.g., yes/no, agree/disagree or present/absent) or using some ordinal rating scale. If it helps, you can think of this as you might the distinction between regression (continuous) and classification (discrete) in statistical learning theory. Thus, when I analyzed a set of mobile phone ratings gathered with 10-point scales, I borrowed a graphical lasso model called EBICglasso from the qgraph R package (see Undirected Graphs When the Causality is Mutual). On the other hand, the Ising model from the IsingFit R package was needed when the data came from yes/no checklists (see The Network Underlying Consumer Perceptions of the European Car Market).

Friday, December 4, 2015

The Topology Underlying the Brand Logo Naming Game: Unidimensional or Local Neighborhoods?

You can find the app on iTunes and Google Play. It's a game of trivial pursuits - here's the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously for there is a Facebook page with cheat sheets for improving one's total score.

Psychometrics Sees Everything as a Test

What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one's level of consumerism ("a preoccupation with and an inclination toward buying consumer goods")? Everyone knows the most popular brands, but only the most involved are familiar with the logos of less publicized products. The question for psychometrics is whether your level of consumption alone can explain which logos you identify correctly.

For example, if you were a car enthusiast, then you would be able to name all the car logos in the above table. However, if you did not drive a car or watch commercial television or read car ads in print media, you might be familiar with only the most "popular" logos (i.e., the ones that cannot be avoided because their signage is everywhere you look). We make the assumption that everyone falls somewhere between these two extremes along a consumption continuum and assess whether we can reproduce every individual pattern of answers based solely on their location on this single dimension. Shopping intensity or consumerism is the path, and logo identifications are the sensors along that path.

Specifically, if some number of N respondents played this game, it would not be difficult to rank order the 36 logos in the above table along a line stretching from 0% to 100% correct identification. Next, we examine each respondent, starting by sorting the players from those with the fewest correct identifications to those getting the most right. As shown in an earlier post, a heatmap will reveal the relationship between the ease of identifying each logo and the overall logo knowledge for each individual, as measured by their total score over all the brand logos. [The R code required to simulate the data and produce the heatmap can be found at the end of this post.]

You can begin by noting that blue is correct and red is not. Thus, the least knowledgeable players are in the top rows filled with the most red and the least blue. The logos along the x-axis are sorted by difficulty with the hardest to name on the left and the easiest on the right. In general, better players tend to know the harder logos. This is shown by the formation of a blue triangle as one scans towards the lower, right-hand corner. We call this a Guttman scale, and it suggests that both variation among the logos and the players can be described by a single dimension, which we might call logo familiarity or brand presence. However, one must be wary of suggestive names like "brand presence" for over time we forget that we are only measuring logo familiarity and not something more impactful.
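The sorting behind that heatmap is just two order() calls: columns by percent correct (hardest logos on the left) and rows by total score (least knowledgeable players at the top). A minimal sketch on simulated 0/1 responses:

```r
set.seed(5)
# Simulated 0/1 logo identifications: 100 players by 12 logos
n <- 100
k <- 12
prob <- plogis(outer(rnorm(n), seq(-2, 2, length.out = k), "-"))
resp <- (matrix(runif(n * k), n, k) < prob) * 1

col_order <- order(colMeans(resp))  # hardest logos (lowest % correct) first
row_order <- order(rowSums(resp))   # weakest players (fewest correct) first
sorted <- resp[row_order, col_order]

# sorted is now ready for heatmap.2(sorted, Rowv = FALSE, Colv = FALSE, ...)
```

When a single dimension underlies the data, this double sort concentrates the blue (correct) cells into the Guttman triangle described above.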

Our psychometrician might have analyzed this same data using the R package ltm for latent trait modeling. A hopefully intuitive introduction to item response modeling was posted earlier on this blog. Those results could be summarized with a series of item characteristic curves displaying the relationship between the probability of answering correctly and the underlying trait, labeled ability by default.

As you see in the above plot, the items are arranged from the easiest (V1) to the hardest (V36) with the likelihood of naming the logo increasing as a logistic function of the unobserved consumerism measured as z-scores and called ability because item response theory (IRT) originated in achievement testing. These curves are simple to read and understand. A player with low consumption (e.g., a z-score near -2) has a better than even chance of identifying the most popular logos, but almost zero probability of naming any of the least familiar logos. All those probabilities move up their respective S-curves together as consumers become more involved.
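Those probabilities follow directly from the logistic function: under the Rasch model, P(correct) = plogis(ability - difficulty). Plugging in a z-score of -2 against an easy and an obscure logo (the difficulty values are illustrative):

```r
# Rasch item characteristic curve: P(correct) given ability and item difficulty
icc <- function(ability, difficulty) plogis(ability - difficulty)

icc(-2, -3)   # low-consumption player, very popular logo: better than even
icc(-2,  3)   # low-consumption player, obscure logo: near zero
icc( 1, -3)   # involved consumer, very popular logo: near certainty
```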

In this example the functional form has been specified, for I have plotted the item characteristic curves from the one-parameter Rasch model. However, a specific functional form is not required, and we could have used the R package KernSmoothIRT to fit a nonparametric model. The topology remains a unidimensional manifold, something similar to Hastie's principal curve in the R package princurve. Because the term has multiple meanings, I should note that I am using "topology" in a limited sense to refer to the shape of the data and not as in topological data analysis.

To be clear, there must be powerful forces at work to constrain logo naming to a one-dimensional continuum. Sequential skills that build on earlier achievements can often be described by a low-dimensional manifold (e.g., learning descriptive statistics before attempting inference since the latter assumes knowledge of the former). We would have needed a different model had our brands been local so that higher shopping intensity would have produced greater familiarity only for those logos available in a given locality (e.g., country-specific brands without an international presence).

The Meaning of Brand Familiarity Depends on Brand Presence in Local Markets

Now, it gets interesting. We started with players differentiated by a single parameter indicating how far they had traveled along a common consumption path. The path markers or sensors are the logos arrayed in decreasing popularity. Everyone shares a common environment with similar exposures to the same brand logos. Most have seen the McDonald's double-arcing M or the Nike swoosh because both brands have spent a considerable amount of money to buy market presence. On the other hand, Hilton's "blue H in the swirl" with less market presence would be recognized less often (fourth row and first column in the above brand logo table).

But what if market presence and thus logo popularity depended on your local neighborhood? Even international companies have differential presence in different countries, as well as varying concentration within the same country. Spending and distribution patterns by national, regional and local brands create clusters of differential market presence. Everyone does not share a common logo exposure so that each cluster requires its own brand list. That is, consumers reside in localities with varying degrees of brand presence so that two individuals with identical levels of consumption intensity or consumerism would not be familiar with the same brand logos. Consequently, we need to add a second parameter to each individual's position along a path specific to their neighborhood. The psychometrician calls this differential item functioning (DIF), and R provides a number of ways of handling the additional mixture parameter.

Overlapping Audiences in the Marketplace of Attention

You may have anticipated the next step as the topology becomes more complex. We began with one pathway marked with brand logos as our sensors. Then, we argued for a mixture model with groups of individuals living in different neighborhoods with different ordering of the brand logos. Finally, we will end by allowing consumers to belong to more than one neighborhood with whatever degree of belonging they desire. We are describing the kind of fragmentation that occurs when consumers seize control and there is more available to them than they can attend to or consider. James Webster outlines this process of audience formation in his book The Marketplace of Attention.

The topology has changed again. There are just too many brand logos, and unless it becomes a competitive game, consumers will derive diminishing returns from continuing search and they typically will stop sooner rather than later. It helps that the market comes preorganized by providers trying to make the sale. Expert reviews and word of mouth guide the search. Yet, it is the consumer who decides what to select from the seemingly endless buffet. In the process, an individual will see and remember only a subset of all possible brand logos. We need a new model - one that simultaneously sorts both rows and columns by grouping together consumers and the brand logos that they are likely to recognize.

A heatmap may help to explain what can be accomplished when we search for joint clusterings of the rows and columns (also known as biclustering). Using an R package for nonnegative matrix factorization (NMF), I will simulate a data set with such a structure and show you the heatmap. Actually, I will display two heatmaps, one without noise so that you can see the pattern and a second with the same pattern but with added noise. Hopefully, the heatmap without noise will enable you to see the same pattern in the second heatmap with additional distortions.

I kept the number of columns at 36 for comparison with the first one-dimensional heatmap that you saw toward the beginning of this post. As before, blue is one, and red is zero. We discover enclaves or silos in the first heatmap without noise (polarization). The boundaries become fuzzier with random variation (fragmentation). I should note that you can see the biclusters in both heatmaps without reordering the rows and columns only because this is how the simulator generates the data. If you wish to see how this can be done with actual data, I have provided a set of links with the code needed to run a NMF in R at the end of my post on Brand and Product Category Representation.

Finally, although we speak of NMF as a form of simultaneous clustering, the cluster memberships are graded rather than all-or-none (soft vs. hard clustering). This yields a very flexible and expressive topology, which becomes clear when we review the three alternative representations presented in this post. First, we saw how some highly structured data matrices can be reproduced using a single dimension with rows and columns both located on the same continuum (IRT). Next, we asked if there might be discrete groups of rows with each row cluster having its own unique ordering of the columns (mixed IRT). Lastly, we sought a model of audience formation with rows and columns jointly collected together into blocks with graded membership for both the rows and the columns (NMF).

Knowledge is organized as a single dimension when learning is formalized within a curriculum (e.g., a course at an educational institution) or accumulative (e.g., need to know addition before one can learn multiplication). However, coevolving networks of customers and products cannot be described by any one dimension or even a finite mixture of different dimensions. The Internet creates both microgenres and fragmented audiences that require their own topology.

R Code to Produce Figures in this Post

# use psych package to simulate latent trait data
library(psych)    # sim.irt
library(gplots)   # heatmap.2 and redblue
library(ltm)      # latent trait models
library(NMF)      # syntheticNMF
logos<-sim.irt(nvar=36, n=500, mod="logistic")
# Sort data by both item mean
# and person total score
itemsOrd<-logos$items[order(rowSums(logos$items)),
                      order(colMeans(logos$items))]
# create heatmap
# may need to increase size of plots window in R studio
heatmap.2(itemsOrd, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=T, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)
# two-parameter logistic model
fit<-ltm(logos$items ~ z1)
# item characteristic curves
plot(fit)
# constrains slopes to be equal
plot(rasch(logos$items))
# generate a synthetic dataset with 
# 500 rows and three groupings of
# columns (1-10, 11-20, and 21-36)
n <- 500
counts <- c(10, 10, 16)
# no noise
V1 <- syntheticNMF(n, counts, noise=FALSE)
# with noise
V2 <- syntheticNMF(n, counts)
# produce heatmap with and without noise
heatmap.2(V1, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=T, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)
heatmap.2(V2, Rowv=FALSE, Colv=FALSE, 
          dendrogram="none", col=redblue(16), 
          key=T, keysize=1.5, density.info="none", 
          trace="none", labRow=NA)

Wednesday, December 2, 2015

The Statistician as Data Action Hero

"If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
Leo Breiman

Breiman ends the opening abstract of his "manifesto" to the statistical community with the above call to action. As a statistics professor at Berkeley, he was not an outsider.

Unfortunately, the warning was not heeded, and now statistics laments its standing, with the President of the American Statistical Association asking "Aren't We Data Science?". Interestingly, one of the comments objects to a suggestion in this editorial that statisticians ought to learn R.

Clearly, someone has missed the point. It is not what you call yourself, who signs your paycheck, or the size of your data sets. It is your openness to disruptive innovation and your desire to make a difference. R provides the interface that makes this possible.

Breiman states it well at the end of an interview from 2001,
So I think if I were advising a young person today, I would have some reservations about advising him or her to go into statistics, but probably, in the end, I would say, “Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.”

Tuesday, November 24, 2015

Statistical Models That Support Design Thinking: Driver Analysis vs. Partial Correlation Networks

We have been talking about design thinking in marketing since Tim Brown's Harvard Business Review article in 2008. It might be easy for the data scientist to dismiss the approach as merely a type of brainstorming for new products or services. Yet, design issues do arise in data visualization where we are concerned with communicating our findings. However, my interest is model selection: Should the analyst select one statistical model over another because the user might find it more helpful in planning interventions or designing new products and services?

For example, the marketing manager who wants to retain current customers seeks guidance from customer satisfaction questionnaires filled with performance ratings and intentions to recommend or purchase again. Motivated by the desire to keep it simple, common practice tends to focus attention on only the most important "causes" of customer retention. As I noted in my first post, Network Visualization of Key Driver Analysis, a more complete picture can be revealed by a correlation graph displaying all the interconnections among the ratings. The edges or links are colored green or red so that we know whether the relationship is positive or negative. The thickness of each path indicates the strength of the correlation. But correlations measure total effects, both those that are direct and those obtained through associations with other ratings.

The designer of intervention strategies aimed at preventing churn could acquire additional insights from the partial correlation graph depicting the effects between all pairs of ratings controlling for all the other ratings in the model. While the correlation map reveals total effects, the partial correlation map removes all but the direct effects. The graph below was created using the R code from my first post to simulate a data set that mimics what is often found when airline passengers complete satisfaction surveys. Once the data were generated, the procedures outlined in my post Undirected Graphs When the Causality is Mutual were followed. The R code is listed at the end of this discussion.

We can pick any node, such as the one labeled "Satisfaction" in the middle of the right-hand side of the figure. A simple way of interpreting this graph is to think of Satisfaction as the dependent variable and the lines radiating from Satisfaction as the weights obtained from the regression of this node on the other 14 ratings. Clearly, overall satisfaction serves as an inclusive summary measure with so many pathways from so many other nodes. Each of the four customer service ratings (below Satisfaction and in light pink) adds its own unique contribution with the greatest impact indicated by the thickest green edge from the Service node. Moreover, Easy Reservation and Ticket Price plus Clean Aircraft with room for people and baggage make incremental improvements in Satisfaction.
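This node-as-regression reading can be verified directly, since partial correlations and the regression weights for any node both come from the same inverse correlation (precision) matrix. Below is a minimal sketch with simulated data and invented rating names (not the airline dataset from the earlier post):

```r
set.seed(42)
n <- 5000
Service <- rnorm(n)
Price   <- rnorm(n)
Clean   <- 0.4 * Service + rnorm(n)   # related to Satisfaction only via Service
Satisfaction <- 0.5 * Service + 0.3 * Price + rnorm(n)
x <- scale(cbind(Service, Price, Clean, Satisfaction))

P <- solve(cor(x))                     # precision matrix
pcor <- -P / sqrt(diag(P) %o% diag(P)) # partial correlations
diag(pcor) <- 1

# regression of Satisfaction on the other three standardized ratings
b <- coef(lm(x[, "Satisfaction"] ~ x[, -4] - 1))

# identical up to a positive rescaling: beta_j = pcor_j * sqrt(P_jj / P_ii)
round(pcor["Satisfaction", 1:3], 2)
round(b, 2)
```

The Service and Price edges come out clearly positive, while the Clean edge, whose correlation with Satisfaction is entirely indirect, shrinks toward zero once the other ratings are controlled.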

The same process can be repeated for any node. Instead of a driver analysis that narrows our thinking to a single dependent variable and its highest regression weights, the partial correlation map opens us to the possibilities. If the goal was customer retention, then the focus would be on the Fly Again node. Recommend seems to have the strongest link to Fly Again. Can the airline induce repeat purchase by encouraging recommendation? What if frequent flyer miles were offered when others entered your name as a recommender? Such a proposal may not be practical in its current form, but the graph supports this type of design thinking.

Because there are no direct paths from the four service nodes to Fly Again, a driver analysis would miss the indirect connection through Satisfaction. And what of this link between Courtesy and Easy Reservation? Do customers infer a "friendly" personality trait that links their perceptions of the way they are treated when they buy a ticket and when they board the plane? Design thinkers would entertain such a possibility and test the hypothesis. Such "cascaded inferences" fill the graph for those willing to look. Perhaps many small and less costly improvements might combine to have a greater impact than concentrating on a single aspect? Encouraging passengers to check their bags would create more overhead storage without reconfiguring the airplane. Let the design thinking begin!

A driver analysis ends the discussion once "the most important" driver has been identified. The network, on the other hand, invites creative thought. Isn't this the point of data science? What can we learn from the data? The answer is a good deal more than can be revealed by the largest coefficient in a single regression equation.

# qgraph provides EBICglasso and the network plot; the ratings data frame
# plus the gr and node_color groupings come from the R code in the two
# earlier posts cited above
library(qgraph)
# calculates sparse partial correlation matrix
sparse_matrix<-EBICglasso(cor(ratings), n=1000)
# plots results
qgraph(sparse_matrix, fade = FALSE, layout="spring", groups=gr, 
       color=node_color, labels=names(ratings), label.scale=FALSE, 
       label.cex=1, node.width=.5, edge.width=.25, minimum=.05)

Sunday, November 8, 2015

Mutually Exclusive Clusters Are Boxes within Which Consumers No Longer Fit

Sometimes we force our categories to be mutually exclusive and exhaustive even as the boundaries are blurring rapidly.

Of course, I am speaking of cluster analysis and whether it makes sense to force everyone into one and only one of a set of discrete boxes. Diversity is diverse and requires a more expressive representation than possible in a game of twenty questions. "Is it this or that?" is inadequate when it is a little of this and a lot of that.

How do you classify a love seat? Is it a small sofa or a large chair for two people? Natural categories are not defined by all-or-none criteria. Not all birds possess the same degree of "birdness," for the classification of birds has a graded structure. Some birds are more "bird" than other birds, and some mammals (e.g., bats) might be thought of as birds because they look and behave more like birds than typical mammals.
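Graded membership is easy to demonstrate in R with fuzzy clustering. The sketch below uses fanny() from the cluster package (one of R's recommended packages) on two simulated groups; the data and cluster count are invented for illustration:

```r
library(cluster)
set.seed(3)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # 50 points near (0, 0)
           matrix(rnorm(100, mean = 3), ncol = 2))   # 50 points near (3, 3)
f <- fanny(x, k = 2)
# each row gets a degree of membership in both clusters, summing to 1:
# "a little of this and a lot of that" rather than an all-or-none box
head(round(f$membership, 2))
```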

In an earlier post, I issued a warning that clusters may appear more separated in textbooks than in practice. I urged that we consider other representations for individual variation. Archetypes work because they reside at the periphery of the objects to be described with all the other species in-between (e.g., birds of prey, household pets in cages, winter migrants, evil dark birds, and white birds of peace). In marketing this is the realm of fans, fanatics and -philes in which it is so easy to visualize the extreme users and name everyone else as hybrid combinations of pure types. R makes the analysis doable. Moreover, with dimensions defined as contrasting ideals (liberal vs. conservative), archetypal analysis mimics hidden dimensions such as those from factor analysis and item response theory.
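A minimal sketch of that analysis, assuming the archetypes package is installed; the three "pure type" coordinates and the simulated consumers below are invented for illustration:

```r
library(archetypes)
set.seed(1)
pure <- rbind(c(9, 1), c(1, 9), c(9, 9))   # three invented extreme users
w <- matrix(runif(300 * 3), 300, 3)
w <- w / rowSums(w)                        # mixing weights sum to 1
x <- w %*% pure + matrix(rnorm(600, sd = 0.3), 300, 2)
a <- archetypes(x, k = 3, verbose = FALSE)
parameters(a)            # recovered archetypes lie near the three pure types
head(coef(a, "alphas"))  # each consumer expressed as a blend of pure types
```

The alphas are the interesting output for marketing: everyone else is described as a hybrid combination of the extreme users at the periphery.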

The Forces behind Diversity Do Not Yield Disjoint Clusters

I have found it best to begin with the forces generating diversity. Why doesn't one size fit all? As a marketer, I look toward consumer demand - whether it originates out of individual usage, preference or need or whether it is manufactured by providers introducing new products and services. I discover that demand is seldom contained within a single box. Some cable television viewers want to watch lots of sports, so let us place them in the sport segment. But they also want movies-on-demand, so we need four segments filling in the 2x2 for sports and movies-on-demand. Hopefully, they do not want to hear the business news because now we have 8 segments in our 2x2x2. As the consumer acquires greater control, the old segmentation scheme seems more and more forced as if we are holding onto a simpler world with everyone crammed into one of only a few silos.

Recommendation systems adopt a different metaphor with the marketplace partitioned by user collaboration and the chunking of offerings into micro-genres. Heterogeneity is seen as coevolving networks of consumers and what they buy. A handful of mutually exclusive boxes will not work in a market that is increasingly fragmenting.

When there are so many alternatives within easy reach of the internet, no consumer can attend to it all. The rows in our data matrix, each containing information from a single individual, become both longer with more options and sparser with limited attention. If we want to play in the marketplace of attention, we will need more than K-means or finite mixture models. Researchers will require even more - the type of easy access provided by R packages such as archetypes and NMF.

With the appropriate statistical model, one can uncover such generating processes from the data matrix with consumers as rows and what they want or like in the columns. Using figurative language I have called this approach the "ecology of data matrices" and have suggested the need for biclustering. Yet, there is opposition since we are so accustomed to dividing the analysis of rows and columns into two separate procedures. Most cluster analyses input all the columns to calculate distances among the rows. Factor analysis starts with column correlations computed from data including every row. Biclustering, on the other hand, cares about sorting the cells into simultaneous groupings of row-column combinations. The data matrix gets divided into subspaces, possibly overlapping, with this community of consumers similar on only those groupings of variables.
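A minimal sketch of that joint row-column decomposition, reusing the syntheticNMF() call from the first code block in this post (assumes the NMF package is installed):

```r
library(NMF)
set.seed(2)
V <- syntheticNMF(500, c(10, 10, 16))   # same synthetic data as above
fit <- nmf(V, rank = 3)
W <- basis(fit)  # rows: membership of each consumer in the latent blocks
H <- coef(fit)   # columns: membership of each offering in the same blocks
# nothing forces a single box: a row can load on several blocks at once
head(round(W / rowSums(W), 2))
```

Reading W and H together gives the row-column subspaces described above; a consumer with sizable loadings on two blocks belongs, in part, to both.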

The simplicity of boxes will not work with consumers in control and more offerings available than any one buyer can attend to or know about. An underlying structure remains, but one defined by the joint interaction of rows and columns. Consumers with common needs and experiences are attracted to the same purchase channels and learn about offerings from the same sources. This simultaneous clustering of rows and columns yields the blocks from which consumers customize their own personal consumption patterns. Nothing forces the consumer to select only one building block. In fact, the opposite is generally true, for most of us play multiple roles (e.g., items purchased for work and play, self and others, necessities and gifts, and the list goes on). To capture such common practices, we need a clustering technique that does not impose a simplistic representation forcing consumers into boxes within which they no longer fit.

Sunday, November 1, 2015

Clustering Customer Satisfaction Ratings

We run our cluster analysis with great expectations, hoping to uncover diverse segments with contrasting likes and dislikes of the brands they use. Instead, too often, our K-means analysis returns the above graph of parallel lines indicating that the pattern of high and low ratings is the same for everyone but at different overall levels. The data come from the R package semPLS and look very much like what one sees with many customer satisfaction surveys.

I will not cover any specifics about the data, but instead refer you to earlier discussions of this dataset, first in a post showing its strong one-dimensional structure using biplots and later in an example of an undirected graph or Markov network displaying brand associations.

We will begin with the mean ratings for the four lines in the above graph and include a relatively small fifth segment in the last column with a different narrative. Ordering the 23 items from lowest to highest mean scores over the entire sample makes both the table below and the graph above easier to read.

[Table: mean ratings for the 23 items by cluster, with columns ordered from "not at all" through "a little" to "a lot" and the fifth segment in the last column]

You can pick any row in this table and see that the first four segments with 90% of the customers are ordered the same. The first cluster is simply not at all happy with their mobile phone provider. They give the lowest Buy Again and Recommend ratings. In fact, with only two small exceptions, they uniformly give the lowest scores. For every row the second column is larger (note the two discrepancies already mentioned), followed by an even bigger third column, and then the most favorable fourth column. Successful brands have loyal customers, and at least one out of five customers in these data has "a lot" of love with a mean rating of 9.7 on a 10-point scale.

You can see why I labeled these four segments with names suggesting differing levels of attraction. Each group has the same profile, as can be seen in the largely parallel lines on our graph. The good news for our providers is that only 9% are definitely at risk. The bad news is that another 10% like the product and the service but will not buy again, perhaps because the price is not perceived as fair (see their graph below with a dip for the second variable, Buy Again, and a much lower score than expected for the first variable, Fair Price, given the elevation of the rest of the curve).

Some might argue that what we are seeing is merely a measurement bias reflecting a propensity among raters to use different portions of the scale. Does this mean that 90% of the customers have identical experiences but give different ratings due to some scale-usage predisposition? If it is a personality trait, does this mean that they use the same range of scale values to rate every brand and every product? Would we have seen individuals using the same narrow range of scores had the items been more specific and more likely to show variation, for example, if they had asked about dropped calls and dead zones rather than network quality?

Given questions without any concrete referent, the uniform patterns of high and low ratings across the items are shaped by a network of interconnected perceptions resulting from a common technology and a shared usage of that technology. In addition, one overhears a good deal of discussion about the product category in the media and from word-of-mouth so that even a nonuser might be aware of the pros and cons. As a result, we tend to find a common ordering of ratings with some customers loving it all "a lot" and others "not at all." Unless customers can provide a narrative (e.g., "I like the product and service, but it costs too much"), they will all reproduce the same profile of strengths and weaknesses at varying levels of overall happiness. That is, satisfied or not, almost everyone seems to rate value and price fairness lower than they score overall quality and satisfaction.

Finally, my two prior posts cited earlier may seem to paint a somewhat contradictory picture of customer satisfaction ratings. On the one hand, we are likely to find a strong first principal component indicating the presence of a single dimension underlying all the ratings. Customer satisfaction tends to be one-dimensional so that we might expect to observe the four clusters with parallel lines of ratings. Satisfaction falls for everyone as features and services become more difficult for any brand to deliver. On the other hand, the graph of the partial correlations suggests a network of interconnected pairs of ratings after controlling for all the remaining items. One can identify regions with stronger relationships among items measuring quality, product offering, corporate citizenship, and loyalty.
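The one-dimensional side of this picture is easy to check. The sketch below simulates 23 ratings driven by a single satisfaction factor (invented parameters, not the semPLS mobi data) and inspects the share of variance captured by the first principal component:

```r
set.seed(4)
n <- 250
f <- rnorm(n)   # a single overall satisfaction factor
ratings <- sapply(1:23, function(j) 0.8 * f + rnorm(n, sd = 0.6))
pc <- prcomp(ratings, scale. = TRUE)
# with one dominant factor, PC1 explains well over half the variance
summary(pc)$importance[2, 1]
```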

Both appear to be true. Ratings with the highest partial intercorrelations form local neighborhoods with thicker edges in our undirected graph. Although some nodes are more closely related, all the variables are still connected either directly with a pairwise edge or indirectly through a separating node. Everything is correlated, but some are more correlated than others.

R code needed to reproduce these tables and plots.

# semPLS supplies the mobi customer satisfaction data;
# descriptive names for the 23 ratings were assigned in the earlier posts
library(semPLS)
data(mobi)
# kmeans with 5 clusters and 25 random starts
# (column 9 is excluded, leaving the 23 brand ratings)
kcl5<-kmeans(mobi[,-9], 5, nstart=25)
# cluster profiles (23 ratings by 5 clusters) and sizes
cluster_profile<-t(kcl5$centers)
kcl5$size
# row and column means
row_mean<-apply(cluster_profile, 1, mean)
col_mean<-apply(cluster_profile, 2, mean)
# cluster profiles ordered by row means with columns sorted so that
# clusters 1-4 have increasing means and the last column is low
# only for buyagain & fairprice
# Warning: random start values likely to yield a different order,
# so the odd cluster may need to be moved to column 5 by hand
sorted_profile<-cluster_profile[order(row_mean), order(col_mean)]
# plots for first 4 clusters
matplot(sorted_profile[,-5], type = c("b"), pch="*", lwd=3,
        xlab="23 Brand Ratings Ordered by Average for Total Sample",
        ylab="Average Ratings for Each Cluster")
title("Loves Me Little, Some, A Lot, Not At All")
# plot of last cluster
matplot(sorted_profile[,5], type = c("b"), pch="*", lwd=3,
        xlab="23 Brand Ratings Ordered by Average for Total Sample",
        ylab="Average Ratings for Last Cluster")
title("Got to Switch, Costs Too Much")