tag:blogger.com,1999:blog-5900438979616461142024-03-19T15:19:57.513-07:00Engaging Market ResearchEngaging respondents with interesting measurement tasks.
Involving clients with visualizations of actionable findings.
Challenging marketing research to provide a theoretical basis for its measurement procedures.Unknownnoreply@blogger.comBlogger124125tag:blogger.com,1999:blog-590043897961646114.post-43240635233375355662016-06-05T17:07:00.001-07:002016-06-05T17:07:18.027-07:00Building the Data Matrix for the Task at Hand and Analyzing Jointly the Resulting Rows and Columns<div class="separator" style="clear: both; text-align: center;">
</div>
Someone decided what data ought to go into the matrix. They placed the objects of interest in the rows and the features that differentiate among those objects into the columns. Decisions were made either to collect information or to store what was gathered for other purposes (e.g., data mining).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2-DkcFfKc37thD-dMxDh9es6GrlFyqwHt7XTLDNk0CaXSVz7aCHDXJNbG7sr-zdhlig_8iAj28RdNklz3zmjA2iLCPWwcsMRiKvxJK5nz9BJeX9o-m1KhmAItCu00oQASls9aTmg1rFs/s1600/beer.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2-DkcFfKc37thD-dMxDh9es6GrlFyqwHt7XTLDNk0CaXSVz7aCHDXJNbG7sr-zdhlig_8iAj28RdNklz3zmjA2iLCPWwcsMRiKvxJK5nz9BJeX9o-m1KhmAItCu00oQASls9aTmg1rFs/s640/beer.jpg" width="640" /></a></div>
<br />
A set of mutually constraining choices determines what counts as an object and a feature. For example, in the above figure, the beer display case in the convenience store contains a broad assortment of brands in both bottles and cans grouped together by volume and case sizes. The features varying among these beers are the factors that consumers consider when making purchases for specific consumption situations: beers to bring on an outing, beers to have available at home for self and guests, and beers to serve at a casual or formal party.<br />
<br />
These beers and their associated attributes would probably not have made it into a taste test at a local brewery. Craft beers come with their own set of distinguishing features: spicy, herbal, sweet and burnt (see the data set called beer.tasting.notes in the <a href="https://cran.r-project.org/web/packages/ExPosition/index.html">ExPosition R package</a>). Yet preference still depends on the consumption occasion, for a craft beer needs to be paired with food, and desires vary with the time of day, ongoing activities, and drinking companions. What gets measured and how it is measured depends on the situation and the accompanying reasons for data collection (the task at hand).<br />
<br />
It should be noted that data matrix construction may evolve over time and change with context. Rows and columns are added and deleted to keep the two in sync: one beer suggests the inclusion of other beers in the consideration set, which in turn requires inserting new columns into the data matrix to differentiate among the additional beers.<br />
<br />
Given such mutual dependencies, why would we want to perform separate analyses of the rows (e.g., cluster analysis) and the columns (e.g., factor analysis), as if row (column) relationships were invariant regardless of the columns (rows)? That is, the similarity of the rows in a cluster analysis depends on the variables populating the columns, and the correlations among the variables are calculated as shared covariation over the rows. The correlation between two variables changes with the individuals sampled (e.g., restriction of range), and the similarity between two observations fluctuates with the basis of comparison (e.g., features included in the object representation). Given the shared determination of what gets included as the rows and the columns of our data matrix, why wouldn't we turn to a joint scaling procedure such as correspondence analysis (when the cells are counts) or biplots (when the cells are ratings or other continuous measures)?<br />
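As a minimal base-R sketch of such a joint scaling (the beer names, attributes, and ratings below are invented for illustration), a biplot places the rows as points and the columns as arrows in one map:

```r
# Hypothetical beers (rows) rated on hypothetical attributes (columns)
ratings <- matrix(c(7, 2, 4,
                    6, 3, 5,
                    2, 7, 3,
                    1, 6, 2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("Lager", "Pilsner", "Stout", "Porter"),
                                  c("Crisp", "Roasted", "Hoppy")))
pca <- prcomp(ratings, scale. = TRUE)   # principal components of the ratings
biplot(pca)                             # rows as points, columns as arrows, jointly
```

Correspondence analysis plays the same joint-scaling role when the cells hold counts rather than ratings.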
<br />
We can look at such a joint scaling using that beer.tasting.notes data set mentioned earlier in this post. Some 38 craft beers were rated on 16 different flavor profiles. Although the ExPosition R package contains functions that will produce a biplot, I will run the analysis with FactoMineR because it may be more familiar and <a href="http://joelcadwell.blogspot.com/2014/07/using-biplots-to-map-cluster-solutions.html">I have written about this R package previously</a>. The biplot below could have been shown in a single graph, but it might have been somewhat crowded with 38 beers and 16 ratings on the same page. Instead, by default, FactoMineR plots the rows on the Individuals factor map and the columns on the Variables factor map. Both plots come from a single principal component analysis with rows as points and columns as arrows. The result is a vector map with directional interpretation, so that points (beers) have perpendicular projections onto arrows (flavors) that reproduce as closely as possible the beer's rating on the flavor variable.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1n_KvcaljjA-W0Uo0otnL9nOD8i-ABLRk2pRzCDNYWwG76NUSobd0HXXijB6Nt6rXS8OHHxAKw6nZxiq_7pP3_ndyeQqCkMiz9wGK06DANNI8m8lNyCum8uB-Yz_-bGrN_ud3gJJTK7A/s1600/craft+beers.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1n_KvcaljjA-W0Uo0otnL9nOD8i-ABLRk2pRzCDNYWwG76NUSobd0HXXijB6Nt6rXS8OHHxAKw6nZxiq_7pP3_ndyeQqCkMiz9wGK06DANNI8m8lNyCum8uB-Yz_-bGrN_ud3gJJTK7A/s1600/craft+beers.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwokzP8jENwe_svfKH8hFB51iy5bj0AkAVhoUPFEiprTrIXAm3qmKlIHQatIMoyDTAn8V3Hcoz3iJPIPLi6OhvAfHF3b6LDoNjheW-KZWHJjvFP_AXgYLFEADAjGeOmIxngvqHLqyUOW4/s1600/craft+beer+attributes.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwokzP8jENwe_svfKH8hFB51iy5bj0AkAVhoUPFEiprTrIXAm3qmKlIHQatIMoyDTAn8V3Hcoz3iJPIPLi6OhvAfHF3b6LDoNjheW-KZWHJjvFP_AXgYLFEADAjGeOmIxngvqHLqyUOW4/s1600/craft+beer+attributes.jpg" /></a></div>
<br />
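The directional reading of a biplot can be checked numerically. In this base-R sketch (random stand-in data, not the actual tasting notes), the rank-2 product of row scores and column loadings is the two-dimensional approximation to the centered and scaled ratings that the perpendicular projections are reproducing:

```r
set.seed(42)
X <- matrix(rnorm(38 * 16), 38, 16)          # stand-in for 38 beers x 16 flavor ratings
pca <- prcomp(X, scale. = TRUE)
Z <- scale(X)                                # what the biplot is approximating
approx2 <- pca$x[, 1:2] %*% t(pca$rotation[, 1:2])  # points projected onto arrows
full    <- pca$x %*% t(pca$rotation)         # all components reproduce Z exactly
max(abs(full - Z))                           # numerically zero
```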
Beer tasting provides a valuable illustration of the process by which we build the data matrix for the task at hand. Beer drinkers are not as likely as those drinking wine to discuss the nuances of taste so that you might not know the referents for some of these flavor attributes. As a child, you learned which citrus fruits are bitter and sweet: lemons and oranges. <a href="https://www.utdallas.edu/~herve/abdi-dcav2015_Beer.pdf">How would you become a beer expert?</a> You would need to participate in a supervised learning experience with beer tastings and flavor labels. For instance, you must taste a sweet and a bitter beer and be told which is which in order to learn the first dimension of sweet vs. bitter as shown in the above biplot.<br />
<br />
Obviously, I cannot serve you beer over the Internet, but here is a <a href="http://www.beeradvocate.com/beer/profile/345/42546/">description of Wild Devil</a>, a beer positioned on the Individuals Factor Map in the direction pointed to by the labels Bitter, Spicy and Astringent on the Variables Factor Map.<br />
<blockquote class="tr_bq">
<span style="background-color: #fcfcff; color: #141414; font-family: "tahoma" , "geneva" , "helvetica" , "arial" , sans-serif; font-size: 12px; line-height: 15.24px;">It’s arguable that our menacingly delicious HopDevil has always been wild. With bold German malts and whole flower American hops, this India Pale Ale is anything but prim. But add a touch of brettanomyces, the unruly beast responsible for the sharp tang and deep funk found in many Belgian ales, and our WildDevil emerges completely untamed. Floral, aromatic hops still leap from this amber ale, as a host of new fermentation flavor kicks up notes of citrus and pine.</span></blockquote>
Hopefully, you can see how the descriptors in our columns acquire their meaning through an understanding of how beers are brewed. It is an acquired taste, with the acquisition made easier by knowledge of the manufacturing process. What beers should be included as rows? We will need to sample all those variations in the production process that create flavor differences. The result is a relatively long feature list. Constraints are needed and supplied by the task at hand. Thankfully, the local brewery has a limited selection, as shown in the very first figure, so that all we need are ratings that will separate the five beers in the tasting tray.<br />
<br />
In this case the data matrix contains ratings, which appear to be the averages from two researchers. Had there been many respondents using a checklist, we could have kept counts and created a similar joint map with correspondence analysis. I showed how this might be done in <a href="http://joelcadwell.blogspot.com/2014/06/the-unavoidable-instability-of-brand.html">an earlier post mapping the European car market</a>. That analysis demonstrated how the Prius has altered market perceptions. Specifically, Prius has come to be so closely associated with the attributes Green and Environmentally Friendly that our perceptual map required a third dimension anchored by Prius pulling together Economical with Green and Environmental. In retrospect, had the Prius <u>not</u> been included in the set of objects being rated, would we have included Green and Environmentally Friendly in the attribute list?<br />
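For count data of that kind, correspondence analysis can be computed directly from a singular value decomposition of the standardized residuals. This base-R sketch uses a small brand-by-attribute checklist table in which the brands, attributes, and counts are all hypothetical:

```r
counts <- matrix(c(30, 5, 2,
                   6, 25, 4,
                   3, 6, 28),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("BrandA", "BrandB", "BrandC"),
                                 c("Green", "Sporty", "Economical")))
P <- counts / sum(counts)                    # correspondence matrix
r.mass <- rowSums(P)                         # row masses
c.mass <- colSums(P)                         # column masses
S <- diag(1/sqrt(r.mass)) %*% (P - r.mass %o% c.mass) %*% diag(1/sqrt(c.mass))
sv <- svd(S)                                 # joint decomposition of rows and columns
row.coords <- diag(1/sqrt(r.mass)) %*% sv$u %*% diag(sv$d)  # row principal coordinates
col.coords <- diag(1/sqrt(c.mass)) %*% sv$v %*% diag(sv$d)  # column principal coordinates
sum(sv$d^2)                                  # total inertia = chi-square / n
```

Plotting the first two columns of row.coords and col.coords together gives the joint map.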
<br />
Now, we can return to our convenience store with a new appreciation of the need to analyze jointly the rows and the columns of our data matrix. After all, this is what consumers do when they form consideration sets and attend only to the differentiating features for this smaller number of purchase alternatives. <a href="http://joelcadwell.blogspot.com/2014/09/attention-is-preference-foundation.html">Human attention enables us to find the cooler in the convenience store and then ignore most of the stuff in that cooler</a>. Preference learning involves knowing how to build the data matrix for the task at hand and thus simplifying the process by focusing only on the most relevant combination of objects and features.<br />
<br />
Finally, I have one last caution. There is no reason to insist on homogeneity over all the cells of our data matrix. In the post referenced in the last paragraph, <a href="http://joelcadwell.blogspot.com/2014/09/attention-is-preference-foundation.html">Attention is Preference</a>, I used nonnegative matrix factorization (NMF) to jointly partition the rows and columns of the cosmetics market into subspaces of differing familiarity with brands (e.g., brands that might be sold in upscale boutiques or department stores versus more downscale drug stores). Do not be confused by the fact that the rows are consumers rather than objects such as car models and beers. The same principles apply whatever goes in the rows. The question is whether all the rows and columns can be represented in a single common space or whether the data matrix is better described as a number of subspaces where blocks of rows and columns have higher density with sparsity across the rest of the cells. Here are the attributes that separate these beers, and there are the attributes that separate those beers (e.g., the twist-off versus pry-off cap tradeoff applies only to beer in bottles).<br />
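The post linked above used an NMF package; as a minimal sketch of the underlying idea, the Lee-Seung multiplicative updates can be written in a few lines of base R on random nonnegative data (the matrix sizes and values here are arbitrary stand-ins):

```r
set.seed(1)
V <- matrix(runif(20 * 8), 20, 8)   # e.g., 20 consumers by 8 brands, nonnegative
k <- 2                              # number of subspaces sought
W <- matrix(runif(20 * k), 20, k)   # row (consumer) loadings
H <- matrix(runif(k * 8), k, 8)     # column (brand) loadings
err.start <- sum((V - W %*% H)^2)
for (i in 1:200) {                  # multiplicative updates keep W and H nonnegative
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}
err.end <- sum((V - W %*% H)^2)     # squared error never increases
```

Blocks of rows and columns loading on the same column of W and row of H define the denser subspaces; near-zero loadings supply the sparsity.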
<br />
<br />
<br />
<b>R Code to Run the Analysis:</b><br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># ExPosition must be installed</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>ExPosition<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># load data set and examine its structure</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span><span style="color: blue;">"beer.tasting.notes"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/str"><span style="color: #003399; font-weight: bold;">str</span></a><span style="color: #009900;">(</span>beer.tasting.notes<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/head"><span style="color: #003399; font-weight: bold;">head</span></a><span style="color: #009900;">(</span>beer.tasting.notes$data<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># how ExPosition runs the biplot</span>
pca.beer <- epPCA<span style="color: #009900;">(</span>beer.tasting.notes$data<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># FactoMiner must be installed</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/FactoMineR">FactoMineR</a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># how FactoMineR runs the biplot</span>
fm.beer<-PCA<span style="color: #009900;">(</span>beer.tasting.notes$data<span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-590043897961646114.post-24964091378506671472016-05-25T12:48:00.000-07:002016-05-25T12:48:01.439-07:00Using Support Vector Machines as Flower Finders: Name that Iris!<div class="separator" style="clear: both; text-align: left;">
Nature field guides are filled with pictures of plants and animals that teach us what to look for and how to name what we see. For example, a flower finder might display pictures of different iris species, such as the illustrations in the plot below. You have in hand your own specimen from your garden, and you carefully compare it to each of the pictures until you find a good-enough match. The <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">pictures come from Wikipedia</a>, but the data used to create the plot are from <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html">the R dataset iris</a>: sepal and petal length and width measured on 150 flowers equally divided across three species.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
I have lifted the code directly from the <a href="http://www.inside-r.org/node/57517">svm function in the R package e1071</a>.<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/e1071">e1071</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/attach"><span style="color: #003399; font-weight: bold;">attach</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">## classification mode</span>
<span style="color: #666666; font-style: italic;"># default with factor response:</span>
model <- svm<span style="color: #009900;">(</span>Species ~ .<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a> = <a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># alternatively the traditional interface:</span>
x <- <a href="http://inside-r.org/r-doc/base/subset"><span style="color: #003399; font-weight: bold;">subset</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #339933;">,</span> select = -Species<span style="color: #009900;">)</span>
y <- Species
model <- svm<span style="color: #009900;">(</span>x<span style="color: #339933;">,</span> y<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/print"><span style="color: #003399; font-weight: bold;">print</span></a><span style="color: #009900;">(</span>model<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/summary"><span style="color: #003399; font-weight: bold;">summary</span></a><span style="color: #009900;">(</span>model<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># test with train data</span>
pred <- <a href="http://inside-r.org/r-doc/stats/predict"><span style="color: #003399; font-weight: bold;">predict</span></a><span style="color: #009900;">(</span>model<span style="color: #339933;">,</span> x<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># (same as:)</span>
pred <- <a href="http://inside-r.org/r-doc/stats/fitted"><span style="color: #003399; font-weight: bold;">fitted</span></a><span style="color: #009900;">(</span>model<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Check accuracy:</span>
<a href="http://inside-r.org/r-doc/base/table"><span style="color: #003399; font-weight: bold;">table</span></a><span style="color: #009900;">(</span>pred<span style="color: #339933;">,</span> y<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># compute decision values and probabilities:</span>
pred <- <a href="http://inside-r.org/r-doc/stats/predict"><span style="color: #003399; font-weight: bold;">predict</span></a><span style="color: #009900;">(</span>model<span style="color: #339933;">,</span> x<span style="color: #339933;">,</span> decision.values = <span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/attr"><span style="color: #003399; font-weight: bold;">attr</span></a><span style="color: #009900;">(</span>pred<span style="color: #339933;">,</span> <span style="color: blue;">"decision.values"</span><span style="color: #009900;">)</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span>:<span style="color: #cc66cc;">4</span><span style="color: #339933;">,</span><span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;"># visualize (classes by color, SV by crosses):</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cmdscale"><span style="color: #003399; font-weight: bold;">cmdscale</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/dist"><span style="color: #003399; font-weight: bold;">dist</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">5</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a> = <a href="http://inside-r.org/r-doc/base/as.integer"><span style="color: #003399; font-weight: bold;">as.integer</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/datasets/iris"><span style="color: #003399; font-weight: bold;">iris</span></a><span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
pch = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"o"</span><span style="color: #339933;">,</span><span style="color: blue;">"+"</span><span style="color: #009900;">)</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">1</span>:<span style="color: #cc66cc;">150</span> %in% model$index + <span style="color: #cc66cc;">1</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
We will focus on the last block of R code that generates the metric multidimensional scaling (MDS) of the distances separating the 150 flowers calculated from sepal and petal length and width (i.e., <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html">the dist function</a> applied to the first four columns of the iris data). Species plays no role in the MDS with the flowers positioned in a two-dimensional space in order to reproduce the pairwise Euclidean distances. However, species is projected onto the plot using color, and the observations acting as support vectors are indicated with plus signs (+).<br />
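A quick base-R check shows how faithfully the two-dimensional configuration reproduces the original pairwise distances:

```r
d.orig <- dist(iris[, 1:4])          # Euclidean distances in the original feature space
mds <- cmdscale(d.orig, k = 2)       # metric MDS: 150 flowers in two dimensions
cor(as.vector(d.orig), as.vector(dist(mds)))  # close to 1 for the iris data
```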
<br />
The setosa flowers are represented by black circles and black plus signs. These points are separated along the first dimension from the versicolor species in red and virginica in green. The second dimension, on the other hand, seems to reflect within-species variation that does not differentiate among the three iris types.<br />
<br />
We should recall that the dist function calculates pairwise distances in the original space without any kernel transformation. The support vectors, on the other hand, were identified from the svm function using a radial kernel and then projected back onto the original observation space. Of course, we can change the kernel, which defaults to "radial" as in this example from the R package. A linear kernel may do just as well with the iris data, as you can see by adding kernel="linear" to the svm function in the above code.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIMi8kyF0m1g5HplEXX2YcaIxWYpmuOSNydBaFfQc4GM8T8tnK0LRMDbWKDZsi30v6ePnBcHfWefVsqiVFD50-doR1FQifYHitj7glj6aBe7qrYKXXQIRdFV8HbFBvoJaqEU3CQW1-bL8/s1600/iris+mds+plot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIMi8kyF0m1g5HplEXX2YcaIxWYpmuOSNydBaFfQc4GM8T8tnK0LRMDbWKDZsi30v6ePnBcHfWefVsqiVFD50-doR1FQifYHitj7glj6aBe7qrYKXXQIRdFV8HbFBvoJaqEU3CQW1-bL8/s640/iris+mds+plot.jpg" width="640" /></a></div>
<br />
It appears that we do not need all 150 flowers in order to identify the iris species. We know this because the svm function correctly classifies over 97% of the flowers with 51 support vectors (also called "landmarks" as noted in my last post <a href="http://joelcadwell.blogspot.com/2016/05/the-kernel-trick-in-support-vector.html">Seeing Similarity in More Intricate Dimensions</a>). The majority of the +'s are located between the two species with the greatest overlap. I have included the pictures so that the similarity of the red and green categories is obvious. This is where there will be confusion, and this is where the only misclassifications occur. If your iris is a setosa, your identification task is relatively easy and over quickly. But suppose that your iris resembles those in the cluster of red and green pluses between versicolor and virginica. This is where the finer distinctions are being made.<br />
<br />
By design, this analysis was kept brief to draw an analogy between support vector machines and the field guides that we have all used to identify unknown plants and animals in the wild. Hopefully, it was a useful comparison that will help you understand how we classify new observations by measuring their distances in a kernel metric from a more limited set of support vectors (a type of global positioning with a minimal number of landmarks or exemplars as satellites).<br />
<br />
When you are ready with your own data, you can view the videos from Chapter 9 of <a href="http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/">An Introduction to Statistical Learning with Applications in R</a> to get a more complete outline of all the steps involved. My intent was simply to disrupt the feature mindset that relies on the cumulative contributions of separate attributes (e.g., the relative impact of each independent variable in a prediction equation). As objects become more complex, we stop seeing individual aspects and begin to bundle features into types or categories. We immediately recognize the object by its feature configuration, and these exemplars or landmarks become the new basis for our support vector representation.<br />
<br />Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-590043897961646114.post-40338458776733884812016-05-23T18:42:00.000-07:002016-05-23T18:46:57.054-07:00The Kernel Trick in Support Vector Machines: Seeing Similarity in More Intricate DimensionsThe "kernel" is the seed or the essence at the heart or the core, and the kernel function measures distance from that center. In the following <a href="https://en.wikipedia.org/wiki/Kernel_(statistics)">example from Wikipedia</a>, the kernel is at the origin and the different curves illustrate alternative depictions of what happens as we move away from zero.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEoWwxiLCJ7t-FCNOJbj0AOVXGQGY9TMC4EUyp8xOpfgBLZyND4GmG3MEy7DnvIUM-wSWA29bZ87t6oTZsUfqqVzS5c6LLIQFpIPWr9woAtEkRWMbSpnyj-jMfxexEyMb9oR82_T_96xo/s1600/590px-Kernels.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="287" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEoWwxiLCJ7t-FCNOJbj0AOVXGQGY9TMC4EUyp8xOpfgBLZyND4GmG3MEy7DnvIUM-wSWA29bZ87t6oTZsUfqqVzS5c6LLIQFpIPWr9woAtEkRWMbSpnyj-jMfxexEyMb9oR82_T_96xo/s400/590px-Kernels.svg.png" width="400" /></a></div>
<br />
At what temperature do you prefer your first cup of coffee? If we center the scale at that temperature, how do we measure the effects of deviations from the ideal level? The uniform kernel function tells us that deviation from the optimum makes little difference as long as it stays within a certain range. You might feel differently; perhaps there is a constant rate of disappointment as you move away from the best temperature in either direction (a triangular kernel function). For most of us, however, satisfaction takes the form of exponential decay, with a Gaussian kernel describing our preferences as we deviate from the very best.<br />
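These three weighting schemes are easy to write down. A minimal base-R sketch of the uniform, triangular, and Gaussian kernels, where u is the deviation from the ideal in bandwidth units:

```r
uniform    <- function(u) 0.5 * (abs(u) <= 1)            # flat within a range, zero outside
triangular <- function(u) (1 - abs(u)) * (abs(u) <= 1)   # constant rate of decline
gaussian   <- function(u) exp(-u^2 / 2) / sqrt(2 * pi)   # exponential decay, never zero
u <- c(0, 0.5, 1, 2)
rbind(uniform = uniform(u), triangular = triangular(u), gaussian = round(gaussian(u), 3))
```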
<br />
<a href="https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_density_estimation.pdf">Everitt and Hothorn</a> show how it is done in R for density estimation. Of course, the technique works with any variable, not just preference or distance from the ideal. Moreover, the logic is the same: give greater weight to closer data. And how does one measure closeness? You have many alternatives, as shown above, varying from tolerant to strict. What counts as the same depends on your definition of sameness. With human vision the person retains their identity and our attention as they walk from the shade into the sunlight; my old camera has a different kernel function and fails to keep track or focus correctly. In addition, when the density being estimated is multivariate, you have the option of differential weighting of each variable so that some aspects will count a great deal and others can be ignored.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Now, with the preliminaries over, we can generalize the kernel concept to support vector machines (SVMs). First, we will expand our feature space because the optimal cup of coffee depends on more than its temperature (e.g., preparation method, coffee bean storage, fineness and method of grind, ratio of coffee to water, and don't forget the type of bean and its processing). You tell me the profile of two coffees using all those features that we just enumerated, and I will calculate their pairwise similarity. If their profiles are identical, the two coffees are the same and centered at zero. But if they are not identical, how important are the differences? Finally, we ought to remember that differences are measured with respect to satisfaction; that is, two equally pleasing coffees may have different profiles, but those differences are not relevant.<br />
<br />
<a href="http://joelcadwell.blogspot.com/2016/05/the-mad-hatter-explains-support-vector.html">As the Mad Hatter explained in the last post</a>, SVMs live in the observation space, in this case, among all the different cups of coffees. We will need a data matrix with a bunch of coffees for taste testing in the rows and all those features as columns, plus an additional column with a satisfaction rating or at least a thumbs-up or thumbs-down. Keeping it simple, we will stay with a classification problem distinguishing good from bad coffees. Can I predict your coffee preference from those features? Unfortunately, individual tastes are complex, and that strong coffee may be great for some but only when hot. What of those who don't like strong coffee? It is as if we had multiple configurations of interacting nonlinear features with many more dimensions than can be represented in the original feature space.<br />
<br />
Our training data from the taste tests might contain actual coffees near each of these configurations differentiating the good and the bad. These are the support vectors of SVMs, what Andrew Ng calls "landmarks" in <a href="https://www.youtube.com/watch?v=2NHSmv8n1-U">his Coursera course</a> and <a href="http://www.holehouse.org/mlclass/12_Support_Vector_Machines.html">his more advanced class at Stanford</a>. In this case, the support vectors are actual cups of coffee that you can taste and judge as good or bad. Chapter 9 of <a href="http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/">An Introduction to Statistical Learning</a> will walk you through the steps, including how to run the R code, but you might leave without a good intuitive grasp of the process.<br />
<br />
It would help to remember that a logistic regression equation and the coefficients from a discriminant analysis yield a single classification dimension when you have two groupings. What happens when there are multiple ways to succeed or fail? I can name several ways to prepare different types of coffee, and I am fond of them all. Similarly, I can recall many ways to ruin a cup of coffee. Think of each as a support vector from the training set and the classification function as a weighted similarity to instances from this set. If a new test coffee is similar to one called "good" from the training data, we might want to predict "good" for this one too. The same applies to coffees associated with the "bad" label.<br />
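As a toy base-R sketch of this weighted-similarity logic (the coffee features, values, and landmarks below are invented for illustration), a new cup is classified by its Gaussian kernel similarity to labeled training cups:

```r
rbf <- function(x, landmark, gamma = 1) exp(-gamma * sum((x - landmark)^2))
# two hypothetical training cups acting as landmarks (features on a 0-1 scale)
landmarks <- rbind(good = c(temp = 0.80, strength = 0.60),
                   bad  = c(temp = 0.20, strength = 0.90))
new.cup <- c(temp = 0.75, strength = 0.55)
sims <- apply(landmarks, 1, function(l) rbf(new.cup, l))  # similarity to each landmark
names(which.max(sims))    # predict the label of the most similar landmark
```

An actual SVM weights each support vector's similarity by a learned coefficient rather than simply taking the nearest one.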
<br />
The key is the understanding that the features from our data matrix are no longer the dimensions underlying this classification space. We have redefined the basis in terms of landmarks or support vectors. New coffees are placed along dimensions defined by previous training instances. <a href="https://www.youtube.com/watch?v=B8J4uefCQMc">As Pedro Domingos notes</a> (at 33 minutes into the talk), the algorithm relies on analogy, not unlike case-based reasoning. Our new dimensions are more intricate compressed representations of the original features. If this reminds you of <a href="http://joelcadwell.blogspot.com/2012/07/archetypal-analysis.html">archetypal analysis</a>, then you may be on the right track or at least not entirely lost.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-82306651920587122852016-05-09T15:14:00.000-07:002016-05-09T15:48:59.583-07:00The Mad Hatter Explains Support Vector Machines<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDpfs8QWYM5s6ZNxAj4BEDwkBoIsHTxnq_mbnCyS49I1sC4AZYIQyBm2nNXNh5tDMv0i38nbMLSK9lwkBMYeXMpOwshEqd5TKvMUN0Tx40QN08-kYoxFujtgw5dTRjJZvj72WgT16EWG8/s1600/MadlHatter.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDpfs8QWYM5s6ZNxAj4BEDwkBoIsHTxnq_mbnCyS49I1sC4AZYIQyBm2nNXNh5tDMv0i38nbMLSK9lwkBMYeXMpOwshEqd5TKvMUN0Tx40QN08-kYoxFujtgw5dTRjJZvj72WgT16EWG8/s1600/MadlHatter.png" /></a></div>
"Hatter?" asked Alice, "Why are support vector machines so hard to understand?" Suddenly, before you can ask yourself why Alice is studying machine learning in the middle of the 19th century, the Hatter disappeared. "Where did he go?" thought Alice as she looked down to see a compass painted on the floor below her. Arrows pointed in every direction with each one associated with a word or phrase. One arrow pointed toward the label "Tea Party." Naturally, Alice associated Tea Party with the Hatter, so she walked in that direction and ultimately found him.<br />
<br />
"And now," the Hatter said while taking Alice's hand and walking through the looking glass. Once again, the Hatter was gone. This time there was no compass on the floor. However, the room was filled with characters, some that looked more like Alice and some that seemed a little closer in appearance to the Hatter. With so many blocking her view, Alice could see clearly only those nearest to her. She identified the closest resemblance to the Hatter and moved in that direction. Soon she saw another that might have been his relative. Repeating this process over and over again, she finally found the Mad Hatter.<br />
<br />
Alice did not fully comprehend what the Hatter told her next. "The compass works only when the input data separates Hatters from everyone else. When it fails, you go through the looking glass into the observation space where all we have is resemblance or similarity. Those who know me will recognize me and all that resemble me. Try relying on a feature like red hair and you might lose your head to the Red Queen. We should have some tea with Wittgenstein and discuss family resemblance. It's derived from features constructed out of input that gets stretched, accentuated, masked and recombined in the most unusual ways."<br />
<br />
The Hatter could tell that Alice was confused. Reassuringly, he added, "It's an acquired taste that takes some time. We know we have two classes that are not the same. We just can't separate them from the data as given. You have to look at it in just the right way. I'll teach you the Kernel Trick." The Mad Hatter could not help but laugh at his last remark - looking at it in just the right way could be the best definition of support vector machines.<br />
<br />
Note: <a href="http://www.r-bloggers.com/the-5th-tribe-support-vector-machines-and-caret/">Joseph Rickert's post in R Bloggers</a> shows you the R code to run support vector machines (SVMs) along with a number of good references for learning more. My little fantasy was meant to draw some parallels between the linear algebra and human thinking (see <a href="http://www.bcs.rochester.edu/people/raizada/papers/ShahbaziRaizadaEdelman_kernels_JMathPsych2016.pdf">Similarity, Kernels, and the Fundamental Constraints on Cognition</a> for more). Besides, Tim Burton will soon be releasing his movie Alice Through the Looking Glass, and the British Library is celebrating 150 years since the publication of Alice in Wonderland. Both Alice and SVMs invite you to go beyond the data as inputted and derive "impossible" features that enable differentiation and action in a world at first unseen.<br />
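The kernel trick can be seen on synthetic data: points inside a circle cannot be split from points outside it by any straight line in the original two features, yet an RBF-kernel SVM separates them easily once the data are "looked at in just the right way" (a Python sketch with generated data, standing in for the R code in Rickert's post):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly inseparable data: inside-the-circle vs. outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)
print(linear.score(X, y))  # poor: no separating line exists
print(rbf.score(X, y))     # near perfect after the kernel trick
```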
<br />Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-590043897961646114.post-34109607821217563912016-04-03T16:20:00.000-07:002016-04-06T23:52:46.480-07:00When Choice Modeling Paradigms Collide: Features Presented versus Features PerceivedWhat is the value of a product feature? Within a market-based paradigm, the answer is the difference between revenues with and without the feature. A product can be decomposed into its features, each feature can be assigned a monetary value by including price in the feature list, and the final worth of the product is a function of its feature bundle. The entire procedure is illustrated in an article using the function rhierMnlMixture from the R package bayesm (<a href="http://aede.osu.edu/sites/aede/files/imce/files/Seminars/Economic_Valuation_of_Product_Features_100813.pdf">Economic Valuation of Product Features</a>). Although much of the discussion concentrates on a somewhat technical distinction between willingness-to-pay (WTP) and willingness-to-buy (WTB), I wish to focus instead on the digital camera case study in Section 6 beginning on page 30. If you have questions concerning how you might run such an analysis in R, I have two posts that might help: <a href="http://joelcadwell.blogspot.com/2013/03/lets-do-some-hierarchical-bayes-choice.html">Let's Do Some Hierarchical Bayes Choice Modeling in R</a> and <a href="http://joelcadwell.blogspot.com/2014/11/lets-do-some-more-hierarchical-bayes.html">Let's Do Some More Hierarchical Bayes Choice Modeling in R</a>.<br />
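Short of the full hierarchical Bayes machinery in rhierMnlMixture, the core of a choice model, recovering part-worths from observed choices, can be sketched with a plain aggregate multinomial logit on simulated data (a Python sketch; the features, part-worths, and sample sizes are invented, and there is no respondent heterogeneity here):

```python
import numpy as np

# Simulate 500 choice sets of 4 profiles, each described by two features
# (say, swivel_screen and price), then recover the part-worths by
# gradient ascent on the multinomial logit log-likelihood.
rng = np.random.default_rng(1)
true_beta = np.array([1.0, -0.8])          # part-worths used to simulate
X = rng.normal(size=(500, 4, 2))           # 500 sets x 4 options x 2 features
util = X @ true_beta
p = np.exp(util) / np.exp(util).sum(axis=1, keepdims=True)
choice = np.array([rng.choice(4, p=pi) for pi in p])

beta = np.zeros(2)
for _ in range(500):
    u = X @ beta
    prob = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)
    # gradient: chosen features minus probability-weighted average features
    grad = (X[np.arange(500), choice]
            - (prob[..., None] * X).sum(axis=1)).mean(axis=0)
    beta += 0.5 * grad
print(beta.round(2))  # close to the true part-worths (1.0, -0.8)
```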
<br />
As you can see, the study varies seven factors, including price, but the goal is to estimate the economic return from including a swivel screen on the back of the digital camera. Following much the same procedure as that outlined in those two choice modeling posts mentioned in the last paragraph, each respondent saw 16 hypothetical choice sets created using a fractional factorial experimental design. There was a profile associated with each of the four brands, and respondents were asked first to select the one they most preferred and then to indicate whether they would buy their most preferred brand at a given price.<br />
<br />
The term "dual response" has become associated with this approach, and several choice modelers have adopted the technique. If the value of the swivel screen is well-defined, it ought not matter how you ask these questions, and that <a href="http://link.springer.com/article/10.1007%2Fs11002-006-7943-8">seems to be confirmed by some in the choice modeling community</a>. However, outside the laboratory and in the field, <a href="http://www.haas.berkeley.edu/groups/online_marketing/facultyCV/papers/nelson_commitment.pdf">commitment or stated intention is the first step toward behavior change</a>. Furthermore, <a href="http://www.sciencedirect.com/science/article/pii/S1057740804701333">the mere-measurement effect in survey research</a> demonstrates that questioning by itself can alter preferences. Within the purchase context, consumers do not waste effort deciding which of the rejected alternatives is the least objectionable by attending to secondary features after failing to achieve consideration on one or more deal breakers (i.e., the best product they would not buy). Actually, dual response originates as a sales technique because encouraging commitment to one of the offerings increases the ultimate purchase likelihood.<br />
<br />
We have our first collision. Order effects are everywhere. It is one of the most robust findings in measurement. The political pollster wants to know how big a sales tax increase could be passed in the next election. You get a different answer when you ask about a one-quarter percent increase followed by one-half percent than when you reverse the order. Perceptual contrast is unavoidable so that one-half seems bigger after the one-quarter probe. I do not need to provide a reference because everyone is aware of order as one of the many context effects. The feature presented is not the feature perceived.<br />
<br />
Our second collision occurs from the introduction of price as just another feature, as if in the marketplace no one ever asks why one brand is more expensive than another. We ask because price is both a sacrifice with a negative impact and a signal of quality with a positive weight. In fact, <a href="http://www.na-businesspress.com/JABE/LarsonRB_Web16_1_.pdf">as one can see from the pricing literature</a>, there is nothing simple or direct about price perception. Careful framing may be needed (e.g., maintaining package size but reducing the amount without changing price). Otherwise, the reactions can be quite dramatic for price increases can trigger attributions concerning the underlying motivation and can generate a strong emotional response (e.g., <a href="http://bear.warrington.ufl.edu/weitz/mar7786/Articles/price%20fairness.pdf">price fairness</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjToNlmuO7YkKmgbvg5Uj5_GVeP9BBC6ZmYl-c8yZo5f-yvaoe_Mh-oF0ngXjPWoIpIrjaAF_x_EPbc-VQKnaowxRmTsRGVA7PXWL7m7oRzk9XFFWtkxrC9Fr0JkKUrj-hZjDZGCHG6LMc/s1600/gas+station.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjToNlmuO7YkKmgbvg5Uj5_GVeP9BBC6ZmYl-c8yZo5f-yvaoe_Mh-oF0ngXjPWoIpIrjaAF_x_EPbc-VQKnaowxRmTsRGVA7PXWL7m7oRzk9XFFWtkxrC9Fr0JkKUrj-hZjDZGCHG6LMc/s400/gas+station.jpg" width="400" /></a></div>
<br />
At times, the relationship between the feature presented and the feature perceived can be more nuanced. It would be reasonable to vary gasoline prices in terms of cost per unit of measurement (e.g., dollars per gallon or euros per liter). Yet, the SUV driver seems to react in an all-or-none fashion only when some threshold on the cost to fill up their tank has been exceeded. What is determinant is not the posted price but the total cost of the transaction. Thus, price sensitivity is a complex nonlinear function of cost per unit depending on how often one fills up with gasoline and the size of that tank. In addition, the pain at the pump depends on other factors that fail to make it into a choice set. How long will the increases last? Are higher prices seen as fair? What other alternatives are available? Sometimes we have no option but to live with added costs, reducing our dissonance by altering our preferences.<br />
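The threshold idea can be written down directly; the tank sizes and the $75 pain threshold below are invented numbers used only to make the point:

```python
# The driver reacts not to the posted price per gallon but to whether the
# total cost of filling the tank crosses a pain threshold -- all or none.
def fill_up_pain(price_per_gallon, tank_gallons, threshold=75.0):
    total = price_per_gallon * tank_gallons
    return total > threshold

# Same posted price, very different reactions:
print(fill_up_pain(3.50, 12))  # compact car: $42, no reaction -> False
print(fill_up_pain(3.50, 30))  # SUV: $105, threshold exceeded -> True
```

Price sensitivity is thus a step function of the posted price, with the step located differently for every tank size.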
<br />
We see none of this reasoning in choice modeling where the alternatives are described as feature bundles outside of any real context. The consumer "plays" the game as presented by the modeler. Repeating the choice exercise with multiple choice sets only serves to induce a "feature-as-presented" bias. Of course, there are occasions when actual purchases look like choice models. We can mimic repetitive purchases from the retail shelf with a choice exercise, and the same applies to online comparison shopping among alternatives described by short feature lists as long as we are careful about specifying the occasion and buyers do not search for user comments.<br />
<br />
User comments bring us back to the usage occasion, which tends to be ignored in choice modeling. Reading the comments, we note that one customer reports the breakage of the hinge on the swivel screen after only a few months. Is the swivel screen still an advantage or a potential problem waiting to occur? We are not buying the feature, but the benefit that the feature promises. This is the scene of another paradigm collision. The choice modeler assumes that features have value that can be elicited by merely naming the feature. They simplify the purchase task by stripping out all contextual information. Consequently, the resulting estimates work within the confines of their preference elicitation procedures, but do not generalize to the marketplace.<br />
<br />
We have other options in R, as I have suggested in my last two posts. Although the independent variables in a choice model are set by the researcher, we are free to transform them, for instance, computing price as a logarithm or fitting low-order polynomials of the original features. We are free to go further. Perceived features can be much more complex and constructed as nonlinear latent variables from the original data. For example, <a href="http://joelcadwell.blogspot.com/2016/03/understanding-statistical-models.html">neural networks enable us to handle a feature-rich description of the alternatives and fit adaptive basis functions with hidden layers</a>.<br />
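Those transformations are one-liners; a Python sketch with an invented design matrix whose columns stand for price and screen size:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Three hypothetical camera profiles: (price, screen_size)
X = np.array([[199.0, 3.0], [249.0, 3.5], [299.0, 4.0]])

log_price = np.log(X[:, [0]])                 # price entered as a logarithm
poly = PolynomialFeatures(degree=2, include_bias=False)
X_quad = poly.fit_transform(X)                # adds squares and the interaction
print(X_quad.shape)  # (3, 5): the 2 originals plus x1^2, x1*x2, x2^2
```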
<br />
On the other hand, I have had some success exploiting the natural variation within product categories with many offerings (e.g., <a href="http://joelcadwell.blogspot.com/2014/10/modeling-plenitude-and-speciation-by.html">recommender systems for music</a>, movies, and online shopping like Amazon). By embedding measurement within the actual purchase occasion, we can learn the when, why, what, how and where of consumption. We might discover the limits of a swivel screen in bright sunlight or when only one hand is free. The feature that appeared so valuable when introduced in the choice model may become a liability after reading users' comments.<br />
<br />
Features described in choice sets are not the same features that consumers consider when purchasing and imagining future usage. This more realistic product representation requires that we move from those R packages that restrict the input space (choice modeling) to those R packages that enable the analysis of high-dimensional sparse matrices with adaptive basis functions (<a href="http://joelcadwell.blogspot.com/2016/03/choice-modeling-with-features-defined.html">neural networks and matrix factorization</a>).<br />
<br />
<b>Bottom Line</b>: The data collection process employed to construct and display options when repeated choice sets are presented one after another tends to simplify the purchase task and induce a decision strategy consistent with regression models we find in several R packages (e.g., bayesm, mlogit, and Rchoice). However, when the purchase process involves extensive search over many offerings (e.g., music, movies, wines, cheeses, vacations, restaurants, cosmetics, and many more) or multiple usage occasions (e.g., work, home, daily, special events, by oneself, with others, involving children, time of day, and other contextual factors), we need to look elsewhere within R for statistical models that allow for the construction of complex and nonlinear latent variables or hidden layers that serve as the derived input for decision making (e.g., R packages for deep learning or matrix factorization).Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-48523833103215117052016-03-25T17:48:00.001-07:002016-03-25T17:48:46.213-07:00Choice Modeling with Features Defined by Consumers and Not ResearchersChoice modeling begins with a researcher "<a href="https://en.wikipedia.org/wiki/Choice_modelling">deciding on what attributes or levels fully describe the good or service</a>." This is consistent with the early neural networks in which features were precoded outside of the learning model. That is, choice modeling can be seen as learning the feature weights that recognize whether the input was of type "buy" or not.<br />
<br />
<a href="http://joelcadwell.blogspot.com/2016/03/understanding-statistical-models.html">As I have argued in the previous post</a>, the last step in the purchase task may involve attribute tradeoffs among a few differentiating features for the remaining options in the consideration set. The aging shopper removes two boxes of cereal from the well-stocked supermarket shelves and decides whether low-sodium beats low-fat. The choice modeler is satisfied, but the package designer wants to know how these two boxes got noticed and selected for comparison. More importantly for the marketer, how is the purchase being framed by the consumer? Is it advertising that focused attention on nutrition? Was it health claims by other cereal boxes nearby on the same shelf?<br />
<br />
With caveats concerning the need to avoid caricature, one can describe this conflict between the choice modeler and the marketer in terms of shallow versus deep learning (<a href="http://www.cs.nyu.edu/~yann/talks/lecun-tutorial-icml-2013.pdf">see slide #2 from Yann LeCun's 2013 tutorial</a> with <a href="http://techtalks.tv/talks/deep-learning/58122/">video here</a>). From this perspective, choice modeling is a form of more shallow information integration where the features are structured (varied according to some experimental design) and presented in a simplified format (<a href="https://cran.r-project.org/web/packages/support.CEs/index.html">the R package support.CEs aids in this process</a> and <a href="http://joelcadwell.blogspot.com/2014/11/lets-do-some-more-hierarchical-bayes.html">you can find R code for hierarchical Bayes using bayesm in this link</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUBZEb0A1YHZs4wXyDjn1TltddvcbUETn1XQoJetKVgux6H2JU7QXxiCdcSMGoa8cS2zEm-5zl0kXS27dOXBCrprcE2EJU1Qmya5Z1lhrhkZpy9VoJq9x0GPse67IzhHQyR7_XY_vn9Ws/s1600/shallow+vs+deep+learning.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUBZEb0A1YHZs4wXyDjn1TltddvcbUETn1XQoJetKVgux6H2JU7QXxiCdcSMGoa8cS2zEm-5zl0kXS27dOXBCrprcE2EJU1Qmya5Z1lhrhkZpy9VoJq9x0GPse67IzhHQyR7_XY_vn9Ws/s640/shallow+vs+deep+learning.jpg" width="640" /></a></div>
<br />
Choice modeling or information integration is illustrated on the upper left of the above diagram. The capital S's are the attribute inputs that are translated into utilities so that they can be evaluated on a common value scale. Those utilities are combined or integrated and yield a summary measure that determines the response. For example, if low-fat were worth two units and low-sodium worth only one unit, you would buy the low-fat cereal. The modeling does not scale well, so we need to limit the number of feature levels. Moreover, in order to obtain individual estimates, we require repeated measures from different choice sets. The repetitive task encourages us to streamline the choice sets so that feature tradeoffs are easier to see and make. The constraints of an experimental design force us toward an idealized presentation so that respondents have little choice but information integration.<br />
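The integration step is just arithmetic. Using the part-worths from the cereal example above (the value for crunchy is an invented third level), a Python sketch:

```python
# Utilities for each feature level are summed on a common value scale,
# and the larger total determines the response.
partworths = {"low_fat": 2.0, "low_sodium": 1.0, "crunchy": 0.5}

def utility(features):
    return sum(partworths[f] for f in features)

box_a = ["low_fat", "crunchy"]      # 2.0 + 0.5 = 2.5
box_b = ["low_sodium", "crunchy"]   # 1.0 + 0.5 = 1.5
print("buy box A" if utility(box_a) > utility(box_b) else "buy box B")  # buy box A
```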
<br />
Deep learning, on the other hand, has multiple hidden layers that model feature extraction by the consumer. The goal is to eat a healthy cereal that is filling and tastes good. Which packaging works for you? Does it matter if the word "fiber" is included? We could assess the impact of the fiber labeling by turning it on and off in an experimental design. But that only draws attention to the features that are varied and limits any hope of generalizing our findings beyond the laboratory. Of course, it depends on whether you are buying for an adult or a child, and whether the cereal is for breakfast or a snack. <a href="http://joelcadwell.blogspot.com/2014/01/context-matters-when-modeling-human.html">Contextual effects force us to turn to statistical models that can handle the complexities of real world purchase processes</a>.<br />
<br />
<a href="http://www.r-bloggers.com/things-to-try-after-user-part-1-deep-learning-with-h2o/">R does offer an interface to deep learning algorithms</a>. However, you can accomplish something similar with nonnegative matrix factorization (NMF). The key is not to force a basis onto the statistical analysis. Specifically, choice modeling relies on a regression analysis with the features as the independent variables. We can expand this basis by adding transformations of the original features (e.g., the log of price or inserting polynomial expansions of variables already in the model). However, the regression equation will reveal little if the consumer infers some hidden or latent features from a particular pattern of feature combinations (e.g., a fragment of the picture plus captions along with the package design triggers childhood memories or activates aspirational drives).<br />
<br />
Deep learning excels with the complexities of language and vision. NMF seems to work well in the more straightforward world of product preference. As an example, Amazon displays several thousand cereals that span much of what is available in the marketplace. We can limit ourselves to a subset of the 100 or more most popular cereals and ask respondents to indicate their interest in each cereal. We would expect a sparse data matrix with blocks of joint groupings of both respondents with similar tastes and cereals with similar features (e.g., variation on flakes, crunch or hot cereals). The joint blocks define the hidden layers simultaneously clustering respondents and typing products.<br />
<br />
Matrix factorization or decomposition seeks to reconstruct the data in a matrix from a smaller number of latent features. I have discussed its relationship to deep learning in <a href="http://joelcadwell.blogspot.com/2015/03/brand-and-product-category.html">a post on product category representation</a>. It ends with a listing of examples that include the code needed to run NMF in R. You can think of NMF as a dual factor analysis with a common set of factors for both rows (consumers) and columns (cereals in this case). Unlike principal component or factor analysis, there are no negative factor loadings, which is why NMF is nonnegative. The result is a <a href="http://www.columbia.edu/~jwp2128/Teaching/W4721/papers/nmf_nature.pdf">data matrix reconstructed from parts</a> that are not imposed by the statistician but revealed in the attempt to reproduce the consumer data.<br />
<br />
We might expect to find something similar to what <a href="http://www.acrwebsite.org/volumes/9795/volumes/v08/NA-08">Jonathan Gutman reported from a qualitative study using a means-end analysis</a>. I have copied his Figure 3 showing what consumers said when asked about crunchy cereals. Of course, all we obtain from our NMF are weights that look like factor loadings for respondents and cereals. If there is a crunch factor, you will see all the cereals with crunch loading on that hidden feature with all the respondents wanting crunch with higher weights on the same hidden feature. Obviously, in order to know which respondents wanted something crunchy in their cereal, you would need to ask a separate question. Similarly, you might inquire about cereal perceptions or have experts rate the cereals to know which cereals produce the biggest crunch. Alternatively, one could cluster the respondents and cereals and profile those clusters.<br />
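The dual loadings can be seen in a small synthetic example; the planted blocks below stand in for crunchy fans and hot-cereal fans, and the numbers are generated solely for illustration:

```python
import numpy as np
from sklearn.decomposition import NMF

# A 40-respondent by 10-cereal interest matrix with two planted blocks.
rng = np.random.default_rng(2)
data = np.zeros((40, 10))
data[:20, :5] = rng.uniform(2, 4, (20, 5))   # segment 1 likes crunchy cereals
data[20:, 5:] = rng.uniform(2, 4, (20, 5))   # segment 2 likes hot cereals

model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(data)   # respondent weights on the hidden features
H = model.components_           # cereal weights on the same hidden features
print(W.shape, H.shape)         # (40, 2) (2, 10): one factor per block
```

Both W and H are nonnegative, so each hidden feature reads as an additive part: a block of respondents joined to the block of cereals they like.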
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh72Hl_d8ta5RXbYBy_vXeZzyFlt_Nn2ABk1Tr4PGH90WCZzjefvCxtC7go1m_aK1Z6wQfeWbsaJaqvmheeojfN_FyTRuScOYWLnHeO7YuL61MiYBqPy2ZoX9GWovIMNtk_1aJRAukcR3A/s1600/crunchy.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh72Hl_d8ta5RXbYBy_vXeZzyFlt_Nn2ABk1Tr4PGH90WCZzjefvCxtC7go1m_aK1Z6wQfeWbsaJaqvmheeojfN_FyTRuScOYWLnHeO7YuL61MiYBqPy2ZoX9GWovIMNtk_1aJRAukcR3A/s640/crunchy.gif" width="584" /></a></div>
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-35188213867370760122016-03-21T11:44:00.000-07:002016-03-21T11:44:50.104-07:00Understanding Statistical Models Through the Datasets They Seek to Explain: Choice Modeling vs. Neural NetworksR may be the lingua franca, yet many of the packages within the R library seem to be written in different languages. We can follow the R code because we know how to program but still feel that we have missed something in the translation.<br />
<br />
R provides an open environment for code from different communities, each with their own set of exemplars, where the term "exemplar" has been borrowed from <a href="https://en.wikipedia.org/wiki/Exemplar_(Kuhn)">Thomas Kuhn's work on normal science</a>. You need only to examine the datasets that each R package includes to illustrate its capabilities in order to understand the diversity of paradigms spanned. As an example, the datasets from the <a href="https://cran.r-project.org/web/views/Cluster.html">Clustering and Finite Mixture Task View</a> demonstrate the dependence of the statistical models on the data to be analyzed. Those seeking to identify communities in social networks might be using similar terms as those trying to recognize objects in visual images, yet the different referents (exemplars) change the meanings of those terms.<br />
<br />
<b>Thinking in Terms of Causes and Effects</b><br />
<br />
Of course, there are exceptions, for instance, regression models can be easily understood across applications as the "pulling of levers" especially for those of us seeking to intervene and change behavior (e.g., marketing research). Increased spending on advertising yields greater awareness and generates more sales, that is, pulling the ad spending lever raises revenue (<a href="http://www.r-bloggers.com/causalimpact-a-new-open-source-package-for-estimating-causal-effects-in-time-series/">see the R package CausalImpact</a>). The same reasoning underlies choice modeling with features as levers and purchase as the effect <a href="http://joelcadwell.blogspot.com/2013/03/lets-do-some-hierarchical-bayes-choice.html">(see the R package bayesm</a>).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2bC8AmUh51q8zWO4dl-mr-X7Zf4oM18DkdYJnE6d6BoA5Vs7JHkVKLVWXFFFI9j2DOK64G-Tf-8Xoucp1KRMxqe31J-q65OXH2Pa1XFQJJv8odb-uKQx_tOi2rtubNdBbUzjXlkif6E0/s1600/Lever.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="258" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2bC8AmUh51q8zWO4dl-mr-X7Zf4oM18DkdYJnE6d6BoA5Vs7JHkVKLVWXFFFI9j2DOK64G-Tf-8Xoucp1KRMxqe31J-q65OXH2Pa1XFQJJv8odb-uKQx_tOi2rtubNdBbUzjXlkif6E0/s320/Lever.jpg" width="320" /></a></div>
<br />
The above picture captures this mechanistic "pulling the lever" that dominates much of our thinking about the marketing mix. The exemplar "explains" through analogy. You might prefer "adjusting the dials" as an updated version, but the paradigm remains cause-and-effect with each cause separable and under the control of the marketer. Is this not what we mean by <a href="http://joelcadwell.blogspot.com/2012/08/the-relative-importance-of-predictors.html">the relative contribution of predictors</a>? Each independent variable in a regression equation has its own unique effect on the outcome. We pull each lever a distance of one standard deviation (the beta weight), sum the changes on the outcome (sometimes these betas are squared before adding), and then divide by the total.<br />
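That relative-contribution computation, sketched in Python on simulated data (the squared-beta variant; coefficients and noise are invented):

```python
import numpy as np

# Simulate y from three predictors with known effects, then express each
# squared standardized beta as a share of the total.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = 2 * X[:, 0] + 1 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=1000)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized predictors
yz = (y - y.mean()) / y.std()
beta, *_ = np.linalg.lstsq(Z, yz, rcond=None)   # standardized betas
share = beta ** 2 / (beta ** 2).sum()
print(share.round(2))  # the first lever dominates
```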
<br />
<b>The Challenge from Neural Networks</b><br />
<br />
So, how do we make sense of neural networks and deep learning? Is the <a href="https://journal.r-project.org/archive/2010-1/RJournal_2010-1_Guenther+Fritsch.pdf">R package neuralnet</a> simply another method for curve fitting or estimating the impact of features? Geoffrey Hinton might think differently. The <a href="https://www.coursera.org/course/neuralnets">Intro Video for Coursera's Neural Networks for Machine Learning</a> offers a different exemplar - handwritten digit recognition. If he is curve fitting, the features are not given but extracted so that learning is possible (i.e., the features are not obvious but constructed from the input to solve the task at hand). The first chapter of Michael Nielsen's online book, <a href="http://neuralnetworksanddeeplearning.com/chap1.html">Using Neural Nets to Recognize Handwritten Digits</a>, provides the details. <a href="http://clopinet.com/isabelle/Projects/ETH/">Isabelle Guyon's pattern recognition course</a> adds an animated gif displaying visual perception as an active process.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-OgjIestWLYX0SAUV0yGf1ZTm73C4jiPJ_p1HaN9VAuxSxguKlzhb3AsASQBbtqhss8h4SUHVqYOpgnegEjQ49WKEy0Qx6zF398ykkxFI59kTWAV3uMi_Id3Y_zYkJBwA4NlaG1yAuU0/s1600/feature+extraction.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-OgjIestWLYX0SAUV0yGf1ZTm73C4jiPJ_p1HaN9VAuxSxguKlzhb3AsASQBbtqhss8h4SUHVqYOpgnegEjQ49WKEy0Qx6zF398ykkxFI59kTWAV3uMi_Id3Y_zYkJBwA4NlaG1yAuU0/s200/feature+extraction.gif" width="192" /></a></div>
<br />
On the other hand, a choice model begins with the researcher deciding what features should be varied. The product space is partitioned and presented as structured feature lists. What alternative does the consumer have, except to respond to variations in the feature levels? I attend to price because you keep changing the price. Wider ranges and greater variation only focus my attention. However, in real settings the shelves and the computer screens are filled with competing products waiting for consumers to define their own differentiating features. <a href="http://joelcadwell.blogspot.com/2015/06/looking-for-preference-in-all-wrong.html">Smart Watches from Google Shopping</a> provides a clear illustration of the divergence of purchase processes in the real world and in the laboratory.<br />
<br />
To be clear, when the choice model and the neural network speak of input, they are referring to two very different things. The exemplars from choice modeling are deciding how best to commute and comparing a few offers for the same product or service. This works when you are choosing between two cans of chicken soup by reading the ingredients on their labels. It does not describe how one <a href="http://joelcadwell.blogspot.com/2015/04/conjoint-analysis-and-strange-world-of.html">selects a cheese from the huge assortment</a> found in many stores.<br />
<br />
Neural networks take a different view of the task. In less than five minutes Hinton's video provides the exemplar for representation learning. Input enters as it does in real settings. Features that successfully differentiate among the digits are learned over time. We see that learning in the video when the neural net generates its own handwritten digits for the numbers 2 and 8. It is not uncommon to write down a number that later we or others have difficulty reading. Legibility is valued so that we can say that an easier to read "2" is preferred over a "2" that is harder to identify. But what makes one "2" a better two than another "2" takes some training, as machine learning teaches us.<br />
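The digit exemplar scales down nicely; a small network trained on the 8x8 digit images that ship with scikit-learn learns its own features from raw pixels, with no hand-coded feature list (a sketch, not Hinton's architecture):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Raw pixel intensities in, digit labels out; the hidden layer constructs
# whatever differentiating features the task demands.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X / 16.0, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
net.fit(X_train, y_train)
print(round(net.score(X_test, y_test), 2))  # held-out accuracy, typically high
```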
<br />
We are all accomplished at number recognition and forget how much time and effort it took to reach this level of understanding (unless we know young children in the middle of the learning process). What year is MCMXCIX? The letters are important, but so are their relative positions (e.g. X=10 and IX=9 in the year 1999). We are not pulling levers any more, at least not until the features have been constructed. What are those features in typical choice situations? What you want to eat for breakfast, lunch or dinner (unless you snack instead) often depends on your location, available time and money, future dining plans, motivation for eating, and who else is present (<a href="http://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/2364/2229">context-aware recommender systems</a>).<br />
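The Roman numeral example makes the point computable: the same symbol contributes differently depending on what follows it, which is exactly the position-dependent feature a learner must construct.

```python
# The subtractive-pair rule that makes IX nine and XC ninety
# decodes MCMXCIX as 1999.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s):
    total = 0
    for i, ch in enumerate(s):
        v = VALUES[ch]
        # a smaller value before a larger one is subtracted (IX, XC, CM)
        if i + 1 < len(s) and VALUES[s[i + 1]] > v:
            total -= v
        else:
            total += v
    return total

print(roman_to_int("MCMXCIX"))  # 1999
```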
<br />
Adopting a different perspective, our choice modeler sees the world as well-defined and decomposable into separate factors that can be varied systematically according to some experimental design. Under such constraints the consumer behaves as the model predicts (a self-fulfilling prophecy?). Meanwhile, in the real world, consumers struggle to learn a product representation that makes choice possible.<br />
<br />
<b>Thinking Outside the Choice Modeling Box</b><br />
<br />
The features we learn may be relative to the competitive set, which is why adding a more expensive alternative makes what is now the mid-priced option appear less expensive. Situation plays an important role, for the movie I view when alone is not the movie I watch with my kids. Framing has an impact, which is why advertising tries to convince you that an expensive purchase is a gift that you give to yourself. Moreover, we cannot forget intended usage, for that smartphone is also a camera and a GPS, and I believe you get the point. We may have many more potential features than are included in our choice design.<br />
<br />
It may be the case that the final step before purchase can be described as a tradeoff among a small set of features varying over only a few alternatives in our consideration set. If we can mimic that terminal stage with a choice model, we might have a good chance to learn something about the marketplace. How did the consumer get to that last choice point? Why these features and those alternative products or services? In order to answer such questions, we will need to look outside the choice modeling box.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-590043897961646114.post-73226796115620140502016-01-08T14:09:00.000-08:002016-01-08T14:09:43.558-08:00A Data Science Solution to the Question "What is Data Science?"<div class="separator" style="clear: both; text-align: left;">
As this <a href="https://en.wikipedia.org/wiki/Data_science">flowchart from Wikipedia illustrates</a>, data science is about collecting, cleaning, analyzing and reporting data. But is it data science or just a "sexed up term" for Statistics (see embedded quote by Nate Silver)? It's difficult to separate the two at this level of generality, so perhaps we need to define our terms.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnHA5fAX-KlmML2v3_UDTkIKe2-qxoQGxKTSDqGGK8l3lGDXRK6nBjbLerrcFDtlzFvwO5kaWP2ASWFv94wowRHWjEXZNxk8pLJnGkao85p-yh7-nN2JpYzgSjS_tGh9zuTDjj0sMHeBA/s1600/Data_visualization_process_v1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnHA5fAX-KlmML2v3_UDTkIKe2-qxoQGxKTSDqGGK8l3lGDXRK6nBjbLerrcFDtlzFvwO5kaWP2ASWFv94wowRHWjEXZNxk8pLJnGkao85p-yh7-nN2JpYzgSjS_tGh9zuTDjj0sMHeBA/s400/Data_visualization_process_v1.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
We begin by making a list of all the stuff that a data scientist might do or know. We are playing a game where the answer is "data scientist" and the questions are "Do they do this?" and "Do they know that?". However, the "this" and the "that" are very specific. For example, "Data is Processed" can range from simple downloading to the complex representation of visual or speech input. What precisely does a data scientist do when they process data that a programmer or a statistician does not do?</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
To be clear, I am constructing a very long questionnaire that I intend to distribute to individuals calling themselves data scientists along with everyone else claiming that they too do data science, although by another name. A checklist will work in our <a href="https://en.wikipedia.org/wiki/Twenty_Questions">game of Twenty Questions</a> as long as the list is detailed and exhaustive. You are welcome to add suggestions as comments to this post, but we can start by expanding on each of the boxes in the above data science flowchart.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Since I am a marketing researcher, I am inclined to analyze the resulting data matrix as if it were a shopping cart filled with items purchased from a grocery store or an inventory of downloads from a video or music provider. The rows are respondents, and the columns are all the questions that might be asked to distinguish among all the various players. Let's not include sexy as a column.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
You may have guessed that I am headed toward some type of <a href="http://joelcadwell.blogspot.com/2014/10/modeling-plenitude-and-speciation-by.html">matrix factorization</a>. Can we recognize patterns in the columns that reflect different configurations of study and behavior? Are there communities composed of rows clustered together with similar practices and experiences? R provides most of us who have some experience running factor and cluster analyses with a "doable" introduction to non-negative matrix factorization (NMF). You can think of it as <a href="http://joelcadwell.blogspot.com/2014/09/the-ecology-of-data-matrices-metaphor.html">simultaneous clustering of the rows and columns in a data matrix</a>. My blog is filled with examples, none of which are easy, but none of which are incomprehensible or beyond your ability to adapt to your own datasets.</div>
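To make that simultaneous clustering concrete, here is a minimal sketch of the Lee-Seung multiplicative updates in base R, applied to a made-up checklist with two blocks of respondents answering two blocks of questions; the NMF package used elsewhere on this blog does the same thing at scale, so treat this as a toy, not the analysis pipeline:

```r
# Toy checklist: rows are respondents, columns are yes/no questions,
# with two blocks of respondents endorsing two blocks of questions.
X <- rbind(matrix(rep(c(1, 1, 1, 0, 0, 0), 5), 5, 6, byrow = TRUE),
           matrix(rep(c(0, 0, 0, 1, 1, 1), 5), 5, 6, byrow = TRUE))

# Lee-Seung multiplicative updates minimizing Frobenius loss ||X - WH||
nmf_lee <- function(X, r, iters = 200, eps = 1e-9) {
  W <- matrix(runif(nrow(X) * r), nrow(X), r)   # row (respondent) loadings
  H <- matrix(runif(r * ncol(X)), r, ncol(X))   # column (question) loadings
  for (i in seq_len(iters)) {
    H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
    W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

set.seed(1)
fit <- nmf_lee(X, 2)
# Assigning each row and column to its dominant component recovers the
# two blocks: a simultaneous clustering of respondents and questions.
apply(fit$W, 1, which.max)
apply(fit$H, 2, which.max)
```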
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What are we likely to find? Will we discover something like <a href="https://cs.stanford.edu/~rishig/courses/ref/l9b.pdf">anchor words from topic modeling</a>? For instance, is it necessary to work with multiple datasets from different disciplines to be a data scientist? Would I stop calling myself a marketing scientist if I started working with political polling data? Some argue that one becomes a statistician when they begin consulting with others from divergent fields of study.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What about teaching students with varied backgrounds in universities or industry? Do we call it data science if one writes and distributes software that others can apply with data across diverse domains? Does proving theorems make one a statistician? How many languages must one know before they are a programmer? What role does computation play when making such discriminations?</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What will we learn from dissecting the "corpus" (the detailed body of what we do and know summarized by the boxes in the above data science process)? Extending this analogy, I am recommending a "physician, heal thyself" approach: apply data science methodology to answer the "What is Data Science?" question.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Hopefully, we can avoid the hype and the caricature from the popular press (<a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/">sexiest job of 21st century</a>). Moreover, I suggest that we resist <a href="http://joelcadwell.blogspot.com/2014/12/archetypal-analysis-similarity-defined.html">the tendency to think metaphorically in terms of contrasting ideals</a>. The simple act of comparing statisticians and data scientists shapes our perceptions and leads us to see the two as more dissimilar than suggested by their training and behavior. The distinction may be more nuance than substance, reflecting what excites and motivates rather than what is known or done. The basis for separation may reside in how much personal satisfaction is derived from the subject matter or the programming rather than the computational algorithm or the generative model.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-80906369341961947692015-12-23T14:00:00.000-08:002015-12-23T14:00:09.826-08:00Modeling How Consumers Simplify the Purchase Process by Copying Others<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPEGqCJ4s0ve7inonPAufPF7OwsN4XSMzHBaLmX0CQr90Qa1CN_YQNVAkt96_v4ABEtpWge-E2Mqtzlt8rFessaio0JulHhfdKuzAxUZ7u39Hj7RUQRW5plHhVJWZG8Xxg-mgBIOghUUo/s1600/iow-hummingbirdbill.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPEGqCJ4s0ve7inonPAufPF7OwsN4XSMzHBaLmX0CQr90Qa1CN_YQNVAkt96_v4ABEtpWge-E2Mqtzlt8rFessaio0JulHhfdKuzAxUZ7u39Hj7RUQRW5plHhVJWZG8Xxg-mgBIOghUUo/s320/iow-hummingbirdbill.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><b><span style="font-size: large;">A Flower That Fits the Bill</span></b></td></tr>
</tbody></table>
Marketing borrows <a href="http://www.hhmi.org/biointeractive/flower-fits-bill">the biological notion of coevolution</a> to explain the progressive "fit" between products and consumers. While evolutionary time may seem a bit slow for product innovation and adoption, the same metaphor can be found in models of assimilation and accommodation from cultural and cognitive psychology.<br />
<br />
The digital camera was introduced as an alternative to film, but soon redefined how pictures are taken, stored and shared. The selfie stick is but the latest step in this process by which product usage and product features coevolve over time with previous cycles enabling the next in the chain. <a href="http://www.dpreview.com/articles/8669995401/is-it-the-smartphone-or-lack-of-fun-killing-the-camera">Is it the smartphone or the lack of fun that's killing the camera</a>?<br />
<br />
The diffusion of innovation unfolds in the marketplace as a social movement with the behavior of early adopters copied by the more cautious. For example, "cutting the cord" can be a lifestyle change involving both social isolation from conversations among those watching live sporting events and a commitment to learning how to retrieve television-like content from the Internet. <a href="http://techcrunch.com/2015/02/01/the-diary-of-a-cord-cutter-in-2015/">The Diary of a Cord-Cutter in 2015</a> offers a funny and informative qualitative account. Still, one needs the timestamp because cord-cutting is an evolving product category. The market will become larger and more diverse with more heterogeneous customers (assimilation) and greater differentiation of product offerings (accommodation).<br />
<br />
So, we should be able to agree that product markets are the outcome of dynamic processes involving both producers and customers (see <a href="https://archive.ama.org/archive/ResourceLibrary/JournalofMarketing/documents/2444277.pdf">Sociocognitive Dynamics in a Product Market</a> for a comprehensive overview). User-centered product design takes an additional step and creates <a href="https://en.wikipedia.org/wiki/Persona_(user_experience)">fictional customers or personas</a> in order to find the perfect match. Shoppers do something similar when they anticipate how they will use the product they are considering. User types can be real (an actual person) or imagined (a persona). If this analysis is correct, then both customers and producers should be looking at the same data: the cable TV customer to decide if they should become cord-cutters and the cable TV provider to identify potential defectors.<br />
<br />
<b>Identifying the Likely Cord-Cutter</b><br />
<br />
We can ask about your subscriptions (cable TV, internet connection, Netflix, Hulu, Amazon Prime, Sling, and so on). It is a long list, and we might get some frequency of usage data at the same time. This may be all that we need, especially if we probe for the details (e.g., cable TV usage would include live sports, on-demand movies, kids' shows, HBO or other channel subscriptions, and continue until just before respondents become likely to terminate on-line surveys). Concurrently, it might be helpful to know something about your hardware, such as TVs, DVDs, DVRs, media streamers and other stuff.<br />
<br />
A form of reverse engineering guides our data collection. Qualitative research and personal experience give us some idea of the usage types likely to populate our customer base. Cable TV offers a menu of bundled and à la carte hardware and channels. Only some of the alternatives are mutually exclusive; otherwise, you are free to create your own assortment. Internet availability only increases the number of options, which you can watch on a television, a computer, a tablet or a phone. Plus, there is always free broadcast TV captured with an antenna, and there are DVDs to rent or buy. We ought not to forget DVRs and media streamers (e.g., Roku, Apple TV, Chromecast, and Amazon Fire Stick). Obviously, there is no reason to stop with usage, so why not extend the scale to include awareness and familiarity? You might not be a cord-cutter, though you may be on your way if you know all about <a href="https://en.wikipedia.org/wiki/Sling_TV">Sling TV</a>.<br />
<br />
<b><a href="http://joelcadwell.blogspot.com/2015/11/mutually-exclusive-clusters-are-boxes.html">Traditional segmentation will not be able to represent this degree of complexity</a>.</b><br />
<br />
Each consumer defines their own personal choices by arranging options in a continually changing pattern that does not depend on existing bundles offered by providers. Consequently, whatever statistical model is chosen must be open to the possibility that every non-contradictory arrangement is possible. Yet not every combination will survive, for some will be dominated by others and never achieve a sustainable audience.<br />
<br />
We could display this attraction between consumers and offerings as a bipartite graph (<a href="http://barabasi.com/networksciencebook/content/book_chapter_2.pdf">Figure 2.9 from Barabasi's Network Science)</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIHeknWgZfKM3ak7jLYk_0JCFeE1uDD6-yDgZSBdX2TPI_KHlvtn4YalmQajNbCrt24lovMaBo-tIn2z6E40F7jitSqKlS0Cr7a6nPCmWybqBXqDcB_GS4DpLHZbwr-g7hGL1oDj-Y1eg/s1600/bipartite+graph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiIHeknWgZfKM3ak7jLYk_0JCFeE1uDD6-yDgZSBdX2TPI_KHlvtn4YalmQajNbCrt24lovMaBo-tIn2z6E40F7jitSqKlS0Cr7a6nPCmWybqBXqDcB_GS4DpLHZbwr-g7hGL1oDj-Y1eg/s400/bipartite+graph.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Consumers are listed in U, and a line is drawn to the offerings in V that they might wish to purchase (shown in the center panel). It is this linkage between U and V that produces the consumer and product networks in the two side panels. The A-B and B-C-D cliques of offerings in Projection V would be disjoint without customer U_5. Moreover, the 1-2-3 and 4-5-6-7 consumer clusters are connected by the presence of offering B in V. Removing B or #5 cuts the graph into independent parts.<br />
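The two projections can be computed directly from the bipartite incidence matrix by matrix multiplication. The small example below is hypothetical: the edges are chosen to reproduce the properties just described (U_5 bridging the offering cliques, offering B linking the consumer clusters), not to copy Barabasi's exact figure.

```r
# Hypothetical incidence matrix: rows = consumers in U, cols = offerings in V
inc <- rbind(
  "1" = c(A = 1, B = 1, C = 0, D = 0),
  "2" = c(A = 1, B = 1, C = 0, D = 0),
  "3" = c(A = 1, B = 1, C = 0, D = 0),
  "4" = c(A = 0, B = 0, C = 1, D = 1),
  "5" = c(A = 0, B = 1, C = 1, D = 1),
  "6" = c(A = 0, B = 0, C = 1, D = 1),
  "7" = c(A = 0, B = 0, C = 1, D = 1))

proj_U <- inc %*% t(inc)   # consumers linked by co-purchased offerings
proj_V <- t(inc) %*% inc   # offerings linked by shared customers
proj_V
# Off-diagonal cells count shared customers: A-B = 3 (consumers 1-3),
# while B-C and B-D = 1 because only consumer 5 bridges the two cliques.
```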
<br />
Actual markets contain many more consumers in U, and the number of choices in V can be extensive. Consumer heterogeneity creates complexities for the marketer trying to discover structure in Projection U. Besides, the task is not any easier for an individual consumer who must select the best from a seemingly overwhelming number of alternatives in Projection V. Luckily, one trick frees the consumer from having to learn all the options that are available and being forced to make all the difficult tradeoffs - simply do as others do (as in <a href="https://en.wikipedia.org/wiki/Observational_learning">observational learning</a>). The other can be someone you know or read about as in the above Diary of a Cord-Cutter in 2015. There is no need for a taxonomy of offerings or a complete classification of user types.<br />
<br />
In fact, it has become popular to believe that social diffusion or contagion models describe the actual adoption process (e.g., <a href="https://en.wikipedia.org/wiki/The_Tipping_Point">The Tipping Point</a>). Regardless, over time, the U's and V's in the bipartite interactions of customers and offerings come to organize each other through mutual influence. Specifically, potential customers learn about the cord-cutting persona through the social and professional media and at the same time come to group together those offerings that the cord-cutter might purchase. Offerings are not alphabetized or catalogued as an academic exercise. There is money to be saved and entertainment to be discovered. Sorting needs to be goal-directed and efficient. I am ready to binge-watch, and I am looking for a recommendation.<br />
<br />
<b>"<a href="https://mitpress.mit.edu/books/ill-have-what-shes-having">I'll Have What She's Having</a>"</b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
It has taken some time to outline how consumers are able to simplify a complex purchase process by modeling the behavior of others. It is such a common experience, and yet rational decision theory continues to control our statistical modeling of choice. As you are escorted to your restaurant table, you cannot help but notice a delicious meal being served next to where you are seated. You refuse a menu and simply ask for the same dish. "I'll Have What She's Having" works as a decision strategy only when I can identify the "she" and the "what" simultaneously.<br />
<br />
If we intend to analyze the data we have just talked about collecting, we will need a statistical model. Happily, the R Project for Statistical Computing implements at least two approaches for such joint identification: a <a href="http://www.jstatsoft.org/article/view/v024i05">latent clustering of a bipartite network in the latentnet package</a> and a <a href="http://joelcadwell.blogspot.com/2015/07/the-nature-of-heterogeneity-in.html">nonnegative matrix factorization in the NMF package</a>. The Davis data from the latentnet R package will serve as our illustration. The R code for all the analyses that will be reported can be found at the end of this post.<br />
<br />
Stephen Borgatti is a good place to begin with his <a href="http://www.analytictech.com/borgatti/papers/2modeconcepts.pdf">two-mode social network analysis of the Davis data</a>. The rows are 18 women, the columns are 14 events, and the cells are zero or one depending on whether or not each woman attended each event. The nature of the events has not been specified, but since I am in marketing, I prefer to think of the events as if they were movies seen or concerts attended (i.e., events requiring the purchase of tickets). You will find a <a href="http://statnet.csde.washington.edu/workshops/SUNBELT/current/latentnet/latentnet.pdf">latentnet tutorial covering the analysis of this same data as a bipartite network</a> (section 6.3). Finally, a paper by Michael Brusco called "<a href="http://hbanaszak.mjr.uw.edu.pl/TempTxt/PDF/Brusco_2011_AnalysisOfTwoModeNetworkDataUsingNoNegativeMatrixFactorization.PDF">Analysis of two-mode network data using nonnegative matrix factorization</a>" provides a detailed treatment of the NMF approach.<br />
<br />
We will start with the plot from the latentnet R package. The names are the women in the rows and the numbered E's are the events in the columns. The events appear to be separated into two groups of E1 to E6 toward the top and E9 to E14 toward the bottom. E7 and E8 seem to occupy a middle position. The names are also divided into an upper and lower grouping with Ruth and Pearl falling between the two clusters. Does this plot not look similar to the earlier bipartite graph from Barabasi? That is, the linkages between the women and the events organize both into two corresponding clusters tied together by at least two women and two events.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLZM_Q1_5RZBWpv74VcQ5WPfDt3gu_Ig9TRKedjo7KvhSRQRQUAU-h2B7Z8yCiJEmGyhXZ1t5ODX-s-bcmPVlBRWDlbTm89u1CxhWvI0bbNYzJlkdYCAfVieK64cYlwCv2oMTT3nqjUwM/s1600/latentnet.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLZM_Q1_5RZBWpv74VcQ5WPfDt3gu_Ig9TRKedjo7KvhSRQRQUAU-h2B7Z8yCiJEmGyhXZ1t5ODX-s-bcmPVlBRWDlbTm89u1CxhWvI0bbNYzJlkdYCAfVieK64cYlwCv2oMTT3nqjUwM/s1600/latentnet.jpeg" /></a></div>
<br />
The heatmaps from the NMF reveal the same pattern for the events and the women. You should recall that NMF seeks a lower dimensional representation that will reproduce the original data table of 0s and 1s. In this case, two basis components were extracted. The mixture coefficients for the events vary from 0 to 1 with a darker red indicating a higher contribution for that basis component. The first six events (E1-E6) form the first basis component with the second basis component containing the last six events (E9-E14). As before, E7 and E8 share a more even mixture of the two basis components. Again, most of the women load on one basis component or the other with Ruth and Pearl traveling freely between both components. As you can easily verify, the names form the same clusters in both plots.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh23tdXVQLgieAXTx_ytvcNq7bqK8oqwHPsVAoWwuMQdvxxTHJY92X-bRzRCRQpp_lHJzj5oxFNVz0cOoDg5HQlMAIUQGJKm_kJjZ5WyBOp96VLb0VEJuUtOO_17w8RGPjsbeHO5netCnI/s1600/Davis+NMF.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh23tdXVQLgieAXTx_ytvcNq7bqK8oqwHPsVAoWwuMQdvxxTHJY92X-bRzRCRQpp_lHJzj5oxFNVz0cOoDg5HQlMAIUQGJKm_kJjZ5WyBOp96VLb0VEJuUtOO_17w8RGPjsbeHO5netCnI/s1600/Davis+NMF.jpeg" /></a></div>
<br />
It would help to know something about the events and the women. If E1 through E6 were all of a certain type (e.g., symphony concerts), then we could easily name the first component. Similarly, if all of the women in red at bottom of our basis heatmap played the piano, our results would have at least face validity. A more detailed description of this naming process can be found in a previous example called "<a href="http://joelcadwell.blogspot.com/2015/05/what-can-we-learn-from-apps-on-your.html">What Can We Learn from the Apps on Your Smartphone?</a>". Those wishing to learn more might want to review the link listed at the end of that post in a note.<br />
<br />
Which events should a newcomer attend? If Helen, Nora, Sylvia and Katherine are her friends, the answer is the second cluster of E9-E14. The collaborative filtering of recommender systems enables a novice to decide quickly and easily without a rational appraisal of the feature tradeoffs. Of course, a tradeoff analysis will work as well, for we have a joint scaling of products and users. If the event is a concert with a performer you love, then base your decision on a dominating feature. When in tradeoff doubt, go along with your friends.<br />
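A minimal sketch of that friend-based shortcut: score each event by how many of the newcomer's friends attended it. The attendance matrix below is invented for illustration (the names echo the Davis data, but the 0/1 entries are hypothetical, not the actual attendance records).

```r
# Toy attendance matrix: rows = friends, columns = a few events
attend <- rbind(
  Helen     = c(E9 = 1, E10 = 1, E11 = 0, E12 = 1),
  Nora      = c(E9 = 1, E10 = 0, E11 = 1, E12 = 1),
  Sylvia    = c(E9 = 0, E10 = 1, E11 = 1, E12 = 1),
  Katherine = c(E9 = 1, E10 = 1, E11 = 1, E12 = 0))

friends <- c("Helen", "Nora", "Katherine")
scores  <- colSums(attend[friends, , drop = FALSE])  # votes per event
names(scores)[which.max(scores)]   # recommend the event most friends attended
```

This is collaborative filtering stripped to its core: no features, no tradeoffs, just "do as others do."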
<br />
Finally, brand management can profit from this perspective. Personas work as a design strategy when user types are differentiated by their preference structures and a single individual can represent each group. Although user-centered designers reject segmentations that are based on demographics, attitudes, or benefit statements, an NMF can get very specific and include as many columns as needed (e.g., thousands of movies and even more music recordings). Furthermore, sparsity is not a problem and most of the entries can be zero.<br />
<br />
There is no reason why each of the basis components in the above heatmaps could not be summarized by one person and/or one event. However, <a href="http://joelcadwell.blogspot.com/2014/11/building-blocks-compelling-image-for.html">NMF forms building blocks by jointly clustering many rows and columns</a>. Every potential customer and every possible product configuration are additive compositions built from these blocks. Would not design thinking be better served with several exemplars of each user type rather than trying to generalize from a single individual? Plus, we have the linked columns telling us what attracts each user type in the desired detail provided by the data we collected.<br />
<br />
<br />
<b>R Code to Produce Plots</b><br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/latentnet">latentnet</a><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span>davis<span style="color: #009900;">)</span>
davis.fit<-ergmm<span style="color: #009900;">(</span>davis~bilinear<span style="color: #009900;">(</span>d=<span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span>+rsociality<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>davis.fit<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/graphics/pie"><span style="color: #003399; font-weight: bold;">pie</span></a>=<span style="color: black; font-weight: bold;">TRUE</span><span style="color: #339933;">,</span>rand.eff=<span style="color: blue;">"sociality"</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<span style="color: black; font-weight: bold;">TRUE</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/NMF">NMF</a><span style="color: #009900;">)</span>
data_matrix<-as.matrix.network<span style="color: #009900;">(</span>davis<span style="color: #009900;">)</span>
fit<-<a href="http://inside-r.org/packages/cran/NMF">nmf</a><span style="color: #009900;">(</span>data_matrix<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <span style="color: blue;">"lee"</span><span style="color: #339933;">,</span> nrun=<span style="color: #cc66cc;">20</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/par"><span style="color: #003399; font-weight: bold;">par</span></a><span style="color: #009900;">(</span>mfrow = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
basismap<span style="color: #009900;">(</span>fit<span style="color: #009900;">)</span>
coefmap<span style="color: #009900;">(</span>fit<span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-35446629980302909912015-12-18T15:58:00.001-08:002015-12-18T15:58:35.132-08:00BayesiaLab-Like Network Graphs for Free with RMy screen has been filled with ads from BayesiaLab since I <a href="http://www.bayesialab.com/book">downloaded their free book</a>. Just as I began to have regrets, I received an email invitation to try out their demo datasets. I was especially interested in their <a href="http://www.bayesialab.com/perfume">perfume ratings data</a>. In this monadic product test, each of 1,321 French women was presented with only one of 11 perfumes and asked to evaluate on a 10-point scale a series of fragrance-related adjectives along with a few user-imagery descriptors. I have added the 6-point purchase intent item to the analysis in order to assess its position in this network.<br />
<br />
Can we start by looking at the partial correlation network? I will refer you to <a href="http://joelcadwell.blogspot.com/2015/11/statistical-models-that-support-design.html">my post on Driver Analysis vs. Partial Correlation Analysis</a> and will not repeat that more detailed overview.<br />
<br />
Each of the nodes is a variable (e.g., purchase intent is located on the far right). An edge drawn between any two nodes shows the partial correlation between those two nodes after controlling for all the other variables in the network. The color indicates the sign of the partial correlation with green for positive and red for negative. The size of the partial correlation is indicated by the thickness of the edge.<br />
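For readers who want to compute such edge weights themselves, the full set of partial correlations comes from standardizing the inverse covariance (precision) matrix. The sketch below uses simulated data with made-up variable names rather than the perfume ratings, and it is a generic recipe, not BayesiaLab's procedure.

```r
# Partial correlations (each pair controlling for all other variables)
# from the precision matrix: pcor_ij = -P_ij / sqrt(P_ii * P_jj)
partial_cor <- function(X) {
  P  <- solve(cov(X))     # precision matrix
  pc <- -cov2cor(P)       # rescale and flip the sign of the off-diagonals
  diag(pc) <- 1
  pc
}

set.seed(7)
Z <- matrix(rnorm(300), 100, 3)
# a and b are marginally correlated only through the shared variable c
X <- cbind(a = Z[, 1] + Z[, 3], b = Z[, 2] + Z[, 3], c = Z[, 3])
round(partial_cor(X), 2)
# The a-b partial correlation is near zero once c is controlled,
# even though their marginal correlation is substantial.
```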
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3alwTBYKE2ZKUlwMZRv0Z7N6dUwVu_21xV2pObWhiJTwFVeuzoqGbmyurYmefn3rAy4TsnH2pND_KCYWwEyGSyLAAGBW0KpiQODHryyhIbYG843JcYSTko2P5ny2ddytEqPLXCU_eMCk/s1600/network+map.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3alwTBYKE2ZKUlwMZRv0Z7N6dUwVu_21xV2pObWhiJTwFVeuzoqGbmyurYmefn3rAy4TsnH2pND_KCYWwEyGSyLAAGBW0KpiQODHryyhIbYG843JcYSTko2P5ny2ddytEqPLXCU_eMCk/s1600/network+map.jpeg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Simply scanning the map reveals the underlying structure of global connections among even more strongly joined regions:<br />
<br />
<ul>
<li>Northwest - In Love / Romantic / Passionate / Radiant,</li>
<li>Southwest - Bold / Active / Character / Fulfilled / Trust / Free, </li>
<li>Mid-South - Classical / Tenacious / Quality / Timeless / High End, </li>
<li>Mid-North - Wooded / Spiced, </li>
<li>Center - Chic / Elegant / Rich / Modern, </li>
<li>Northeast - Sweet / Fruity / Flowery / Fresh, and</li>
<li>Southeast - Easy to Wear / Please Others / Pleasure. </li>
</ul>
<br />
Unlike the Probabilistic Structural Equation Model (PSEM) in Chapter 8 of BayesiaLab's book, my network is undirected because I can find no justification for assigning causality. Yet, the structure appears to be much the same for the two analyses, for example, compare this partial correlation network with BayesiaLab's Figure 8.2.3.<br />
<br />
All this looks very familiar to those of us who have analyzed consumer rating scales. First, we expect negative skew and high collinearity because consumers tend to give ratings in the upper end of the scale and their responses often are highly intercorrelated. In fact, the first principal component accounted for 64% of the total variation, and it would have been higher had Wooded and Spiced been excluded from the battery.<br />
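The computation behind such a percentage is a one-liner with prcomp(). The simulated ratings below mimic a halo effect (a shared overall evaluation plus item-specific noise) and are not the perfume data.

```r
set.seed(11)
halo    <- rnorm(200)   # each respondent's overall like/dislike
# Eight rating items, each mostly driven by the shared halo
ratings <- sapply(1:8, function(j) halo + rnorm(200, sd = 0.6))

pca <- prcomp(ratings, scale. = TRUE)
pve <- pca$sdev[1]^2 / sum(pca$sdev^2)   # share of variance on PC1
round(pve, 2)   # well above the 1/8 expected with uncorrelated items
```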
<br />
A more cautious researcher might stop after extracting a single dimension and simply conclude that the women either liked or disliked the perfumes they tested, rating everything uniformly higher or lower. Such a researcher would speak of halo effects and question whether anything more than an overall score could be extracted from the data. Nevertheless, as we see from the above partial correlation network, there is an interpretable local structure even when all the variables are highly interrelated.<br />
<br />
I have discussed this issue before in <a href="http://joelcadwell.blogspot.com/2012/08/halo-effects-and-multicollinearity.html">a post about separating global from specific factors</a>. The bifactor model outlined in that post provides another view into the structure of the perfume rating data. What if there were a global factor explaining what we might call the "halo effect" (i.e., uniformly high correlations among all the variables) and then additional specific factors accounting for the extra correlation among different subsets of variables (e.g., the regions in the above partial correlation network map)?<br />
<br />
The bifactor diagram shown below may not be pretty with so many variables to be arrayed. However, you can see the high factor loadings radiating out from the global factor g and how the specific factors F1* through F6* provide a secondary level of local structure corresponding to the regions identified in the above network.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8jI4w33jAReqGhkD4vUiN843fiQWOwf7wERvHqFPv-sXvLZ50S7L6SECjsoUHj8adQU9UUxtKF9O-2YajOFx-ibAjLdAjjJz4Z7YbrOKyxPkDxs4ZvNnW5UyfHw35eg53I9ulpRAdbRw/s1600/Bifactor.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8jI4w33jAReqGhkD4vUiN843fiQWOwf7wERvHqFPv-sXvLZ50S7L6SECjsoUHj8adQU9UUxtKF9O-2YajOFx-ibAjLdAjjJz4Z7YbrOKyxPkDxs4ZvNnW5UyfHw35eg53I9ulpRAdbRw/s1600/Bifactor.jpg" /></a></div>
<br />
I will end with a technical note. The 1321 observations were nested within the 11 perfumes with each respondent seeing only one perfume. Although we would not expect the specific perfume rated to alter the correlations (factorial invariance), mean-level differences between the perfumes could inflate the correlations calculated over the entire sample. In order to test this, I reran the analysis with deviation scores by subtracting the corresponding mean perfume score from each respondent's original ratings. The results were essentially the same.<br />
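Group-mean centering of this kind takes one line in base R. A sketch, assuming a data frame with a perfume identifier and a single rating column (both hypothetical stand-ins; the actual file has many rating columns, which could be centered the same way):

```r
# Sketch: subtract each perfume's mean from its respondents' ratings so that
# between-perfume mean differences cannot inflate correlations computed over
# the whole sample. 'df', 'perfume_id' and 'rating' are hypothetical names.
set.seed(7)
df <- data.frame(perfume_id = rep(1:11, each = 120),
                 rating = rnorm(1320, mean = rep(5:15, each = 120)))
df$deviation <- df$rating - ave(df$rating, df$perfume_id)  # group-mean centered
tapply(df$deviation, df$perfume_id, mean)                  # all essentially zero
```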
<br />
<br />
<br />
<b>R Code Needed to Import CSV File and Produce Plots</b><br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># Set working directory and import data file</span>
<a href="http://inside-r.org/r-doc/base/setwd"><span style="color: #003399; font-weight: bold;">setwd</span></a><span style="color: #009900;">(</span><span style="color: blue;">"C:/directory where file located"</span><span style="color: #009900;">)</span>
perfume<-<a href="http://inside-r.org/r-doc/utils/read.csv"><span style="color: #003399; font-weight: bold;">read.csv</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Perfume.csv"</span><span style="color: #339933;">,</span> sep=<span style="color: blue;">";"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>perfume<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/function"><span style="color: #003399; font-weight: bold;">function</span></a><span style="color: #009900;">(</span>x<span style="color: #009900;">)</span> <a href="http://inside-r.org/r-doc/base/table"><span style="color: #003399; font-weight: bold;">table</span></a><span style="color: #009900;">(</span>x<span style="color: #339933;">,</span>useNA=<span style="color: blue;">"always"</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Calculates Sparse Partial Correlation Matrix</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"qgraph"</span><span style="color: #009900;">)</span>
sparse_matrix<-EBICglasso<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cor"><span style="color: #003399; font-weight: bold;">cor</span></a><span style="color: #009900;">(</span>perfume<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span>:<span style="color: #cc66cc;">48</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> n=<span style="color: #cc66cc;">1321</span><span style="color: #009900;">)</span>
qgraph<span style="color: #009900;">(</span>sparse_matrix<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span>
label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>perfume<span style="color: #009900;">)</span><span style="color: #009900;">[</span><span style="color: #cc66cc;">2</span>:<span style="color: #cc66cc;">48</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/psych">psych</a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Purchase Intent Not Included</span>
scree<span style="color: #009900;">(</span>perfume<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span>:<span style="color: #cc66cc;">48</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span>
omega<span style="color: #009900;">(</span>perfume<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span>:<span style="color: #cc66cc;">48</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/packages/cran/nFactors">nfactors</a>=<span style="color: #cc66cc;">6</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<b>Attitudes Modeled as Networks</b> (December 10, 2015)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirZ9qxx1IA-kSLExKkrUuemIGLj0tpAlClXpQfvMTCxVzbU0M3jiFQgHi0TUkzSTgPql3H0IESc09wOD3FSb4uB2QJCUuST05Fg3wET8HDvEx62p20GHblKXtziuuD_uvjuH3ekIc0oGs/s1600/Ronald+Reagan.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirZ9qxx1IA-kSLExKkrUuemIGLj0tpAlClXpQfvMTCxVzbU0M3jiFQgHi0TUkzSTgPql3H0IESc09wOD3FSb4uB2QJCUuST05Fg3wET8HDvEx62p20GHblKXtziuuD_uvjuH3ekIc0oGs/s1600/Ronald+Reagan.jpeg" /></a></div>
<br />
In case you missed it, Jonas Dalege and his colleagues at the <a href="http://psychosystems.org/">PsychoSystems research group</a> have recently published <a href="https://www.researchgate.net/publication/283011657_Toward_a_Formalized_Account_of_Attitudes_The_Causal_Attitude_Network_%28CAN%29_Model">an article in Psychological Review</a> detailing how attitudes can be represented as network graphs. It is all done using R and a dataset that can be downloaded by <a href="http://www.electionstudies.org/studypages/download/datacenter_all_NoData.php">registering at the ANES data center</a>. You will find the R code under <a href="http://jdalege.com/scripts-and-codes/">Scripts and Code</a> in a file called ANES 1984 Analyses. With very minor changes to the size of some labeling, I was able to reproduce the above undirected graph with two R packages: IsingFit and qgraph. As usual when downloading others' files, most of the R code is data munging and deals with assigning labels and transforming ratings into dichotomies.<br />
<br />
The above graph represents the conditional independence relationships among node pairings. Specifically, edges are drawn between pairs of nodes only if they are still related after controlling for all the other nodes not in that pair. The center nodes in red are assessments of Ronald Reagan's ability, decency and caring. The groupings of the red nodes seem reasonable, for example, the thicker green edges connected knowledgeable, hard-working, decent and moral. Similarly, in touch, understands and cares are also drawn together by stronger relationships. These evaluative judgments are joined by positive green edges to the respondents' feelings of pride and hope (blue nodes). Moreover, they are pushed away by negative red pathways from darker emotional reactions such as fear, anger and disgust (green nodes).<br />
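The conditional relationships behind such an edge set can be read off the inverse of the correlation matrix. A base-R sketch with simulated continuous data (the published analysis used the Ising model on dichotomized ANES items, so this is only the continuous analogue):

```r
# Sketch: partial correlations via the precision (inverse correlation) matrix.
# An edge is drawn only where the partial correlation is nonzero, i.e., where
# two variables remain related after controlling for all the other variables.
set.seed(1)
x <- rnorm(300); y <- x + rnorm(300); z <- y + rnorm(300)  # chain x -> y -> z
R <- cor(cbind(x, y, z))
P <- -cov2cor(solve(R))     # off-diagonal entries are partial correlations
diag(P) <- 1
round(P, 2)                 # the x-z entry shrinks toward 0 once y is controlled
```

In this simulated chain, x and z are marginally correlated but conditionally independent given y, so the x-z edge would vanish from the graph even though the raw correlation is sizable.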
<br />
One should not be surprised to learn that it makes a difference whether the attitudes are scored dichotomously (e.g., yes/no, agree/disagree or present/absent) or using some ordinal rating scale. If it helps, you can think of this as you might the distinction between regression (continuous) and classification (discrete) in statistical learning theory. Thus, when I analyzed a set of mobile phone ratings gathered with 10-point scales, I borrowed a graphical lasso model called EBICglasso from the qgraph R package (see <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">Undirected Graphs When the Causality is Mutual</a>). On the other hand, the Ising model from the IsingFit R package was needed when the data came from yes/no checklists (see <a href="http://joelcadwell.blogspot.com/2015/10/the-network-underlying-consumer.html">The Network Underlying Consumer Perceptions of the European Car Market</a>).<br />
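The choice between the two estimators can be automated by inspecting the data first. A base-R sketch of the decision rule (the actual package calls, EBICglasso from qgraph and IsingFit from IsingFit, are commented out so the helper stays self-contained):

```r
# Sketch: pick a network estimator from the measurement level of the data.
# Binary items call for the Ising model; rating scales for the Gaussian
# graphical lasso (EBICglasso). 'pick_estimator' is a hypothetical helper.
pick_estimator <- function(X) {
  is_binary <- all(as.vector(X) %in% c(0, 1, NA))
  if (is_binary) {
    # fit <- IsingFit::IsingFit(X)                    # yes/no checklists
    "IsingFit"
  } else {
    # net <- qgraph::EBICglasso(cor(X), n = nrow(X))  # ordinal/continuous ratings
    "EBICglasso"
  }
}
pick_estimator(matrix(rbinom(100, 1, 0.5), nrow = 20))  # "IsingFit"
pick_estimator(matrix(rep(1:10, 10), nrow = 20))        # "EBICglasso"
```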
<br />
<br />
<b>The Topology Underlying the Brand Logo Naming Game: Unidimensional or Local Neighborhoods?</b> (December 4, 2015)<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTPD57iXyo6DhdDmkyzFvKQoe9wqcGvwy6Fp3IqQAK3MUguTJGmKt-W7VJVi-pU3ahOQReTIstpeDupqwu8-ekvUgGyOENjTFHDo30NVEJZ0_TMffTUt6xgHEuXiT7djDBJj0npej8JVE/s1600/logos.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTPD57iXyo6DhdDmkyzFvKQoe9wqcGvwy6Fp3IqQAK3MUguTJGmKt-W7VJVi-pU3ahOQReTIstpeDupqwu8-ekvUgGyOENjTFHDo30NVEJZ0_TMffTUt6xgHEuXiT7djDBJj0npej8JVE/s320/logos.jpg" width="320" /></a></div>
You can find the app on <a href="https://itunes.apple.com/us/app/logo-game-guess-the-brands/id953721694?mt=8">iTunes</a> and <a href="https://play.google.com/store/apps/details?id=com.msi.logogame&hl=en">Google Play</a>. It's a game of trivial pursuits - here's the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously for there is a <a href="https://www.facebook.com/TheLogosGameAnswers/">Facebook page</a> with cheat sheets for improving one's total score.<br />
<br />
<b>Psychometrics Sees Everything as a Test</b><br />
<br />
What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one's level of consumerism ("<a href="http://beta.merriam-webster.com/dictionary/consumerism">a preoccupation with and an inclination toward buying consumer goods</a>")? Everyone knows the most popular brands, but only the most involved are familiar with the logos of less publicized products. The question for psychometrics is whether the logos you can identify correctly can be explained by knowing only your level of consumption.<br />
<br />
For example, if you were a car enthusiast, then you would be able to name all the car logos in the above table. However, if you did not drive a car or watch commercial television or read car ads in print media, you might be familiar with only the most "popular" logos (i.e., the ones that cannot be avoided because their signage is everywhere you look). We make the assumption that everyone falls somewhere between these two extremes along a consumption continuum and assess whether we can reproduce every individual pattern of answers based solely on their location on this single dimension. Shopping intensity or consumerism is the path, and logo identifications are the sensors along that path.<br />
<br />
Specifically, if some number of N respondents played this game, it would not be difficult to rank order the 36 logos in the above table along a line stretching from 0% to 100% correct identification. Next, we examine each respondent, starting by sorting the players from those with the fewest correct identifications to those getting the most right. <a href="http://joelcadwell.blogspot.com/2013/08/using-heatmaps-to-uncover-individual.html">As shown in an earlier post</a>, a heatmap will reveal the relationship between the ease of identifying each logo and the overall logo knowledge for each individual, as measured by their total score over all the brand logos. [The R code required to simulate the data and produce the heatmap can be found at the end of this post.]<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTHABXiwWP0dXlQYedwRMoxxjKG3T5CWN1CBMDUF76ze-n_FKc-KZbB-WC7ocIVnU6SDPhpBEFZmdwCRnn_BU3Y9DVtshwAQJbr7vxNN2r8mT5W6-4W-i7uJVsCKYVRNBCPDbmow7NOtY/s1600/logos+heatmap.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTHABXiwWP0dXlQYedwRMoxxjKG3T5CWN1CBMDUF76ze-n_FKc-KZbB-WC7ocIVnU6SDPhpBEFZmdwCRnn_BU3Y9DVtshwAQJbr7vxNN2r8mT5W6-4W-i7uJVsCKYVRNBCPDbmow7NOtY/s640/logos+heatmap.jpeg" width="640" /></a></div>
You can begin by noting that blue is correct and red is not. Thus, the least knowledgeable players are in the top rows filled with the most red and the least blue. The logos along the x-axis are sorted by difficulty with the hardest to name on the left and the easiest on the right. In general, better players tend to know the harder logos. This is shown by the formation of a blue triangle as one scans towards the lower, right-hand corner. We call this a Guttman scale, and it suggests that both variation among the logos and the players can be described by a single dimension, which we might call logo familiarity or brand presence. However, one must be wary of suggestive names like "brand presence" for over time we forget that we are only measuring logo familiarity and not something more impactful.<br />
<br />
Our psychometrician might have analyzed this same data using the R package ltm for latent trait modeling. A hopefully <a href="http://joelcadwell.blogspot.com/2012/09/item-response-theory-developing-your.html">intuitive introduction to item response modeling</a> was posted earlier on this blog. Those results could be summarized with a series of item characteristic curves displaying the relationship between the probability of answering correctly and the underlying trait, labeled ability by default.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiix_hrB1m-0oCmRKgIxJU6lmqeN-490FhVqvhdrpd-iQvBVzfSoRMb_NBRe1LG0HjwCui0IWzaPqtNlIGoAIGRfe-_RB8df07Jr5QL9jpU6bUJf0jlsLS6pYROTi9BNnaozKlBRNJyQzw/s1600/logos+ICC.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiix_hrB1m-0oCmRKgIxJU6lmqeN-490FhVqvhdrpd-iQvBVzfSoRMb_NBRe1LG0HjwCui0IWzaPqtNlIGoAIGRfe-_RB8df07Jr5QL9jpU6bUJf0jlsLS6pYROTi9BNnaozKlBRNJyQzw/s640/logos+ICC.jpeg" width="640" /></a></div>
<br />
As you see in the above plot, the items are arranged from the easiest (V1) to the hardest (V36) with the likelihood of naming the logo increasing as a logistic function of the unobserved consumerism measured as z-scores and called ability because item response theory (IRT) originated in achievement testing. These curves are simple to read and understand. A player with low consumption (e.g., a z-score near -2) has a better than even chance of identifying the most popular logos, but almost zero probability of naming any of the least familiar logos. All those probabilities move up their respective S-curves together as consumers become more involved.<br />
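The S-curves in the plot are logistic functions of the latent trait. A base-R sketch of the one-parameter (Rasch) item characteristic curve, with hypothetical difficulty values:

```r
# Sketch: Rasch item characteristic curve, P(correct) = plogis(theta - b).
# Items differ only in difficulty b; all curves share the same slope.
icc <- function(theta, b) plogis(theta - b)
theta <- seq(-3, 3, by = 0.1)      # latent "consumerism" in z-score units
easy  <- icc(theta, b = -2)        # popular logo: high P even at low theta
hard  <- icc(theta, b =  2)        # obscure logo: low P until high theta
icc(-2, b = -2)                    # 0.5: a low scorer has an even chance here
matplot(theta, cbind(easy, hard), type = "l", ylab = "P(correct)")
```

Because every curve shares the same slope, raising theta moves all the probabilities up their respective S-curves together, which is exactly the pattern described in the paragraph above.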
<br />
In this example the functional form has been specified, for I have plotted the item characteristic curves from the one-parameter Rasch model. However, a specific functional form is not required, and we could have used the R package <a href="http://arxiv.org/pdf/1211.1183.pdf">KernSmoothIRT</a> to fit a nonparametric model. The topology remains a unidimensional manifold, something similar to <a href="https://web.stanford.edu/~hastie/Papers/Principal_Curves.pdf">Hastie's principal curve</a> in the R package princurve. Because the term has multiple meanings, I should note that I am using "topology" in a limited sense to refer to the shape of the data and not as in <a href="https://www.ias.edu/about/publications/ias-letter/articles/2013-summer/lesnick-topological-data-analysis">topological data analysis</a>.<br />
<br />
To be clear, there must be powerful forces at work to constrain logo naming to a one-dimensional continuum. Sequential skills that build on earlier achievements can often be described by a low-dimensional manifold (e.g., learning descriptive statistics before attempting inference since the latter assumes knowledge of the former). We would have needed a different model had our brands been local so that higher shopping intensity would have produced greater familiarity only for those logos available in a given locality (e.g., country-specific brands without an international presence).<br />
<br />
<br />
<b>The Meaning of Brand Familiarity Depends on Brand Presence in Local Markets</b><br />
<br />
Now, it gets interesting. We started with players differentiated by a single parameter indicating how far they had traveled along a common consumption path. The path markers or sensors are the logos arrayed in decreasing popularity. Everyone shares a common environment with similar exposures to the same brand logos. Most have seen the McDonald's double-arcing M or the Nike swoosh because both brands have spent a considerable amount of money to buy market presence. On the other hand, Hilton's "blue H in the swirl" with less market presence would be recognized less often (fourth row and first column in the above brand logo table).<br />
<br />
But what if market presence and thus logo popularity depended on your local neighborhood? Even international companies have differential presence in different countries, as well as varying concentration within the same country. Spending and distribution patterns by national, regional and local brands create clusters of differential market presence. Everyone does not share a common logo exposure so that each cluster requires its own brand list. That is, consumers reside in localities with varying degrees of brand presence so that two individuals with identical levels of consumption intensity or consumerism would not be familiar with the same brand logos. Consequently, we need to add a second parameter to each individual's position along a path specific to their neighborhood. The psychometrician calls this differential item functioning (DIF), and <a href="http://www.r-bloggers.com/latent-variable-mixture-models-lvmm-decomposing-heterogeneity-into-type-and-intensity/">R provides a number of ways of handling the additional mixture parameter</a>.<br />
<br />
<br />
<b>Overlapping Audiences in the Marketplace of Attention</b><br />
<br />
You may have anticipated the next step as the topology becomes more complex. We began with one pathway marked with brand logos as our sensors. Then, we argued for a mixture model with groups of individuals living in different neighborhoods with different ordering of the brand logos. Finally, we will end by <a href="http://joelcadwell.blogspot.com/2015/11/mutually-exclusive-clusters-are-boxes.html">allowing consumers to belong to more than one neighborhood with whatever degree of belonging they desire</a>. We are describing the kind of fragmentation that occurs when consumers seize control and there is more available to them than they can attend to or consider. James Webster outlines this process of audience formation in his book <a href="https://mitpress.mit.edu/books/marketplace-attention">The Marketplace of Attention</a>.<br />
<br />
The topology has changed again. There are just too many brand logos, and unless it becomes a competitive game, consumers face diminishing returns from continued search and typically stop sooner rather than later. It helps that the market comes preorganized by providers trying to make the sale. Expert reviews and word of mouth guide the search. Yet it is the consumer who decides what to select from the seemingly endless buffet. In the process, an individual will see and remember only a subset of all possible brand logos. We need a new model, one that simultaneously sorts both rows and columns by grouping together consumers and the brand logos that they are likely to recognize.<br />
<br />
A heatmap may help to explain what can be accomplished when we search for <a href="http://joelcadwell.blogspot.com/2013/06/the-reorderable-data-matrix-and-promise.html">joint clusterings of the rows and columns</a> (also known as biclustering). Using an R package for nonnegative matrix factorization (NMF), I will simulate a data set with such a structure and show you the heatmap. Actually, I will display two heatmaps, one without noise so that you can see the pattern and a second with the same pattern but with added noise. Hopefully, the heatmap without noise will enable you to see the same pattern in the second heatmap with additional distortions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj28Ch5UoFufwGZ8MFbyYwW4s1daGwEyyg4w_H-PoHhO1QqU1GB5Y-jlrVjen4_UxDJm1J1F_e9VwR6p0bilSkvRnhvzJOMbpGjs_rD32hlxvvyS6m37CbJkqnkhborwQxJHcSdPn2dNow/s1600/syntheticNMF+no+noise.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="322" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj28Ch5UoFufwGZ8MFbyYwW4s1daGwEyyg4w_H-PoHhO1QqU1GB5Y-jlrVjen4_UxDJm1J1F_e9VwR6p0bilSkvRnhvzJOMbpGjs_rD32hlxvvyS6m37CbJkqnkhborwQxJHcSdPn2dNow/s640/syntheticNMF+no+noise.jpeg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8IfLzUMmoI9VRsmmHpCa8Ixl-Rdy-8RI6oaNGcN8bcCg4NPHKj-dv_08G-fD688rU1DGGEVIAMCkg3Wv2eYbKT4K0tvcLeZS2wh9FKC8NNlMYQDHxtfvxUfTYRCxGhG-_a2Om69j3sSI/s1600/syntheticNMF+noise.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="322" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8IfLzUMmoI9VRsmmHpCa8Ixl-Rdy-8RI6oaNGcN8bcCg4NPHKj-dv_08G-fD688rU1DGGEVIAMCkg3Wv2eYbKT4K0tvcLeZS2wh9FKC8NNlMYQDHxtfvxUfTYRCxGhG-_a2Om69j3sSI/s640/syntheticNMF+noise.jpeg" width="640" /></a></div>
<br />
I kept the number of columns at 36 for comparison with the first one-dimensional heatmap that you saw toward the beginning of this post. As before, blue is one, and red is zero. We discover enclaves or silos in the first heatmap without noise (polarization). The boundaries become fuzzier with random variation (fragmentation). I should note that you can see the biclusters in both heatmaps without reordering the rows and columns only because this is how the simulator generates the data. If you wish to see how this can be done with actual data, I have provided a set of links with the code needed to run a NMF in R at the end of my post on <a href="http://joelcadwell.blogspot.com/2015/03/brand-and-product-category.html">Brand and Product Category Representation</a>.<br />
<br />
Finally, although we speak of NMF as a form of simultaneous clustering, the cluster memberships are graded rather than all-or-none (soft vs. hard clustering). This yields a very flexible and expressive topology, which becomes clear when we review the three alternative representations presented in this post. First, we saw how some highly structured data matrices can be reproduced using a single dimension with rows and columns both located on the same continuum (IRT). Next, we asked if there might be discrete groups of rows with each row cluster having its own unique ordering of the columns (mixed IRT). Lastly, we sought a model of audience formation with rows and columns jointly collected into blocks with graded membership for both the rows and the columns (NMF).<br />
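The graded-membership idea can be sketched directly in base R by building the data matrix as a product of nonnegative factors, which is the structure behind such synthetic biclustered data (hypothetical block sizes; no NMF package is required for this illustration):

```r
# Sketch: soft biclustering as V = W %*% H with nonnegative factors.
# Rows of W hold each consumer's graded membership in three blocks;
# rows of H tie each block to its logos. A consumer may load on several
# blocks at once, unlike hard (all-or-none) clustering.
set.seed(3)
n <- 500
W <- matrix(runif(n * 3), n, 3)                    # graded row memberships
H <- rbind(c(rep(1, 10), rep(0, 26)),              # block 1: logos 1-10
           c(rep(0, 10), rep(1, 10), rep(0, 16)),  # block 2: logos 11-20
           c(rep(0, 20), rep(1, 16)))              # block 3: logos 21-36
V <- W %*% H                                       # 500 x 36 data matrix
dim(V)                                             # 500 36
all(V >= 0)                                        # TRUE: nonnegativity preserved
```

Fitting NMF to such a V recovers (up to scaling and permutation) the graded memberships in W, which is why the technique can assign a single consumer partial membership in several logo neighborhoods.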
<br />
Knowledge is organized as a single dimension when learning is formalized within a curriculum (e.g., a course at an educational institution) or accumulative (e.g., need to know addition before one can learn multiplication). However, <a href="http://joelcadwell.blogspot.com/2015/07/the-nature-of-heterogeneity-in.html">coevolving networks of customers and products</a> cannot be described by any one dimension or even a finite mixture of different dimensions. The Internet creates both microgenres and fragmented audiences that require their own topology.<br />
<br />
<b>R Code to Produce Figures in this Post</b><br />
<span style="color: #666666; font-family: monospace; font-style: italic;"><br /></span>
<span style="color: #666666; font-family: monospace; font-style: italic;"># use psych package to simulate latent trait data</span><br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/psych">psych</a><span style="color: #009900;">)</span>
logos<-sim.irt<span style="color: #009900;">(</span>nvar=<span style="color: #cc66cc;">36</span><span style="color: #339933;">,</span> n=<span style="color: #cc66cc;">500</span><span style="color: #339933;">,</span> mod=<span style="color: blue;">"logistic"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Sort data by both item mean</span>
<span style="color: #666666; font-style: italic;"># and person total score</span>
item<-<a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>logos$items<span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">)</span>
person<-<a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>logos$items<span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/sum"><span style="color: #003399; font-weight: bold;">sum</span></a><span style="color: #009900;">)</span>
logos$itemsOrd<-logos$items<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/order"><span style="color: #003399; font-weight: bold;">order</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/utils/person"><span style="color: #003399; font-weight: bold;">person</span></a><span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/order"><span style="color: #003399; font-weight: bold;">order</span></a><span style="color: #009900;">(</span>item<span style="color: #009900;">)</span><span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;"># create heatmap</span>
<span style="color: #666666; font-style: italic;"># may need to increase size of plots window in R studio</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/gplots">gplots</a><span style="color: #009900;">)</span>
heatmap.2<span style="color: #009900;">(</span>logos$itemsOrd<span style="color: #339933;">,</span> Rowv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> Colv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
dendrogram=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=redblue<span style="color: #009900;">(</span><span style="color: #cc66cc;">16</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
key=T<span style="color: #339933;">,</span> keysize=<span style="color: #cc66cc;">1.5</span><span style="color: #339933;">,</span> density.info=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/trace"><span style="color: #003399; font-weight: bold;">trace</span></a>=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> labRow=<span style="color: black; font-weight: bold;">NA</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/ltm">ltm</a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># two-parameter logistic model</span>
fit<-<a href="http://inside-r.org/packages/cran/ltm">ltm</a><span style="color: #009900;">(</span>logos$items ~ z1<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/summary"><span style="color: #003399; font-weight: bold;">summary</span></a><span style="color: #009900;">(</span>fit<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># item characteristic curves</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>fit<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># constrains slopes to be equal</span>
fit2<-rasch<span style="color: #009900;">(</span>logos$items<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a><span style="color: #009900;">(</span>fit2<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/summary"><span style="color: #003399; font-weight: bold;">summary</span></a><span style="color: #009900;">(</span>fit2<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/NMF">NMF</a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># generate a synthetic dataset with </span>
<span style="color: #666666; font-style: italic;"># 500 rows and three groupings of</span>
<span style="color: #666666; font-style: italic;"># columns (1-10, 11-20, and 21-36)</span>
n <- <span style="color: #cc66cc;">500</span>
counts <- <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">10</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">10</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">16</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># no noise</span>
V1 <- syntheticNMF<span style="color: #009900;">(</span>n<span style="color: #339933;">,</span> counts<span style="color: #339933;">,</span> noise=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #009900;">)</span>
V1<span style="color: #009900;">[</span>V1><span style="color: #cc66cc;">0</span><span style="color: #009900;">]</span><-<span style="color: #cc66cc;">1</span>
<span style="color: #666666; font-style: italic;"># with noise</span>
V2 <- syntheticNMF<span style="color: #009900;">(</span>n<span style="color: #339933;">,</span> counts<span style="color: #009900;">)</span>
V2<span style="color: #009900;">[</span>V2><span style="color: #cc66cc;">0</span><span style="color: #009900;">]</span><-<span style="color: #cc66cc;">1</span>
<span style="color: #666666; font-style: italic;"># produce heatmap with and without noise</span>
heatmap.2<span style="color: #009900;">(</span>V1<span style="color: #339933;">,</span> Rowv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> Colv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
dendrogram=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=redblue<span style="color: #009900;">(</span><span style="color: #cc66cc;">16</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
key=T<span style="color: #339933;">,</span> keysize=<span style="color: #cc66cc;">1.5</span><span style="color: #339933;">,</span> density.info=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/trace"><span style="color: #003399; font-weight: bold;">trace</span></a>=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> labRow=<span style="color: black; font-weight: bold;">NA</span><span style="color: #009900;">)</span>
heatmap.2<span style="color: #009900;">(</span>V2<span style="color: #339933;">,</span> Rowv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> Colv=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
dendrogram=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/col"><span style="color: #003399; font-weight: bold;">col</span></a>=redblue<span style="color: #009900;">(</span><span style="color: #cc66cc;">16</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
key=T<span style="color: #339933;">,</span> keysize=<span style="color: #cc66cc;">1.5</span><span style="color: #339933;">,</span> density.info=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/trace"><span style="color: #003399; font-weight: bold;">trace</span></a>=<span style="color: blue;">"none"</span><span style="color: #339933;">,</span> labRow=<span style="color: black; font-weight: bold;">NA</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-2394340097049104192015-12-02T10:53:00.000-08:002015-12-02T10:53:38.089-08:00The Statistician as Data Action Hero<blockquote class="tr_bq">
"If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."</blockquote>
<div style="text-align: right;">
Leo Breiman</div>
<div style="text-align: right;">
<a href="http://projecteuclid.org/euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a></div>
<br />
Breiman ends the opening abstract of his "manifesto" to the statistical community with the above call to action. As a statistics professor at Berkeley, he was not an outsider.<br />
<br />
Unfortunately, the warning went unheeded, and now the field laments, with the President of the American Statistical Association asking "<a href="http://magazine.amstat.org/blog/2013/07/01/datascience/">Aren't We Data Science?</a>". Interestingly, one of the comments objects to the editorial's suggestion that statisticians ought to learn R.<br />
<br />
Clearly, someone has missed the point. It is not what you call yourself, who signs your paycheck, or the size of your data sets. It is your openness to disruptive innovation and your desire to make a difference. R provides the interface that makes this possible.<br />
<br />
Breiman states it well at <a href="http://projecteuclid.org/download/pdf_1/euclid.ss/1009213290">the end of an interview from 2001</a>,<br />
<blockquote class="tr_bq">
So I think if I were advising a young person today, I would have some reservations about advising him or her to go into statistics, but probably, in the end, I would say, “Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.”</blockquote>
Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-590043897961646114.post-11686192638319599762015-11-24T11:21:00.000-08:002015-11-24T11:21:56.981-08:00Statistical Models That Support Design Thinking: Driver Analysis vs. Partial Correlation NetworksWe have been talking about design thinking in marketing since <a href="https://www.ideo.com/by-ideo/design-thinking-in-harvard-business-review">Tim Brown's Harvard Business Review article in 2008</a>. It might be easy for the data scientist to dismiss the approach as merely a type of brainstorming for new products or services. Yet, design issues do arise in <a href="http://had.co.nz/">data visualization</a> where we are concerned with communicating our findings. However, my interest is model selection: Should the analyst select one statistical model over another because the user might find it more helpful in planning interventions or designing new products and services?<br />
<br />
For example, the marketing manager who wants to retain current customers seeks guidance from customer satisfaction questionnaires filled with performance ratings and intentions to recommend or purchase again. Motivated by the desire to keep it simple, common practice tends to focus attention on only the most important "causes" of customer retention. As I noted in my first post, <a href="http://joelcadwell.blogspot.com/2012/07/network-visualization-of-key-driver.html">Network Visualization of Key Driver Analysis</a>, a more complete picture can be revealed by a correlation graph displaying all the interconnections among all the ratings. The edges or links are colored green or red so that we know if the relationship is positive or negative. The thickness of each edge indicates the strength of the correlation. But correlations measure total effects, both those that are direct and those obtained through associations with other ratings.<br />
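As a sketch of how such a correlation graph can be drawn, the qgraph package will plot a correlation matrix directly as a weighted network (here "ratings" is assumed to be the simulated survey data frame from the first post):

```r
# Sketch: correlation network of the ratings
# ('ratings' is assumed to be the simulated survey data frame)
library(qgraph)
# qgraph treats a correlation matrix as a weighted network:
# green edges = positive, red = negative, thickness = |r|
qgraph(cor(ratings), layout = "spring", minimum = 0.05)
```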
<br />
The designer of intervention strategies aimed at preventing churn could acquire additional insights from the partial correlation graph depicting the effects between all pairs of ratings controlling for all the other ratings in the model. While the correlation map reveals total effects, the partial correlation map removes all but the direct effects. The graph below was created using the R code from my first post to simulate a data set that mimics what is often found when airline passengers complete satisfaction surveys. Once the data were generated, the procedures outlined in my post <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">Undirected Graphs When the Causality is Mutual</a> were followed. The R code is listed at the end of this discussion.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Lqi1shku2RLanjxTMxAQA7HbjUP7iPsq-4Tr_kIOpOeqgqsmnPBbs6fNpmLRbxnKydh2Fn0Tqc_gXpzZwB0NERM2DVx8lIT0Fgk1SYPjCINEf4F-_n3hrZcP5nNfzEjt3Uo3uLOFUL0/s1600/Airline+Partial+Map.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3Lqi1shku2RLanjxTMxAQA7HbjUP7iPsq-4Tr_kIOpOeqgqsmnPBbs6fNpmLRbxnKydh2Fn0Tqc_gXpzZwB0NERM2DVx8lIT0Fgk1SYPjCINEf4F-_n3hrZcP5nNfzEjt3Uo3uLOFUL0/s1600/Airline+Partial+Map.jpg" /></a></div>
<br />
We can pick any node, such as the one labeled "Satisfaction" in the middle of the right-hand side of the figure. A simple way of interpreting this graph is to think of Satisfaction as the dependent variable and the lines radiating from Satisfaction as the weights obtained from the regression of this node on the other 14 ratings. Clearly, overall satisfaction serves as an inclusive summary measure with so many pathways from so many other nodes. Each of the four customer service ratings (below Satisfaction and in light pink) adds its own unique contribution with the greatest impact indicated by the thickest green edge from the Service node. Moreover, Easy Reservation, Ticket Price, and Clean Aircraft (with room for people and baggage) each make incremental improvements in Satisfaction.<br />
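The link between partial correlations and regression weights can be verified directly. A minimal sketch, again assuming the simulated "ratings" data frame: the negated, rescaled inverse of the correlation matrix yields the partial correlations, and each row is proportional to the standardized coefficients from regressing that variable on all the others.

```r
# Sketch: partial correlations from the inverse correlation matrix
# ('ratings' is the simulated survey data frame from the earlier post)
R <- cor(ratings)
P <- -cov2cor(solve(R))  # off-diagonal entries are partial correlations
diag(P) <- 1
# One row of P, e.g. the Satisfaction row, is proportional to the
# standardized weights from regressing Satisfaction on the other ratings
```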
<br />
The same process can be repeated for any node. Instead of a driver analysis that narrows our thinking to a single dependent variable and its highest regression weights, the partial correlation map opens us to the possibilities. If the goal was customer retention, then the focus would be on the Fly Again node. Recommend seems to have the strongest link to Fly Again. Can the airline induce repeat purchase by encouraging recommendation? What if frequent flyer miles were offered when others entered your name as a recommender? Such a proposal may not be practical in its current form, but the graph supports this type of design thinking.<br />
<br />
Because there are no direct paths from the four service nodes to Fly Again, a driver analysis would miss the indirect connection through Satisfaction. And what of this link between Courtesy and Easy Reservation? Do customers infer a "friendly" personality trait that links their perceptions of the way they are treated when they buy a ticket and when they board the plane? Design thinkers would entertain such a possibility and test the hypothesis. Such "cascaded inferences" fill the graph for those willing to look. Perhaps many small and less costly improvements might combine to have a greater impact than concentrating on a single aspect? Encouraging passengers to check their bags would create more overhead storage without reconfiguring the airplane. Let the design thinking begin!<br />
<br />
With a driver analysis, the discussion ends once "the most important" driver has been identified. The network, on the other hand, invites creative thought. <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">Isn't this the point of data science?</a> What can we learn from the data? The answer is a good deal more than can be revealed by the largest coefficient in a single regression equation.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># Calculates Sparse Partial Correlation Matrix</span>
sparse_matrix<-EBICglasso<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cor"><span style="color: #003399; font-weight: bold;">cor</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/Ratings">ratings</a><span style="color: #009900;">)</span><span style="color: #339933;">,</span> n=<span style="color: #cc66cc;">1000</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span>sparse_matrix<span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Plots results</span>
gr<-<a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span>:<span style="color: #cc66cc;">4</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span>:<span style="color: #cc66cc;">8</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">9</span>:<span style="color: #cc66cc;">12</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">13</span>:<span style="color: #cc66cc;">15</span><span style="color: #009900;">)</span>
node_color<-<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"lightgoldenrod"</span><span style="color: #339933;">,</span><span style="color: blue;">"lightgreen"</span><span style="color: #339933;">,</span><span style="color: blue;">"lightpink"</span><span style="color: #339933;">,</span><span style="color: blue;">"cyan"</span><span style="color: #009900;">)</span>
qgraph<span style="color: #009900;">(</span>sparse_matrix<span style="color: #339933;">,</span> fade = <span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span> groups=gr<span style="color: #339933;">,</span>
color=node_color<span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/Ratings">ratings</a><span style="color: #009900;">)</span><span style="color: #339933;">,</span> label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #339933;">,</span> edge.width=<span style="color: #cc66cc;">.25</span><span style="color: #339933;">,</span> minimum=<span style="color: #cc66cc;">.05</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-590043897961646114.post-60842368350059385712015-11-08T14:08:00.000-08:002015-11-08T14:08:58.102-08:00Mutually Exclusive Clusters Are Boxes within Which Consumers No Longer FitSometimes we force our categories to be mutually exclusive and exhaustive even as the boundaries are blurring rapidly.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLy60rSPWHItNR1X3tpHHxiR8KbriFjHEdxIlwQQUNKq3q8z7nrTXbS4r1V7Z7Um9lj7FY7WJ-beB1wtGHDfIBkMHJh3MY-gc_ZVI9Av8TtXLIadHNmSvI-96oyfbYjybE9QI1eAj9eu8/s1600/No+Men+in+Women%2527s+Bathroom.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="223" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLy60rSPWHItNR1X3tpHHxiR8KbriFjHEdxIlwQQUNKq3q8z7nrTXbS4r1V7Z7Um9lj7FY7WJ-beB1wtGHDfIBkMHJh3MY-gc_ZVI9Av8TtXLIadHNmSvI-96oyfbYjybE9QI1eAj9eu8/s400/No+Men+in+Women%2527s+Bathroom.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Of course, I am speaking of cluster analysis and whether it makes sense to force everyone into one and only one of a set of discrete boxes. Diversity is diverse and requires a more expressive representation than is possible in <a href="https://en.wikipedia.org/wiki/Twenty_Questions">a game of twenty questions</a>. "Is it this or that?" is inadequate when it is a little of this and a lot of that.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
How do you classify a love seat? Is it a small sofa or a large chair for two people? Natural categories are not defined by all-or-none criteria. Not all birds possess the same degree of "birdness" (as shown below for <a href="https://en.wikipedia.org/wiki/Cognitive_semantics">the graded structure underlying the classification of birds</a>). Some birds are more "bird" than other birds, and some mammals (bats) might be thought of as birds because they look and behave more like birds than typical mammals.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqF7XXMVsJqnX8rpO7cfqLB3lyJVLCumt5JmoxcNtY19YRac9CvmUdGmSLbjzK_D1GZ3bUdTEVONbkvOkxzBwJEF7FBvl_5vtbREI9yW1-ZaVmlG82kGHnnYqR2tDhRoTqx6uOcGx6myE/s1600/Prototype_membership.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqF7XXMVsJqnX8rpO7cfqLB3lyJVLCumt5JmoxcNtY19YRac9CvmUdGmSLbjzK_D1GZ3bUdTEVONbkvOkxzBwJEF7FBvl_5vtbREI9yW1-ZaVmlG82kGHnnYqR2tDhRoTqx6uOcGx6myE/s1600/Prototype_membership.gif" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
In an earlier post, I issued a warning that <a href="http://joelcadwell.blogspot.com/2014/03/warning-clusters-may-appear-more_23.html">clusters may appear more separated in textbooks than in practice</a>. I urged that we consider other representations for individual variation. Archetypes work because they reside at the periphery of the objects to be described with all the other species in-between (e.g., birds of prey, household pets in cages, winter migrants, evil dark birds, and white birds of peace). In marketing this is the realm of fans, fanatics and -philes in which it is so easy to visualize the extreme users and name everyone else as hybrid combinations of pure types. <a href="http://joelcadwell.blogspot.com/2012/07/archetypal-analysis.html">R makes the analysis doable.</a> Moreover, with dimensions defined as contrasting ideals (liberal vs. conservative), <a href="http://joelcadwell.blogspot.com/2014/12/archetypal-analysis-similarity-defined.html">archetypal analysis mimics hidden dimensions</a> such as those from factor analysis and item response theory.</div>
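A minimal sketch of such an analysis with the archetypes package is shown below; the data matrix X is a hypothetical numeric matrix of respondents by ratings, and three archetypes are an arbitrary choice for illustration.

```r
# Sketch: archetypal analysis with the archetypes package
# (X is a hypothetical numeric matrix of respondents by ratings)
library(archetypes)
aa <- archetypes(X, k = 3)  # fit three archetypes at the data's periphery
parameters(aa)              # the three archetypal profiles
head(coef(aa))              # each respondent as a convex mixture of them
```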
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>The Forces behind Diversity Do Not Yield Disjoint Clusters</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I have found it best to begin with the forces generating diversity. Why doesn't one size fit all? As a marketer, I look toward consumer demand - whether it originates out of individual usage, preference or need or whether it is manufactured by providers introducing new products and services. I discover that demand is seldom contained within a single box. Some cable television viewers want to watch lots of sports, so let us place them in the sports segment. But they also want movies-on-demand, so we need four segments filling in the 2x2 for sports and movies-on-demand. Hopefully, they do not also want the business news, because then we would need 8 segments in our 2x2x2. As the consumer acquires greater control, the old segmentation scheme seems more and more forced as if we are holding onto a simpler world with everyone crammed into one of only a few silos.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Recommendation systems adopt a different metaphor with the marketplace partitioned by user collaboration and the chunking of offerings into micro-genres. <a href="http://joelcadwell.blogspot.com/2015/07/the-nature-of-heterogeneity-in.html">Heterogeneity is seen as coevolving networks of consumers and what they buy.</a> A handful of mutually exclusive boxes will not work in a market that is increasingly fragmenting.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
When there are so many alternatives within easy reach of the internet, no consumer can attend to it all. The rows in our data matrix, each containing information from a single individual, become both longer with more options and sparser with limited attention. If we want to play in <a href="https://books.google.com/books?id=YBJmBAAAQBAJ&source=gbs_navlinks_s">the marketplace of attention</a>, we will need more than K-means or finite mixture models. Researchers will require even more - the type of easy access provided by R packages such as archetypes and NMF.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
With the appropriate statistical model, one can uncover such generating processes from the data matrix with consumers as rows and what they want or like in the columns. Using figurative language I have called this approach the "<a href="http://joelcadwell.blogspot.com/2014/09/the-ecology-of-data-matrices-metaphor.html">ecology of data matrices</a>" and have suggested the need for biclustering. Yet, there is opposition since we are so accustomed to dividing the analysis of rows and columns into two separate procedures. Most cluster analyses input all the columns to calculate distances among the rows. Factor analysis starts with column correlations computed from data including every row. Biclustering, on the other hand, cares about sorting the cells into simultaneous groupings of row-column combinations. The data matrix gets divided into subspaces, possibly overlapping, with each community of consumers similar only on its own grouping of variables.</div>
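One way to perform such a joint analysis of rows and columns is nonnegative matrix factorization, as in the NMF code from an earlier post. A sketch, assuming a nonnegative consumers-by-items matrix V (for example, the synthetic matrix generated in that post):

```r
# Sketch: NMF as a soft biclustering of a nonnegative data matrix V
# (V is hypothetical, e.g. the synthetic matrix from the NMF example)
library(NMF)
res <- nmf(V, rank = 3)  # three row-by-column blocks
W <- basis(res)          # consumers' loadings on the blocks
H <- coef(res)           # items' loadings on the blocks
```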
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The simplicity of boxes will not work with consumers in control and more available than any one buyer can attend to or know about. An underlying structure remains, but one defined by the joint interaction of rows and columns. Consumers with common needs and experiences are attracted to the same purchase channels and learn about offerings from the same sources. This simultaneous clustering of rows and columns yields <a href="http://joelcadwell.blogspot.com/2014/11/building-blocks-compelling-image-for.html">the blocks from which consumers customize their own personal consumption patterns</a>. Nothing forces the consumer to select only one building block. In fact, the opposite is more generally true since most of us play multiple roles (e.g., items purchased for work and play, self and others, necessities and gifts, and the list goes on). To capture such common practices, we need a clustering technique that does not impose a simplistic representation forcing consumers into boxes within which they no longer fit.</div>
Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-590043897961646114.post-39197767674644333812015-11-01T15:13:00.000-08:002015-11-01T15:13:40.317-08:00Clustering Customer Satisfaction Ratings<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbFhbFhQruwG7RvuLfmvd2t0BsDCRPmp9PnvmguE46Nl5g2sxK5vKfS1UcptgqcEoeqwULLdAXBZUIizBu1TGmdJ_lRn9Ax6vtdVICjPetz7J1HFGi6QSg4HBpFo3k9gUFYOkVzuGURbQ/s1600/Love+Me+Little.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbFhbFhQruwG7RvuLfmvd2t0BsDCRPmp9PnvmguE46Nl5g2sxK5vKfS1UcptgqcEoeqwULLdAXBZUIizBu1TGmdJ_lRn9Ax6vtdVICjPetz7J1HFGi6QSg4HBpFo3k9gUFYOkVzuGURbQ/s1600/Love+Me+Little.jpeg" /></a></div>
We run our cluster analysis with great expectations, hoping to uncover diverse segments with contrasting likes and dislikes of the brands they use. Instead, too often, our K-means analysis returns the above graph of parallel lines, indicating that the pattern of high and low ratings is the same for everyone, just at different overall levels. The data come from the R package semPLS and look very much like what one sees with many customer satisfaction surveys.<br />
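A sketch of the kind of analysis behind such a figure, assuming the mobi customer satisfaction ratings that ship with semPLS (the exact item set and the choice of five clusters are my assumptions):

```r
# Sketch: K-means on customer satisfaction ratings, then profile lines
# for the cluster centers; parallel lines suggest segments that differ
# in overall level but not in pattern
library(semPLS)
data(mobi)  # mobile phone customer satisfaction ratings (assumed here)
km <- kmeans(mobi, centers = 5, nstart = 25)
matplot(t(km$centers), type = "l", lty = 1,
        xlab = "items ordered by mean", ylab = "cluster mean rating")
```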
<br />
I will not cover any specifics about the data, but instead refer you to earlier discussions of this dataset, first in <a href="http://joelcadwell.blogspot.com/2014/05/customer-satisfaction-and-loyalty.html">a post showing its strong one-dimensional structure using biplots</a> and later in <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">an example of an undirected graph or Markov network displaying brand associations</a>.<br />
<br />
We will begin with the mean ratings for the four lines in the above graph and include a relatively small fifth segment in the last column with a different narrative. Ordering the 23 items from lowest to highest mean scores over the entire sample makes both the table below and the graph above easier to read.<br />
<br />
<div align="center">
<table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="border-collapse: collapse; mso-padding-alt: 0in 5.4pt 0in 5.4pt; mso-yfti-tbllook: 1184; width: 481px;">
<tbody>
<tr style="height: 15.0pt; mso-yfti-firstrow: yes; mso-yfti-irow: 0;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" valign="bottom" width="161"></td>
<td style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
not at all<o:p></o:p></div>
</td>
<td style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
a little<o:p></o:p></div>
</td>
<td style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
some<o:p></o:p></div>
</td>
<td style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
a lot<o:p></o:p></div>
</td>
<td style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
pricey<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 1;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9%<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
27%<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
34%<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
21%<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
10%<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 2;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"></td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"></td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 3;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">FairPrice<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.6<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 4;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">BuyAgain<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
3.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
3.8<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 5;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Responsible<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.0<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 6;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">GoodValue<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.0<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 7;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">ComplaintHandling<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.8<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 8;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Fulfilled<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.5<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.8<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 9;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">IsIdeal<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.8<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 10;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">NetworkQuality<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.0<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 11;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Recommend<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
3.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.2<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 12;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">ClearInfo<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.5<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.6<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 13;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Concerned<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.1<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 14;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">QualityExp<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.9<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 15;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">CustomerService<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
4.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.5<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 16;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">MeetNeedsExp<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.4<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 17;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">GoWrongExp<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.5<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.5<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.5<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 18;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Trusted<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.6<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.4<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 19;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Innovative<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
5.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.2<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 20;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Reliability<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.7<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 21;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">RangeProdServ<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.3<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 22;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">Stable<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.7<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.8<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.3<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 23;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">OverallQuality<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.0<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.5<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 24;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">ServiceQuality<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.1<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
9.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.6<o:p></o:p></div>
</td>
</tr>
<tr style="height: 15.0pt; mso-yfti-irow: 25; mso-yfti-lastrow: yes;">
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 121.0pt;" width="161"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
<span style="font-family: 'Lucida Console'; font-size: 10pt;">OverallSat<o:p></o:p></span></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
6.4<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
7.3<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.2<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.9<o:p></o:p></div>
</td>
<td nowrap="" style="height: 15.0pt; padding: 0in 5.4pt 0in 5.4pt; width: 48.0pt;" valign="bottom" width="64"><div align="center" class="MsoNormal" style="margin-bottom: 0.0001pt; text-align: center;">
8.7<o:p></o:p></div>
</td>
</tr>
</tbody></table>
</div>
<br />
You can pick any row in this table and see that the first four segments, which account for 90% of the customers, are ordered the same way. The first cluster is simply not at all happy with its mobile phone provider, giving the lowest Buy Again and Recommend ratings. In fact, with only two small exceptions, its members uniformly give the lowest scores. In every row the second column is larger (note the two discrepancies just mentioned), followed by an even larger third column and then the most favorable fourth column. Successful brands have loyal customers, and at least one out of five customers in these data loves "a lot," with mean ratings of 9.7 on a 10-point scale.<br />
<br />
You can see why I labeled these four segments with names suggesting differing levels of attraction. Each group has the same profile, as can be seen in the largely parallel lines on our graph. The good news for our providers is that only 9% are definitely at risk. The bad news is that another 10% like the product and the service but will not buy again, perhaps because the price is not perceived as fair (see their graph below with a dip for the second variable, Buy Again, and a much lower score than expected for the first variable, Fair Price, given the elevation of the rest of the curve).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLSu2T0hFOBhNQymfgxB6Qeu5aRiE5sBDtN0fV7mkLG1i2LAXcSZNRL1jQntB-h4hRL0PP9iU0L0MVcxOoyFJGMmDNa1mSYWCXvxtqDIPUStv8Qgz9_DY8K65E5Qdzp_fw-yzSz1ArzTM/s1600/Got+to+Switch.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLSu2T0hFOBhNQymfgxB6Qeu5aRiE5sBDtN0fV7mkLG1i2LAXcSZNRL1jQntB-h4hRL0PP9iU0L0MVcxOoyFJGMmDNa1mSYWCXvxtqDIPUStv8Qgz9_DY8K65E5Qdzp_fw-yzSz1ArzTM/s1600/Got+to+Switch.jpeg" /></a></div>
<br />
<br />
Some might argue that what we are seeing is merely a measurement bias reflecting a propensity among raters to use different portions of the scale. Does this mean that 90% of the customers have identical experiences but give different ratings due to some scale-usage predisposition? If it is a personality trait, does this mean that they use the same range of scale values to rate every brand and every product? Would we have seen individuals using the same narrow range of scores had the items been more specific and more likely to show variation, for example, if the survey had asked about dropped calls and dead zones rather than network quality?<br />
<br />
Given questions without any concrete referent, the uniform patterns of high and low ratings across the items are shaped by a network of interconnected perceptions resulting from a common technology and a shared usage of that technology. In addition, one overhears a good deal of discussion about the product category in the media and through word of mouth, so that even a nonuser might be aware of the pros and cons. As a result, we tend to find a common ordering of ratings with some customers loving it all "a lot" and others "not at all." Unless customers can provide a narrative (e.g., "I like the product and service, but it costs too much"), they will all reproduce the same profile of strengths and weaknesses at varying levels of overall happiness. That is, satisfied or not, almost everyone seems to rate value and price fairness lower than they score overall quality and satisfaction.<br />
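One way to probe the scale-usage explanation is a quick simulation; the data below are made up for illustration and are not the mobi ratings used elsewhere in this post. If segment differences were nothing more than elevation, removing each respondent's mean rating should leave every row with the same item profile.

```r
# Hypothetical data: a common item profile plus a segment-specific elevation
set.seed(42)
item_profile <- c(6, 7, 8, 9, 5)                  # shared shape across 5 items
elevation <- rep(c(0, 1, 2, 3), each = 25)        # four "scale-usage" segments
ratings <- t(sapply(elevation,
                    function(e) item_profile + e + rnorm(5, sd = 0.2)))
# remove each respondent's own mean (their scale elevation)
centered <- sweep(ratings, 1, rowMeans(ratings))
# after centering, rows collapse onto one profile, so the column spread shrinks
apply(centered, 2, sd) < apply(ratings, 2, sd)
```

If real ratings behaved this way, the four cluster profiles would be indistinguishable after row-centering; segments that remain distinct after centering differ in shape, not just in scale usage.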
<br />
Finally, my two prior posts may seem to paint a somewhat contradictory picture of customer satisfaction ratings. On the one hand, we are likely to find <a href="http://joelcadwell.blogspot.com/2014/05/customer-satisfaction-and-loyalty.html">a strong first principal component indicating the presence of a single dimension underlying all the ratings</a>. Customer satisfaction tends to be one-dimensional, so we might expect to observe the four clusters with parallel lines of ratings. Satisfaction falls for everyone as features and services become more difficult for any brand to deliver. On the other hand, <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">the graph of the partial correlations</a> suggests a network of interconnected pairs of ratings after controlling for all the remaining items. One can identify regions with stronger relationships among items measuring quality, product offering, corporate citizenship, and loyalty.<br />
<br />
Both appear to be true. Ratings with the highest partial intercorrelations form local neighborhoods with thicker edges in our undirected graph. Although some nodes are more closely related than others, all the variables are still connected, either directly with a pairwise edge or indirectly through a separating node. Everything is correlated, but some pairs are more correlated than others.<br />
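For readers who want the mechanics, partial correlations can be read off the inverse of the correlation matrix. This is a minimal base-R sketch with simulated data (three generic variables, not the mobi items): two uncorrelated variables acquire a strong negative partial correlation once we condition on a third variable they jointly produce.

```r
# Simulate: x3 depends on x1 and x2, which are themselves uncorrelated
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
x[, 3] <- x[, 1] + x[, 2] + rnorm(100, sd = 0.5)
R <- cor(x)
P <- solve(R)                                   # precision (inverse correlation) matrix
# rescale the precision matrix to partial correlations
partial <- -P / sqrt(diag(P) %o% diag(P))
diag(partial) <- 1
round(partial, 2)
```

The off-diagonal entries of `partial` are what the thick and thin edges in the undirected graph display: pairwise relationships after controlling for every other rating.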
<br />
<b>R code needed to reproduce these tables and plots.</b><br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"semPLS"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># descriptive names for graph nodes</span>
<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span><-<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"QualityExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"MeetNeedsExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoWrongExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallSat"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Fulfilled"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"IsIdeal"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ComplaintHandling"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"BuyAgain"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"SwitchForPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Recommend"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Trusted"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Stable"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Responsible"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Concerned"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Innovative"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"NetworkQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"CustomerService"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ServiceQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"RangeProdServ"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Reliability"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ClearInfo"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"FairPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoodValue"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># kmeans with 5 clusters and 25 random starts</span>
kcl5<-<a href="http://inside-r.org/r-doc/stats/kmeans"><span style="color: #003399; font-weight: bold;">kmeans</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span> nstart=<span style="color: #cc66cc;">25</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># cluster profiles and sizes</span>
cluster_profile<-<a href="http://inside-r.org/r-doc/base/t"><span style="color: #003399; font-weight: bold;">t</span></a><span style="color: #009900;">(</span>kcl5$centers<span style="color: #009900;">)</span>
cluster_size<-kcl5$size
<span style="color: #666666; font-style: italic;"># row and column means</span>
row_mean<-<a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>cluster_profile<span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">)</span>
col_mean<-<a href="http://inside-r.org/r-doc/base/apply"><span style="color: #003399; font-weight: bold;">apply</span></a><span style="color: #009900;">(</span>cluster_profile<span style="color: #339933;">,</span> <span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/mean"><span style="color: #003399; font-weight: bold;">mean</span></a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Cluster profiles ordered by row means</span>
<span style="color: #666666; font-style: italic;"># columns sorted so that 1-4 are increasing means</span>
<span style="color: #666666; font-style: italic;"># and the last column has low only for buyagain & fairprice</span>
<span style="color: #666666; font-style: italic;"># Warning: random start values likely to yield different order</span>
sorted_profile<-cluster_profile<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/order"><span style="color: #003399; font-weight: bold;">order</span></a><span style="color: #009900;">(</span>row_mean<span style="color: #009900;">)</span><span style="color: #339933;">,</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">4</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">)</span><span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;"># reordered cluster sizes and profiles</span>
cluster_size<span style="color: #009900;">[</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">4</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">)</span><span style="color: #009900;">]</span>/<span style="color: #cc66cc;">250</span>
<a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span>sorted_profile<span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># plots for first 4 clusters</span>
<a href="http://inside-r.org/r-doc/graphics/matplot"><span style="color: #003399; font-weight: bold;">matplot</span></a><span style="color: #009900;">(</span>sorted_profile<span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">5</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> type = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"b"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> pch=<span style="color: blue;">"*"</span><span style="color: #339933;">,</span> lwd=<span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span>
xlab=<span style="color: blue;">"23 Brand Ratings Ordered by Average for Total Sample"</span><span style="color: #339933;">,</span>
ylab=<span style="color: blue;">"Average Ratings for Each Cluster"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Loves Me Little, Some, A Lot, Not At All"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># plot of last cluster</span>
<a href="http://inside-r.org/r-doc/graphics/matplot"><span style="color: #003399; font-weight: bold;">matplot</span></a><span style="color: #009900;">(</span>sorted_profile<span style="color: #009900;">[</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #009900;">]</span><span style="color: #339933;">,</span> type = <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"b"</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> pch=<span style="color: blue;">"*"</span><span style="color: #339933;">,</span> lwd=<span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span>
xlab=<span style="color: blue;">"23 Brand Ratings Ordered by Average for Total Sample"</span><span style="color: #339933;">,</span>
ylab=<span style="color: blue;">"Average Ratings for Last Cluster"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/graphics/title"><span style="color: #003399; font-weight: bold;">title</span></a><span style="color: #009900;">(</span><span style="color: blue;">"Got to Switch, Costs Too Much"</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-79953885488162842452015-10-28T10:23:00.000-07:002015-10-28T10:23:45.328-07:00Graphical Modeling Mimics Selective Attention: Customer Satisfaction Ratings<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpiuh0TwwvpW4DtLQTVtCOK18vQF8RFyhXS-w1catKCFWR8A3lTSwmGbugFdqJY7T5pFJw8KWjIpUr9ATY6SiNUp_aAUYGIvUB_yk3QzCBsubACGjP2KceIyq_D8k4N1jjR_InoEMkEos/s1600/TobiiPro_web_usability_gazeplot_from_Studio_softwa.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpiuh0TwwvpW4DtLQTVtCOK18vQF8RFyhXS-w1catKCFWR8A3lTSwmGbugFdqJY7T5pFJw8KWjIpUr9ATY6SiNUp_aAUYGIvUB_yk3QzCBsubACGjP2KceIyq_D8k4N1jjR_InoEMkEos/s640/TobiiPro_web_usability_gazeplot_from_Studio_softwa.jpg" width="640" /></a></div>
<br />
As shown by the <a href="http://www.tobii.com/group/about/this-is-eye-tracking/" style="text-align: center;">eye tracking</a> lines and circles, there is more in the above screenshot than we can process simultaneously. Visual perception takes time, and we must track where the eye focuses by recording sequence and duration. The "50% off" and the menu items seem to draw the most attention, suggesting that the viewers were not men.<br />
<br />
<b>But what if the screen contained a correlation matrix?</b><br />
<br />
The 23 mobile phone <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">customer satisfaction ratings from an earlier post</a> will serve as an illustration. The R code to access the data, calculate the correlation matrix and produce the graph can be found at the end of this blog entry.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC7kW7rnRbxAKQULZnTTcJKMzSxJcEBBvvGXe2skupP2XZLWmJMUDoywzBJdL7KWre-9H6PjVv_9ZK7C0nQavibh7SnaCZ1tDzwOerf88ySneH-wVPKT5lIYlUVi3dhvnIKfzcgvy_ynI/s1600/mobi+correlation+matrix.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC7kW7rnRbxAKQULZnTTcJKMzSxJcEBBvvGXe2skupP2XZLWmJMUDoywzBJdL7KWre-9H6PjVv_9ZK7C0nQavibh7SnaCZ1tDzwOerf88ySneH-wVPKT5lIYlUVi3dhvnIKfzcgvy_ynI/s1600/mobi+correlation+matrix.jpg" /></a></div>
<br />
All the correlations are positive, so we might tend to focus on the most highly correlated pairs and then search for triplets with uniformly larger intercorrelations. Although 23x23 is not a particularly big matrix, it contains enough entries (253 unique correlations) that uncovering a pattern is a difficult task.<br />
<br />
Factor analysis is an option, yet it might impose more structure than desired. What if we believe that "overall quality" is an abstraction created from perceptions of the reliability and stability of the network and supportive services, rather than the ratings being indicators that reflect the hidden presence of a latent quality dimension? That is, we want to maintain all those individual ratings and their separate pairwise connections as shown in the correlation matrix. Well, a graph might assist those of us with a short span of attention, for example, an undirected graph whose nodes are the individual ratings and whose edges represent the correlations between pairs of ratings.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP2JbcAPT1CzwVmTKKbbp11I9h1Szu9gOTdacu681aKurTs681uNH73q9GwOb4-UxJdAxyFjOFDgNho-L3bvZX2Jic-aHxuI4vgqdhspGtuCbZN6VW6sp1v0Df84PrW5Pryt6jI31dIxg/s1600/mobi+correlation+map.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP2JbcAPT1CzwVmTKKbbp11I9h1Szu9gOTdacu681aKurTs681uNH73q9GwOb4-UxJdAxyFjOFDgNho-L3bvZX2Jic-aHxuI4vgqdhspGtuCbZN6VW6sp1v0Df84PrW5Pryt6jI31dIxg/s1600/mobi+correlation+map.jpg" /></a></div>
<br />
The green edges indicate positive values, and the largest correlations have the thickest paths. Though everything is interconnected, the graph places nodes with the strongest connections next to one another. We could track your eye movements as you attempt to discover some spatial organization: overall satisfaction centered amidst quality and service with retention and recommendation pulled toward good value and fair price. Alternatively, you might have noted four regions: quality toward the right, innovative and range of products/services near the top, service and support on the top-right side, and the final word on value and loyalty in the bottom-right corner.<br />
<br />
Hopefully, the eye tracking analogy clarifies that the interpretative process involved in making sense of the graph mimics the graphical modeling that factors or decomposes a complex network into groupings of local relationships. There are just too many pairwise relationships for the graph or the person to assimilate them all in a single glance. Selective attention deconstructs the perceptual field, in this case, mobile phone customer experiences with their cellular providers.<br />
<br />
Of course, not all decompositions are equally helpful for customers deciding whether or not to continue with their current product and provider. We must remember that our consumer is not alone and that product purchase is not a solitary quest. In order to understand product reviews, marketing communications and user comments, the consumer tends to adopt the prevailing factorization shared by most in the market and shown in the above graph.<br />
<br />
Finally, we are not choosing a factor analysis model because we do not believe in a directed graphical representation with hidden latent constructs generating the satisfaction ratings. To be clear, we could have run a factor analysis and identified factors. The factors, however, would be derivative and not generative. The undirected graph preserves the primacy of the separate ratings and represents the factors in the edges as local regions of higher connectivity.<br />
<br />
<br />
<b>R code to read the data, print the correlation matrix, and plot the correlation network map.</b><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"semPLS"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># descriptive names for graph nodes</span>
<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span><-<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"QualityExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"MeetNeedsExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoWrongExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallSat"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Fulfilled"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"IsIdeal"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ComplaintHandling"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"BuyAgain"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"SwitchForPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Recommend"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Trusted"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Stable"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Responsible"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Concerned"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Innovative"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"NetworkQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"CustomerService"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ServiceQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"RangeProdServ"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Reliability"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ClearInfo"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"FairPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoodValue"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># prints the correlation matrix</span>
<a href="http://inside-r.org/r-doc/base/round"><span style="color: #003399; font-weight: bold;">round</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cor"><span style="color: #003399; font-weight: bold;">cor</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># plots the correlation network map</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"qgraph"</span><span style="color: #009900;">)</span>
qgraph<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cor"><span style="color: #003399; font-weight: bold;">cor</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #339933;">,</span> minimum=<span style="color: #cc66cc;">.3</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-34845096241296195002015-10-13T14:19:00.000-07:002015-10-13T14:19:37.551-07:00The Network Underlying Consumer Perceptions of the European Car Market<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3NxrAXCyYA7xTOka_pMrjWdTvVSEKI8ycZgK7c_lj0nfwNHzAG-TszC2RNsIS09xAdOPLh49SSk_W7XzgfMlCOmaNhvKpdhyyhSvjC3D5HLnuMbzRWpoqq2I3GxM-inuF2ffMMhlUp3Q/s1600/European+car+perception.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3NxrAXCyYA7xTOka_pMrjWdTvVSEKI8ycZgK7c_lj0nfwNHzAG-TszC2RNsIS09xAdOPLh49SSk_W7XzgfMlCOmaNhvKpdhyyhSvjC3D5HLnuMbzRWpoqq2I3GxM-inuF2ffMMhlUp3Q/s1600/European+car+perception.jpg" /></a></div>
<br />
The nodes have been assigned a color by the author so that the underlying distinctions are more pronounced. Cars that are perceived as Economical (in aquamarine) are not seen as Sporty or Powerful (in cyan). The red edges connecting these attributes indicate negative relationships. Similarly, a Practical car (in light goldenrod) is not Technically Advanced (in light pink). This network of feature associations replicates both the economical to luxury and the practical to advanced differentiations so commonly found in the car market. North Americans living in the suburbs may need to be reminded that Europe has many older cities with less parking and narrower streets, which explains the inclusion of the city focus feature.<br />
<br />
The data come from the R package plfm, as I explained in <a href="http://joelcadwell.blogspot.com/2014/06/the-unavoidable-instability-of-brand.html">an earlier post where I ran a correspondence analysis</a> using the same dataset and where I described the study in more detail. The input to the correspondence analysis was a cross tabulation of the number of respondents checking which of the 27 features (the nodes in the above graph) were associated with each of 14 different car models (e.g., Is the VW Golf Sporty, Green, Comfortable, and so on?).<br />
<br />
I will not repeat those details, except to note that the above graph was not generated from a car-by-feature table with 14 car rows and 27 feature columns. Instead, as you can see from the R code at the end of this post, I reformatted the original long vector with 29,484 binary entries and created a data frame with 1092 rows, a stacking of the 14 cars rated by each of the 78 respondents. The 27 columns, on the other hand, remain binary yes/no associations of each feature with each car. One can question the independence of the 1092 rows given that respondent and car are grouping factors with nested observations. However, we will assume, in order to illustrate the technique, that cars were rated independently and that there is one common structure for the 14-car European market. Now that we have the data matrix, we can move on to the analysis.<br />
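A minimal sketch of that reshaping step, using simulated binary entries in place of the plfm data; the element ordering assumed here (features varying fastest within each respondent-by-car combination) is an illustration only, so check the package documentation before relying on it.

```r
# Stand-in for the plfm long vector of 29,484 binary associations
n_resp <- 78; n_car <- 14; n_feat <- 27
set.seed(123)
long_vec <- rbinom(n_resp * n_car * n_feat, 1, 0.3)   # hypothetical 0/1 entries
# stack the 14 cars rated by each of the 78 respondents:
# 1092 rows (respondent x car), 27 binary feature columns
rating <- matrix(long_vec, nrow = n_resp * n_car, ncol = n_feat)
dim(rating)
```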
<br />
<a href="http://joelcadwell.blogspot.com/2015/10/the-graphical-network-associated-with.html">As in the last post</a>, we will model the associative net underlying these ratings using the IsingFit R package. I would argue that it is difficult to assert any causal ordering among the car features. Which comes first in consumer perception, Workmanship or High Trade-In Value? Although objectively trade-in value depends on workmanship, it may be more likely that the consumer learns first that the car maintains its value and then infers high quality. A possible resolution is to treat each of the 27 nodes as a dependent variable in its own regression equation with the remaining nodes as predictors. In order to keep the model sparse, IsingFit fits the <a href="http://www.r-bloggers.com/trevor-hastie-presents-glmnet-lasso-and-elastic-net-regularization-in-r/">logistic regressions with the R package glmnet</a>.<br />
<br />
For instance, when Economical is the outcome, we estimate the impact of the other 26 nodes, including Powerful. Then, when Powerful is the outcome, we fit the same type of model with coefficients for the remaining 26 features, one of which is Economical. Nothing guarantees that the two effects will be the same (i.e., that Powerful's effect on Economical equals Economical's effect on Powerful, controlling for all the other features). Since an undirected graph needs a symmetric affinity matrix as input, IsingFit checks whether both coefficients are nonzero (remember that sparse modeling yields lots of zero weights) and then averages the coefficient for Powerful in the Economical model with the coefficient for Economical in the Powerful model (called the AND rule).<br />
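A toy illustration of the AND rule may help. This sketch uses unpenalized glm() on simulated data rather than the lasso-penalized glmnet fits that IsingFit actually runs, so the zero check is only meaningful under the lasso; feature names are borrowed from the example above:

```r
set.seed(42)
# Simulated binary features standing in for three of the car ratings
x <- data.frame(Economical = rbinom(500, 1, 0.5),
                Powerful   = rbinom(500, 1, 0.5),
                Sporty     = rbinom(500, 1, 0.5))
# Each node is regressed on the remaining nodes
b1 <- coef(glm(Economical ~ Powerful + Sporty, binomial, data = x))["Powerful"]
b2 <- coef(glm(Powerful ~ Economical + Sporty, binomial, data = x))["Economical"]
# AND rule: keep the edge only if both directions survive, then average
# (with the lasso, many coefficients are exactly zero; plain glm's are not)
edge <- if (b1 != 0 && b2 != 0) mean(c(b1, b2)) else 0
```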
<br />
<a href="http://web.stanford.edu/~hastie/StatLearnSparsity/">Hastie, Tibshirani and Wainwright</a> refer to this approach as "neighborhood-based" in their chapter on graph and model selection. Two nodes are in the same neighborhood when mutual relationships remain after controlling for everything else in the model. The red edge between Economical and Powerful indicates that each was in the other's equation and that their average was negative. IsingFit outputs the asymmetric weights in a matrix called asymm.weights (Res$weiadj is the symmetric matrix after averaging). It is always a good idea to check this matrix and determine whether we are justified in averaging the upper and lower triangles.<br />
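Checking the symmetry can be as simple as comparing the matrix with its transpose. Here is a self-contained sketch with a small hypothetical weight matrix standing in for Res$asymm.weights:

```r
# Hypothetical 3 x 3 matrix of raw (unaveraged) coefficients
aw <- matrix(c(0.0,  0.4,  0.0,
               0.6,  0.0, -0.2,
               0.0, -0.3,  0.0), nrow = 3, byrow = TRUE)
asym <- abs(aw - t(aw))    # disagreement between the two directions
max(asym)                  # 0.2 here; large values argue against averaging
(aw + t(aw)) / 2           # what the AND-rule averaging returns for surviving edges
```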
<br />
It should be noted that the undirected graph is not a correlation network: the weighted edges represent conditional dependence relationships (a missing edge signals conditional independence), not correlations. You need only go back to the qgraph() function and replace Res$weiadj with cor(rating) or cor_auto(rating) in order to plot the correlation network. The qgraph documentation explains how cor_auto() checks whether a Pearson correlation is appropriate and substitutes a polychoric correlation (a tetrachoric, in this all-binary case) when it is not.<br />
<br />
Sacha Epskamp provides a good introduction to the different types of network maps in his post on <a href="http://psychosystems.org/network-model-selection-using-qgraph-1-3-10/">Network Model Selection Using qgraph</a>. Larry Wasserman covers similar topics at an advanced level in his course on <a href="http://www.stat.cmu.edu/~larry/=sml/">Statistical Machine Learning</a>. There is a handout on <a href="http://www.stat.cmu.edu/~larry/=sml/GraphicalModels.pdf">Undirected Graphical Models</a> along with two YouTube video lectures (#14 and #15). Wasserman raises some concerns about our ability to estimate conditional independence graphs when the data do not have just the right dependence structure (not too much and not too little), which is an interesting point of view given that he co-teaches the class with Ryan Tibshirani, whose name is associated with the lasso and sparse modeling.<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;"># R code needed to reproduce the undirected graph</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>plfm<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/car">car</a><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># car$data$rating is length 29,484</span>
<span style="color: #666666; font-style: italic;"># 78 respondents x 14 cars x 27 attributes</span>
<span style="color: #666666; font-style: italic;"># restructure as a 1092 row data frame with 27 columns</span>
rating<-<a href="http://inside-r.org/r-doc/base/data.frame"><span style="color: #003399; font-weight: bold;">data.frame</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/t"><span style="color: #003399; font-weight: bold;">t</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/matrix"><span style="color: #003399; font-weight: bold;">matrix</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/car">car</a>$data$rating<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a>=<span style="color: #cc66cc;">27</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/ncol"><span style="color: #003399; font-weight: bold;">ncol</span></a>=<span style="color: #cc66cc;">1092</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>rating<span style="color: #009900;">)</span><-<a href="http://inside-r.org/r-doc/base/colnames"><span style="color: #003399; font-weight: bold;">colnames</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/packages/cran/car">car</a>$freq1<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># fits conditional independence model</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>IsingFit<span style="color: #009900;">)</span>
Res <- IsingFit<span style="color: #009900;">(</span>rating<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/family"><span style="color: #003399; font-weight: bold;">family</span></a>=<span style="color: blue;">'binomial'</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a>=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Plot results:</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"qgraph"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># creates grouping of variables to be assigned different colors.</span>
gr<-<a href="http://inside-r.org/r-doc/base/list"><span style="color: #003399; font-weight: bold;">list</span></a><span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">3</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">8</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">20</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">25</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">2</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">5</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">7</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">23</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">26</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">4</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">10</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">16</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">17</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">21</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">27</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: #cc66cc;">9</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">11</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">12</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">14</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">15</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">18</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">19</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">22</span><span style="color: #009900;">)</span><span style="color: #009900;">)</span>
node_color<-<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"aquamarine"</span><span style="color: #339933;">,</span><span style="color: blue;">"lightgoldenrod"</span><span style="color: #339933;">,</span><span style="color: blue;">"lightpink"</span><span style="color: #339933;">,</span><span style="color: blue;">"cyan"</span><span style="color: #009900;">)</span>
qgraph<span style="color: #009900;">(</span>Res$weiadj<span style="color: #339933;">,</span> fade = <span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span> groups=gr<span style="color: #339933;">,</span>
color=node_color<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>rating<span style="color: #009900;">)</span><span style="color: #339933;">,</span> label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<b>The Graphical Network Associated with Customer Churn</b> (2015-10-08)<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
The node representing "Will Not Stay" draws our focus toward the left side of the following undirected graph. Customers of a health care insurance provider were asked about their intentions to renew at the next sign-up period. We focus on those indicating the greatest potential for defection by creating a binary indicator separating those who say they will not stay from everyone else. In addition, before telling us whether or not they intended to switch health care providers, these customers were given a checklist and instructed to check all the events that recently occurred (e.g., price increases, higher prescription costs, provider not covering all expenses, hospital and doctor visits, and customer service contacts).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh06KT8mrPRXrPxg0X2nZ8B4kU7CH9dTr-j-FWHBCi9bvGxQxGGsMtG9eYKj3p2Pd8MrR8kt6tyIjbegV8eA5ptVQz-UCMj0B4bJ9UNkMVZ9S1QaUlQ0fqPeHJ_gokrQM7OggztFn00nn8/s1600/IsingFit+Plot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh06KT8mrPRXrPxg0X2nZ8B4kU7CH9dTr-j-FWHBCi9bvGxQxGGsMtG9eYKj3p2Pd8MrR8kt6tyIjbegV8eA5ptVQz-UCMj0B4bJ9UNkMVZ9S1QaUlQ0fqPeHJ_gokrQM7OggztFn00nn8/s1600/IsingFit+Plot.jpg" /></a></div>
<div class="separator" style="clear: both;">
We should note that all we have are customer perceptions. There is no electronic record of price increases, claim rejections, direct billings by MDs or hospitals, customer service contacts, or doctor and hospital visits. That is, we do not have measures of the event occurrences that are independent of defection intention. Consequently, we have no justification for drawing an arrow from Premium Increases to Will Not Stay because the decision to churn impacts the willingness to check the Premium Up box. For example, everyone in the United States is likely to see some increase in their premiums, yet your willingness to check "yes" may depend on what else has occurred in your relationship with the insurance provider. Those wanting to remain dismiss the price increase as inflation or reframe it as essentially the same price, while those thinking of flight are more likely to take notice and take offense. It might help to think of this as a form of cognitive dissonance or simply selective attention. Regardless of the specifics of the cognitive and affective processes, the result is an undirected graph in which every node is both an outcome and a predictor.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
The thickness of the lines indicates the strength of the connections. These edges represent the relationship between two nodes controlling for all the other nodes in the graph. Because a checklist was used, all we have is a data matrix of yes (=1) and no (=0) responses. As I explained above, the only rating scale was dichotomized into Will Not Stay versus any other response. The data are proprietary, so all I can tell you is that there were more than a thousand customers and that each row was a profile of 11 binary variables coded zero or one. On the other hand, I can share the four lines of R code needed to run the analysis using the <a href="https://cran.r-project.org/web/packages/IsingFit/IsingFit.pdf">IsingFit R package</a> and a data frame called "events2" with 11 columns and lots of rows containing only zeros and ones (see the end of this post). In addition, I can provide the link to a comprehensive overview of the methodology, <a href="http://www.nature.com/articles/srep05918">A New Method for Constructing Networks from Binary Data</a>. Those seeking more will find the <a href="http://sachaepskamp.com/NA2014">notes from Sacha Epskamp's workshop</a> very helpful.</div>
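Since the data are proprietary, here is a hypothetical sketch of how such a data frame might be assembled; the column names, sample size, and proportions are invented, but the structure, an all-0/1 data frame with the dichotomized intention as one column, is what IsingFit expects:

```r
set.seed(123)
n <- 1200
intent <- sample(1:5, n, replace = TRUE)       # renewal-intention rating scale
events2 <- data.frame(
  WillNotStay  = as.numeric(intent == 1),      # dichotomized outcome
  PremiumUp    = rbinom(n, 1, 0.40),           # checklist items, 0 = no, 1 = yes
  DeductibleUp = rbinom(n, 1, 0.30),
  ClaimNotPaid = rbinom(n, 1, 0.15)
)
stopifnot(all(unlist(events2) %in% 0:1))       # IsingFit needs binary input only
```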
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Getting back to our network, it seems that when Premiums go up, so do Deductibles and Co-pays. Cost increases form a clique near the bottom of the graph with edges suggesting that anticipated defection co-varies with price increases. A similar effect can be seen for prescription costs near the top. However, nothing seems to encourage exit more than a provider's failure to pay. Or, at least, those who will not stay checked the box associated with the provider not paying. Moreover, we can observe some separation and independence in this undirected graph. Visits to your doctor, a specialist, or the hospital have positive connections to customer churn only through the receipt of a bill or a customer service contact.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Hopefully, this example demonstrates that a lot can be learned from an undirected graphical representation of dichotomous survey data. Bayesian networks, more correctly called directed graphs, seem to attract a good deal of attention in marketing (e.g., <a href="http://www.bayesia.us/">BayesiaLab</a>), as do structural equation models (see my previous post on <a href="http://joelcadwell.blogspot.com/2015/10/undirected-graphs-when-causality-is.html">Undirected Graphs When the Causality Is Mutual</a>). In fact, my first post in this blog, <a href="http://joelcadwell.blogspot.com/2012/07/network-visualization-of-key-driver.html">Network Visualization of Key Driver Analysis</a>, demonstrates how much can be summarized quickly and clearly in an undirected graph. Another post, <a href="http://joelcadwell.blogspot.com/2014/01/metaphors-matter-factor-structure-vs.html">Metaphors Matter</a>, compares factor analysis and correlation network maps.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
To be clear, a graph displays an adjacency matrix that can contain any measure, often an index of association, affiliation or affinity. Any similarity or distance matrix can be graphed. Thus, we need to be careful when we interpret the resulting graphs. In this case, the adjacency matrix contained the averaged coefficients from a sparse logistic regression with each node as the dependent variable and all the remaining nodes as predictors. This means that our graph is not a correlation network because the adjacency matrix does not contain correlations. It is more like a partial correlation network, except that the adjacency matrix holds not partial correlations but coefficients that can be interpreted in much the same way. Fortunately, you can read the graph as showing the relationship between two nodes controlling for the rest while you learn the details of the Ising model for graphing discrete data.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
<br /></div>
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><span style="color: #666666; font-style: italic;">### Fit using IsingFit ###</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span>IsingFit<span style="color: #009900;">)</span>
Res <- IsingFit<span style="color: #009900;">(</span>events2<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/stats/family"><span style="color: #003399; font-weight: bold;">family</span></a>=<span style="color: blue;">'binomial'</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/plot"><span style="color: #003399; font-weight: bold;">plot</span></a>=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Plot results:</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"qgraph"</span><span style="color: #009900;">)</span>
qgraph<span style="color: #009900;">(</span>Res$weiadj<span style="color: #339933;">,</span> fade = <span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>events2<span style="color: #009900;">)</span><span style="color: #339933;">,</span> label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<b>Undirected Graphs When the Causality Is Mutual</b> (2015-10-05)<br />
<br />
Structural equation models impose causal order on a set of observations. We start with a measurement model: a list of theoretical constructs and a table assigning what is observed (manifest) to what is hidden (latent). Although it is possible to think of this assignment as <a href="http://aliquote.org/memos/2011/12/12/formative-vs-reflective-measurement">formative rather than reflective</a>, the default is a causal connection with the latent variables responsible for the observed scores. Next, we draw arrows specifying the cause-and-effect relationships among the latent variables. All of this is shown in great detail with a customer satisfaction example in the very well-written <a href="https://cran.r-project.org/web/packages/semPLS/vignettes/semPLS-intro.pdf">vignette for the R package semPLS</a>, which uses partial least squares (PLS) to fit structural equation models (SEM).<br />
<br />
Your focus should be on the causal model and not the estimation technique. PLS is optional, and all the parameters can be estimated using maximum likelihood with the <a href="http://lavaan.ugent.be/">lavaan R package</a>. However, you can get access to the dataset through the semPLS package, and you will not find a better description of this particular example or the steps involved in specifying and testing a SEM.<br />
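To make the two-layer structure concrete, here is a hedged sketch of lavaan model syntax. The construct groupings are illustrative and do not reproduce the semPLS mobi specification, although the indicator names match the descriptive names assigned in the code at the end of this post:

```r
# Measurement model: =~ assigns manifest indicators to latent constructs.
# Structural model: ~ draws the causal arrows among the latents.
model <- '
  Quality      =~ OverallQuality + NetworkQuality + ServiceQuality
  Satisfaction =~ OverallSat + Fulfilled + IsIdeal
  Loyalty      =~ BuyAgain + Recommend
  Satisfaction ~ Quality
  Loyalty      ~ Satisfaction
'
# fit <- lavaan::sem(model, data = mobi)   # maximum likelihood estimation
```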
<br />
As always, there are issues. An earlier post raises a number of concerns with this <a href="http://joelcadwell.blogspot.com/2015/08/sensemaking-in-r-plenitude-of-models.html">tale of causal links</a>, suggesting that we might be asked to assume too much when we impose a directionality on mutually interacting components. For example, when it requires effort to change product or service providers, it might be easier to believe that all competitors are the same and that it is futile to seek a better deal elsewhere. Here, the decision to Buy Again encourages us to rethink our dissatisfaction and raise the ratings above what would have been given had switching been easier. Such mutual dependencies are represented by undirected graphs, and for social scientists, the <a href="http://psychosystems.org/network-model-selection-using-qgraph-1-3-10/">R package qgraph</a> provides an introduction.<br />
<br />
My goal in this post is a modest one: to demonstrate that one can learn a great deal from a series of customer ratings without needing to force the data into a causal model. This is achieved by examining the following partial correlation network.<br />
<br />
You should recall that a graph is a visual display of some adjacency matrix. In this case we define adjacency as the partial correlation between two nodes after controlling for all the other nodes in the graph. Actually, our adjacency matrix is a bit more complicated because we applied the <a href="https://en.wikipedia.org/wiki/Graphical_lasso">graphical lasso</a> to obtain our estimates. The details are important, yet one can learn a great deal from the graph knowing little more than that the edges show us conditional association after removing the other nodes and that we have made some effort to eliminate as many edges as possible (a sparse undirected graph).<br />
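The connection between partial correlations and the inverse covariance (precision) matrix can be seen in a few lines of base R. This sketch skips the lasso penalty that EBICglasso adds and uses mtcars as stand-in data:

```r
S <- cov(mtcars[, c("mpg", "wt", "hp", "disp")])
K <- solve(S)          # precision (inverse covariance) matrix
pcor <- -cov2cor(K)    # standardize, then flip the sign of the off-diagonals
diag(pcor) <- 1        # diagonal is 1 by convention
round(pcor, 2)         # off-diagonal entries are the partial correlations
```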
<br />
All the R code needed to replicate this analysis appears at the end of this post. One of the original 24 items, #9 SwitchForPrice, was removed because it had no edge to any of the other nodes in this partial correlation network (the semPLS documentation reveals that the question had a unique format).<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6z0-DNoBssrb52Ugm5kNuj4y1VqMbaKkvHh-KhMESEikiQ6feVLls69yYAlJL89MWldODHpTVMunUDdkzQZs-Xj70G4j20uNTb1-YmID4IIkAT52mdRDcqGqxZnnsQPytLRm0Nm7WLNc/s1600/undirected+graph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6z0-DNoBssrb52Ugm5kNuj4y1VqMbaKkvHh-KhMESEikiQ6feVLls69yYAlJL89MWldODHpTVMunUDdkzQZs-Xj70G4j20uNTb1-YmID4IIkAT52mdRDcqGqxZnnsQPytLRm0Nm7WLNc/s1600/undirected+graph.jpg" /></a></div>
<br />
One way to start is to identify the thickest edges connecting the remaining 23 customer perception, satisfaction and loyalty ratings. Unsurprisingly, good value and fair price "hang together" since endorsing one and rejecting the other would seem to be a contradiction. Similarly, stability is a key component of network quality, reliability defines service quality, and we do not recommend that which we are unwilling to buy again. These single edges connecting two ratings with common meanings may not be that informative.<br />
<br />
What is interesting, however, is that we can read "the customer's mind" from the structure of the undirected graph. First, all the quality measures form a grouping toward the left of the graph: stable, network quality, reliability, service quality, and overall quality. As we move toward the right, we encounter overall satisfaction along with its companion positive perceptions of trusted and fulfilled. In the region just above fall the product and service attributes with range of products and services, innovative, and customer service. Corporate responsibility is more toward the left with the loyalty measures below (e.g., buy again and recommend).<br />
<br />
In general, expectations (go wrong, quality, and meet needs) are toward the top and behaviors near the bottom (complaint handling, recommend, and buy again). The most basic quality indicators are found on the left with the extras, such as good citizenship, appearing on the right (concerned, responsible, fair price, and good value).</div>
<br />
Over time, customers form impressions and reach conclusions about the companies providing them goods and services. These attributions are mutually supportive and create a system of interdependencies that seeks an equilibrium. Disturbing that equilibrium anywhere within the system will have its consequences. A company that provides small incentives to current customers in order to encourage them to recruit new customers gets both the new customers and recommending customers with higher satisfaction and improved impressions. Recommendation is more than the result of a sequential causal process with satisfaction as an input. The incentive is an intervention with satisfaction as the outcome. The causality is mutual.<br />
<br />
<br />
<div style="overflow: auto;">
<div class="geshifilter">
<pre class="r geshifilter-R" style="font-family: monospace;"><a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"semPLS"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># descriptive names for graph nodes</span>
<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">)</span><-<a href="http://inside-r.org/r-doc/base/c"><span style="color: #003399; font-weight: bold;">c</span></a><span style="color: #009900;">(</span><span style="color: blue;">"QualityExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"MeetNeedsExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoWrongExp"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallSat"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Fulfilled"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"IsIdeal"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ComplaintHandling"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"BuyAgain"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"SwitchForPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Recommend"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Trusted"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Stable"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Responsible"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Concerned"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Innovative"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"OverallQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"NetworkQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"CustomerService"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ServiceQuality"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"RangeProdServ"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"Reliability"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"ClearInfo"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"FairPrice"</span><span style="color: #339933;">,</span>
<span style="color: blue;">"GoodValue"</span><span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/library"><span style="color: #003399; font-weight: bold;">library</span></a><span style="color: #009900;">(</span><span style="color: blue;">"qgraph"</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Calculates Sparse Partial Correlation Matrix</span>
sparse_matrix<-EBICglasso<span style="color: #009900;">(</span><a href="http://inside-r.org/r-doc/stats/cor"><span style="color: #003399; font-weight: bold;">cor</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span><span style="color: #339933;">,</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> n=<span style="color: #cc66cc;">250</span><span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;"># Plots results</span>
ug<-qgraph<span style="color: #009900;">(</span>sparse_matrix<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/graphics/layout"><span style="color: #003399; font-weight: bold;">layout</span></a>=<span style="color: blue;">"spring"</span><span style="color: #339933;">,</span>
<a href="http://inside-r.org/r-doc/base/labels"><span style="color: #003399; font-weight: bold;">labels</span></a>=<a href="http://inside-r.org/r-doc/base/names"><span style="color: #003399; font-weight: bold;">names</span></a><span style="color: #009900;">(</span>mobi<span style="color: #009900;">[</span>-<span style="color: #cc66cc;">9</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span><span style="color: #339933;">,</span> label.scale=<span style="color: black; font-weight: bold;">FALSE</span><span style="color: #339933;">,</span>
label.cex=<span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> node.width=<span style="color: #cc66cc;">.5</span><span style="color: #009900;">)</span></pre>
</div>
</div>
<br />
<a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<b>Matrix Factorization Comes in Many Flavors: Components, Clusters, Building Blocks and Ideals</b> (2015-08-06)<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
Unsupervised learning is covered in Chapter 14 of <a href="http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf">The Elements of Statistical Learning</a>. Here we learn about several data reduction techniques including principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA). Although on the surface they seem so different, each is a data approximation technique using matrix factorization with different constraints. We can learn a great deal if we compare and contrast these four major forms of matrix factorization.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRiF_FxjuyLfy407_YU4xF7jCtKswXPUuApq0smyE9G0wQ9NGIcsL0ZLkxIEQst6KklAqw3YaIM7XeoLeo51WThhDTktmPBiViOj0kr1vczcmTUcqly-7NPuUnpgpzrcXea-iRUCAaydk/s1600/matrix+factorization.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRiF_FxjuyLfy407_YU4xF7jCtKswXPUuApq0smyE9G0wQ9NGIcsL0ZLkxIEQst6KklAqw3YaIM7XeoLeo51WThhDTktmPBiViOj0kr1vczcmTUcqly-7NPuUnpgpzrcXea-iRUCAaydk/s640/matrix+factorization.jpg" width="640" /></a></div>
Robert Tibshirani outlines some of these interconnections in a <a href="http://statweb.stanford.edu/~tibs/sta306bfiles/nnmf.pdf">group of slides from one of his lectures</a>. If there are still questions, <a href="https://www.youtube.com/watch?v=kfEWZA-b-YQ">Christian Thurau's YouTube video</a> should provide the answers. His talk is titled "Low-Rank Matrix Approximations in Python," yet the only Python you will see is a couple of function calls that look very familiar. R, of course, has many ways of doing K-means and principal component analysis. In addition, I have posts showing how to run <a href="http://joelcadwell.blogspot.com/2014/08/customer-segmentation-using-purchase.html">nonnegative matrix factorization</a> and <a href="http://joelcadwell.blogspot.com/2012/07/archetypal-analysis.html">archetypal analysis</a> in R.<br />
<br />
As a reminder, supervised learning also attempts to approximate the data, in this case the Ys given the Xs. In <a href="http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Multivariate-Linear-Models.pdf">multivariate multiple regression</a>, we have many dependent variables so that both Y and B are matrices instead of vectors. The usual equation remains Y = XB + E, except that Y and E are now matrices with a row for every observation, B has a row for every predictor, and all three have a column for every outcome variable. The error is made as small as possible as we try to reproduce our set of dependent variables from the observed Xs.<br />
<br />
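The matrix form of Y = XB + E can be made concrete with a minimal sketch in base R. The data and coefficient matrix below are simulated purely for illustration; R's lm function accepts a matrix response and fits all outcome columns at once.

```r
set.seed(3)
X <- matrix(rnorm(100 * 2), 100, 2)                  # 100 observations, 2 predictors
B <- rbind(c(1.0, -1.0), c(0.5, 2.0))                # one row per predictor, one column per outcome
Y <- X %*% B + matrix(rnorm(200, sd = 0.1), 100, 2)  # Y = XB + E with two outcome variables
fit <- lm(Y ~ X - 1)   # multivariate multiple regression (no intercept)
coef(fit)              # a 2 x 2 coefficient matrix close to B
```

Each column of coef(fit) holds the regression weights for one dependent variable, which is exactly the columnwise reading of B in the equation above.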
<br />
<b>K-means and PCA</b><br />
<br />
Without predictors we lose our supervision and are left to search for redundancies or patterns in our Ys without any Xs. We are free to test alternative data generation processes. For example, can variation be explained by the presence of clusters? As shown in the YouTube video and the accompanying slides from the presentation, the data matrix (V) can be reproduced by the product of a cluster membership matrix (W) and a matrix of cluster centroids (H). Each row of W contains all zeros except for a single one that stamps out that cluster profile. With K-means, for instance, cluster membership is all-or-none with each cluster represented by a complete profile of averages calculated across every object in the cluster. The error is the extent to which the observations in each grouping differ from their cluster profile.<br />
<br />
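This V = WH view of K-means is easy to verify in base R. The sketch below uses the built-in iris measurements purely as a stand-in data matrix: the indicator matrix W stamps out each cluster's centroid profile, and the squared reconstruction error equals the within-cluster sum of squares.

```r
set.seed(42)
V <- as.matrix(iris[, 1:4])            # 150 objects x 4 features
km <- kmeans(V, centers = 3, nstart = 25)

W <- diag(3)[km$cluster, ]             # membership: all zeros except a single one per row
H <- km$centers                        # 3 x 4 matrix of cluster centroid profiles
V_hat <- W %*% H                       # every row is replaced by its cluster's profile

sum((V - V_hat)^2)                     # identical to km$tot.withinss
```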
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfl-QEjvPyS_vUDtDBcOmj8OJNZCstqgZom5tadjX6ym31VWP55UzsPvca7AIKbDSJenWzBHcO2g6w30ZL0EKB4pbr96xPXZRvveXRWcJKwTVRdKtIosA2cSKiuYw6Xas1qNX0cZwQV9Q/s1600/NMF.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="98" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfl-QEjvPyS_vUDtDBcOmj8OJNZCstqgZom5tadjX6ym31VWP55UzsPvca7AIKbDSJenWzBHcO2g6w30ZL0EKB4pbr96xPXZRvveXRWcJKwTVRdKtIosA2cSKiuYw6Xas1qNX0cZwQV9Q/s400/NMF.png" width="400" /></a></div>
<br />
Principal component analysis works in a similar fashion, but now the rows of W are principal component scores and H holds the principal component loadings. In both PCA and K-means, V = WH but with different constraints on W and H. W is no longer all zeros except for a single one, and H is not a collection of cluster profiles. Instead, H contains the coefficients defining an orthogonal basis for the data cloud with each successive dimension accounting for a decreasing proportion of the total variation, and W tells us how much each dimension contributes to the observed data for every observation.<br />
<br />
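The same V = WH identity holds for PCA and can be checked with prcomp (again using the iris measurements only as an illustrative data matrix): because the rotation matrix is orthogonal, the scores times the transposed loadings return the centered data exactly when all components are kept.

```r
V <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)  # centered data matrix
pca <- prcomp(V)
W <- pca$x                # principal component scores
H <- t(pca$rotation)      # rows of H hold the component loadings
max(abs(V - W %*% H))     # zero up to rounding: V = WH with no error
```

Dropping the later rows of H and columns of W gives the low-rank approximation, which is where the data reduction comes in.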
An early application to intelligence testing serves as a good illustration. Test scores tend to be correlated positively so that all the coefficients in H for the first principal component will be positive. If the tests include more highly intercorrelated verbal or reading scores along with more highly intercorrelated quantitative or math scores, then the second principal component will be bipolar with positive coefficients for verbal variables and negative coefficients for quantitative variables. You should note that the signs in any row of H can be reversed, for such a reversal only changes direction. Finally, W tells us the impact of each principal component on the observed test scores in data matrix V.<br />
<br />
Smart test takers have higher scores on the first principal component, which uniformly increases all the test scores. Those with higher verbal than quantitative skills will also have higher positive values for their second principal component. Given its bipolar coefficients, this will raise the scores on the verbal tests and lower the scores on the quantitative tests. And that is how PCA reproduces the observed data matrix.<br />
<br />
We can use the R package FactoMineR to plot the features (columns) and objects (rows) in the same space. The same analysis can be performed using the biplot function in R, but FactoMineR offers much more and supports it all with documentation. I have borrowed these two plots from an earlier post, <a href="http://joelcadwell.blogspot.com/2014/07/using-biplots-to-map-cluster-solutions.html">Using Biplots to Map Cluster Solutions</a>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaai-RiQgmKbsSOc0dESRrWeMOgZuW_PDJbOqnL8UVx8hgai_FwPfScQXfsYQLVSjCld9T4U1vkf9OaFS-VX3xs0KI2GdhqwtmCGHWZ72RiT2-iujAA9IC4s1gTXNH4jguJPtQfRmuY4Y/s1600/Scotch+Whiskies+Biplot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaai-RiQgmKbsSOc0dESRrWeMOgZuW_PDJbOqnL8UVx8hgai_FwPfScQXfsYQLVSjCld9T4U1vkf9OaFS-VX3xs0KI2GdhqwtmCGHWZ72RiT2-iujAA9IC4s1gTXNH4jguJPtQfRmuY4Y/s1600/Scotch+Whiskies+Biplot.jpeg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTb-dkixgu7IYjo8Bza5nOeuwAOiH5rmnHQT6NggJ9i8oW696CsYjdFQ-SVV9VcxVjZjximCjvWWJnlkFu6LgbIVzL191beSt2cgpqSRozlA3NXB0LHZHyGPkvVhac6Cey5PP4R_C6bhQ/s1600/Individual+Biplot.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTb-dkixgu7IYjo8Bza5nOeuwAOiH5rmnHQT6NggJ9i8oW696CsYjdFQ-SVV9VcxVjZjximCjvWWJnlkFu6LgbIVzL191beSt2cgpqSRozlA3NXB0LHZHyGPkvVhac6Cey5PP4R_C6bhQ/s1600/Individual+Biplot.jpeg" /></a></div>
<br />
FactoMineR separates the variables and the individuals in order not to overcrowd the maps. As you can see from the percent contributions of the two dimensions, this is the same space so that you can overlay the two plots (e.g., the red data points are those with the highest projection onto the Floral and Sweetness vectors). One should remember that variables are drawn as arrows (vectors), and scores on those variables are reproduced as orthogonal projections onto each vector.<br />
<br />
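If FactoMineR is not installed, the base-R biplot function produces a comparable joint map of rows and columns. This is a minimal sketch using the built-in USArrests data rather than the whiskies data from the post; plotting is sent to a null device here so the snippet runs unattended.

```r
pca <- prcomp(USArrests, scale. = TRUE)  # standardize since the variables differ in units
pdf(NULL)                 # null graphics device; call biplot(pca) interactively instead
biplot(pca)               # observations as points, variables as arrows
invisible(dev.off())
```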
The prior post attempted to show the relationship between a cluster and a principal component solution. PCA relies on a "new" dimensional space obtained through linear combinations of the original variables. On the other hand, clusters are a discrete representation. The red points in the above individual factor map are similar because they are of the same type with any differences among these red dots due to error. For example, sweet and sour (medicinal on the plot) are taste types with their own taste buds. However, sweet and sour are perceived as opposites so that the two clusters can be connected using a line with sweet-and-sour tastes located between the extremes. Dimensions can always be reframed as convex combinations of discrete categories, rendering the qualitative-quantitative distinction somewhat less meaningful.<br />
<br />
<br />
<b>NMF and AA</b><br />
<br />
It may come as no surprise to learn that nonnegative matrix factorization, given it is nonnegative, has the same form with all the elements of V, W, and H constrained to be zero or positive. The result is that W becomes a composition matrix with nonzero values in a row picking the elements of H as parts of the whole being composed. Unlike PCA where H may represent contrasts of positive and negative variable weights, H can only be zero or positive in NMF. As a result, H bundles together variables to form weighted composites.<br />
<br />
The columns of W and the rows of H represent the latent feature bundles that are believed to be responsible for the observed data in V. The building blocks are not individual features but weighted bundles of features that serve a common purpose. One might think of the latent bundles using a "tools in the toolbox" metaphor. You can find a <a href="http://joelcadwell.blogspot.com/2014/11/building-blocks-compelling-image-for.html">detailed description showing each step in the process in a previous post</a> and many examples with the needed R code throughout this blog.<br />
<br />
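The NMF package in R wraps all of this up, but the core idea fits in a few lines of base R. Below is a minimal sketch of the classic Lee-Seung multiplicative updates on simulated nonnegative data (a hand-rolled illustration, not the package's algorithm verbatim): the updates multiply by nonnegative ratios, so every element of W and H stays nonnegative while the reconstruction error shrinks.

```r
set.seed(1)
V <- matrix(runif(20 * 5), 20, 5)        # nonnegative data matrix
r <- 2                                   # number of latent feature bundles
W <- matrix(runif(20 * r), 20, r)        # random nonnegative starting values
H <- matrix(runif(r * 5), r, 5)
err0 <- sum((V - W %*% H)^2)             # squared error at the random start

for (i in 1:500) {                       # multiplicative updates preserve nonnegativity
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}
err <- sum((V - W %*% H)^2)              # much smaller than err0
```

The rows of the fitted H are the weighted feature bundles described above, and each row of W says how much of each bundle an observation uses.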
Archetypal analysis is another variation on the matrix factorization theme with the observed data formed as convex combinations of extremes on the hull that surrounds the point cloud of observations. Therefore, the profiles of these extremes or ideals are the rows of H and can be interpreted as representing opposites at the edge of the data cloud. Interpretation seems to come naturally since <a href="http://joelcadwell.blogspot.com/2014/12/archetypal-analysis-similarity-defined.html">we tend to think in terms of contrasting ideals</a> (e.g., sweet-sour and liberal-conservative).<br />
<br />
This is the picture used in <a href="http://joelcadwell.blogspot.com/2012/07/archetypal-analysis.html">my original archetypal analysis post</a> to illustrate the point cloud, the variables projected as vectors onto the same space, and the locations of the 3 archetypes (A1, A2, A3) compared with the placement of the 3 K-means centroids (K1, K2, K3). The archetypes are positioned as <a href="http://arxiv.org/pdf/1410.0642.pdf">vertices of a triangle spanning the two-dimensional space</a> with every point lying within this simplex. In contrast, the K-means centroids are pulled more toward the center and away from the periphery.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYSj1ZJgCpP6j2hRuaN_uCk2Y4bI895M6pG31q6xP2fpAgRlryJ4WF_2WvKBz_fMMcFTxtee11Gyj6ZlAxM2tlg5TmUPxyhQhjagNdWT_-oOEZKaRioBJ8SHDJTRkEG3KYGHgz4DZycgs/s1600/archetype_centroids.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYSj1ZJgCpP6j2hRuaN_uCk2Y4bI895M6pG31q6xP2fpAgRlryJ4WF_2WvKBz_fMMcFTxtee11Gyj6ZlAxM2tlg5TmUPxyhQhjagNdWT_-oOEZKaRioBJ8SHDJTRkEG3KYGHgz4DZycgs/s400/archetype_centroids.jpeg" width="400" /></a></div>
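The geometry in the figure is easy to reproduce. In archetypal analysis V = WH with the rows of W forming convex weights (nonnegative and summing to one), so every reconstructed observation falls inside the simplex spanned by the archetypes. A small base-R sketch with made-up triangle vertices standing in for fitted archetypes:

```r
# three hypothetical archetype profiles at the vertices of a triangle
H <- rbind(A1 = c(0, 0), A2 = c(4, 0), A3 = c(2, 3))
set.seed(7)
W <- matrix(rexp(30 * 3), 30, 3)
W <- W / rowSums(W)            # convex weights: nonnegative rows summing to 1
V <- W %*% H                   # 30 reconstructed observations, all inside the triangle
```

K-means centroids, by contrast, are averages of interior points and so get pulled toward the middle of the cloud, which is the difference the figure illustrates.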
<b>Why So Many Flavors of Matrix Factorization?</b><br />
<br />
We try to make sense of our data by <a href="http://joelcadwell.blogspot.com/2015/08/sensemaking-in-r-plenitude-of-models.html">understanding the underlying process that generated that data</a>. Matrix factorization serves us well as a general framework. If every variable were mutually independent of all the rest, we would not require a matrix H to extract latent variables. Moreover, if every latent variable had the same impact for every observation, we would not require a matrix W holding differential contributions. The equation V = WH expresses the observed data as arising from two sources: W, which can be interpreted as a matrix of latent scores, and H, which serves as a matrix of latent loadings. H defines the relationship between observed and latent variables. W represents the contributions of the latent variables for every observation. We call this process matrix factorization or matrix decomposition for obvious reasons.<br />
<br />
Each of the four matrix factorizations adds some type of constraint in order to obtain a W and H. Each constraint provides a different view of the data matrix. PCA is a variance maximizer yielding a set of components, each accounting for the most variation independent of all preceding components. K-means gives us boxes with minimum variation within each box. We get building blocks and individualized rules of assembly from NMF. Finally, AA frames observations as compromises among ideals or archetypes. The data analyst must decide which story best fits their data.<br />
<br />
<b>Sensemaking in R: A Plenitude of Models Makes for Good Storytelling</b> (2015-08-03)<br />
<div class="separator" style="clear: both; text-align: left;">
"Sensemaking is a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively."</div>
<div class="separator" style="clear: both; text-align: right;">
- <a href="http://cmapsinternal.ihmc.us/rid=1JD6BQHJT-900VSD-L1F/21.%20Sensemaking.1.pdf">Gary Klein, Brian Moon & Robert Hoffman</a></div>
<div class="separator" style="clear: both; text-align: right;">
Making Sense of Sensemaking 1 (2006)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Story #1: A Tale of Causal Links</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
A causal model can serve as a sensemaking tool. I have reproduced below <a href="http://joelcadwell.blogspot.com/2014/05/customer-satisfaction-and-loyalty.html">a path diagram from an earlier post</a> organizing a set of customer ratings based on their hypothesized causes and effects. As shown on the right side of the graph, satisfaction comes first and loyalty follows with input from image and complaints. Value and Quality perceptions are positioned as drivers of satisfaction. Image seems to be separated from product experience and causally prior. Of course, <a href="http://deepblue.lib.umich.edu/bitstream/handle/2027.42/35761/b203508x.0001.001.pdf?sequence=2">you are free to disagree with the proposed causal structure</a>. All I ask is that you "see" how such a path diagram can be imposed on observed data in order to connect the components and predict the impact of marketing interventions.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv6kVNNbcWhBENYOUND7tzJEdgLXnvs8NVJ7vtgQNmAGVOoNVaeg13Cfssm4Qvm9gJffH4WulwrR-9-hOPC3y_MtbxKIDYMFprWVguM8QvBBOtYUgFUYoCPOjHWzM_xTJEW974Ll-uRlc/s1600/plspm.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv6kVNNbcWhBENYOUND7tzJEdgLXnvs8NVJ7vtgQNmAGVOoNVaeg13Cfssm4Qvm9gJffH4WulwrR-9-hOPC3y_MtbxKIDYMFprWVguM8QvBBOtYUgFUYoCPOjHWzM_xTJEW974Ll-uRlc/s400/plspm.gif" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
Actually, the nodes are latent variables, and I have not drawn in the measurement model. The typical customer satisfaction questionnaire has many items tapping each construct. In my previous post referenced above, I borrowed the mobile phone dataset from the R package semPLS, where loyalty was assessed with three ratings: continued usage, switching to a lower-price competitor, and likelihood to recommend. These items are seen as indicators of commitment and attachment, and their intercorrelations are due to their common cause, which we have labeled Loyalty.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Where Do Causal Models Come From?</b> The data were collected at one point in time, but it is difficult not to impose a learning sequence on the ratings. That is, the analyst overlays the formation process onto the data as if the measurements were made as learning occurred. Brand image is believed to be acquired first, and expectations are thought to be formed before the purchase is made. Product experience is understood to come next in the sequence, followed by an evaluation and, finally, the loyalty decisions to continue using and recommend to others.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
As I argued in the prior post, causation is not in the data because the ratings were not gathered over time. By the time the questionnaire is seen, dissonance has already worked its way backward, creating consistencies in the ratings. For instance, when switching is a chore, satisfaction and product perceptions are all higher than they would have been had changing providers been an easier task. In a similar manner, reluctantly recommending only when pressed for your opinion may reverse the direction of the arrows and at least temporarily raise all ratings. We shall see in the next section how ratings are interconnected by a network of consumer inferences reflecting not observed covariation but belief and semantics.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Story #2: Living on a One-Dimensional Love-Hate Manifold (Halo Effects)</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Our first sensemaking tool, structural equation modeling, was shaped by an intricate plot with many characters playing fixed causal roles. Few believe that this is the only way to make sense of the connections among the different ratings. For some, including myself, the causal model seems a bit too rational. What happened to affect? Halo effects are thought of as a cognitive bias, but all summaries introduce bias measured by the variation about the centroid. In the case of customer satisfaction and loyalty, a pointer on a single evaluative dimension can reproduce all the ratings. You tell me that you are very satisfied with your mobile phone provider, and I can predict that you are not dropping a lot of calls.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The <a href="http://joelcadwell.blogspot.com/2013/04/halo-effects-vs-intention-laden-ratings.html">halo effect functions as a form of data comprehension</a>. We learn what constitutes a "good" product or service before we buy. These are the well-formed performance expectations that serve as the tests for deciding satisfaction. We are upset when the basic functions that are must-haves are not delivered (e.g., failure of our mobile phone to pair with the car's Bluetooth), and we are delighted when extras are included that we did not expect (e.g., responsive customer support). Most of these expectations lie just below awareness until experienced (e.g., breakage and repair costs when the phone is dropped a short distance or onto a relatively soft surface).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This representation orders features and services as milestones along a single dimension so that one can read overall satisfaction from a position along this path. You may be familiar with such sensemaking tools from the measurement of achievement (e.g., spelling ability is assessed by the difficulty of the words that one can spell) or political ideology (e.g., a legislator's position along the liberal-conservative continuum depends on the bills voted for and against). In the same way, we evaluate brands and their products by the features and services they are able to provide. Reanalyzing the same customer satisfaction ratings, the graded response model from the ltm R package will order both customers and rating items along the same latent satisfaction dimension, as shown in my post <a href="http://joelcadwell.blogspot.com/2013/01/item-response-modeling-of-customer.html">Item Response Modeling of Customer Satisfaction</a>.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Perhaps you noticed that we have changed our perspective or shifted to a new paradigm. Feature ratings are no longer drivers of satisfaction; instead, they have become indicators of satisfaction. In Story #1, a Tale of Causal Links, the arrows go from the features to satisfaction and loyalty. Driver analysis accumulates satisfaction feature by feature with each adding a component to the overall reservoir of goodwill. However, in Story #2 all the ratings (features, satisfaction, and loyalty) fall along the same evaluative continuum from rage to praise. We can still display the interrelationship with a diagram, though we need to drop the arrows, for everything is interconnected in this network.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The manifold from Story #2 makes sense of the data by ranking features based on performance expectations. Some features and services are basic, and every product scores well on them. The premium features and services, on the other hand, are those not provided by every product. Customers decide what they want and what they are willing to pay, and then they assess the ability of the purchased product to deliver. This is not a driver analysis, for the assessment of each component is not independent of the other components.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Those of us willing to live with the imperfections of our current product tend to rate the product higher in a backward adjustment from loyalty to feature performance. You do something similar when you determine that switching is useless because all the competitors are the same. Can I alter your perceptions by tempting you with a $100 bonus or a free month of service to recommend a friend? It's a network of jointly determined nodes with a directionality represented by the love-hate manifold. The ability to generate satisfaction or engender loyalty is but another node, different from product quality perceptions, yet still part of the network.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
How else can you explain how randomly attaching a higher price to a bottle of wine yields higher ratings for taste? <a href="https://www.gsb.stanford.edu/insights/baba-shiv-how-wines-price-tag-affect-its-taste">Price changes consumer perceptions of quality</a> because consumers make inferences about uncertain features based on what they know about more familiar features. When asked about customer support, you can answer even if you have never contacted or used customer support. You simply <a href="http://www.researchgate.net/profile/Steven_Posavac/publication/228137043_Consumer_Inference/links/0a85e53b46bedd63f5000000.pdf">fill in a rating with an inference from other features</a> with which you are more familiar or you simply assume it must be good or bad because you are happy or unhappy overall. Such a <a href="http://joelcadwell.blogspot.com/2013/11/key-driver-vs-network-analysis-in-r.html">network analysis can be done with R</a>, as can the driver analysis from our first story.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<b>But I Don't Want to Be a Statistician!</b> (2015-07-29)<br />
<br />
"For a long time I have thought I was a statistician.... But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.... All in all, I have come to feel that my central interest is in data analysis...."<br />
<br />
<div style="text-align: right;">
Opening paragraph from John Tukey "<a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoms/1177704711">The Future of Data Analysis</a>" (1962)</div>
<div style="text-align: left;">
<br />
<br /></div>
<div style="text-align: left;">
To begin, we must acknowledge that these labels are largely administrative, based on who signs your paycheck. Still, I prefer the name "data analysis" with its active connotation. I understand the desire to rebrand data analysis as "data science," given the availability of so much digital information. As data has become big, it has become the star and the center of attention.<br />
<br />
We can borrow from <a href="https://projecteuclid.org/euclid.ss/1009213726">Breiman's two cultures of statistical modeling</a> to clarify the changing focus. If our data collection is directed by a generative model, we are members of an established data modeling community and might call ourselves statisticians. On the other hand, the algorithmic modeler (although originally considered a deviant, now <a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/">rich and sexy</a>) took whatever data was available and made black-box predictions. If you need a guide to applied predictive modeling in R, <a href="http://appliedpredictivemodeling.com/blog/">Max Kuhn</a> might be a good place to start.<br />
<br />
Nevertheless, causation keeps sneaking in through the back door in the form of causal networks. As an example, choice modeling can be justified as "as if" predictive modeling, but then it cannot be used for product design or pricing. <a href="http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf">As Judea Pearl notes</a>, most data analysis is "not associational but causal in nature."<br />
<br />
Does <a href="http://joelcadwell.blogspot.com/2014/05/which-came-first-preference-or-choice.html">an inductive bias or schema</a> predispose us to see the world as divided into causes and effects with features creating preference and preference impacting choice? Technically, the <a href="http://joelcadwell.blogspot.com/2014/11/lets-do-some-more-hierarchical-bayes.html">hierarchical Bayes choice model</a> does not require the experimental manipulation of feature levels, for example, reporting the likelihood of bus ridership for individuals with differing demographics. Even here, it is difficult not to see causation at work, with demographics becoming stereotypes. We want to be able to turn the dial, or at least select different individuals, and watch choices change. Are such cognitive tendencies part of statistics?<br />
<br />
Moreover, data visualization has always been an integral component in the R statistical programming language. Is data visualization statistics? And what of presentations like Hans Rosling's <a href="http://www.ted.com/talks/hans_rosling_at_state?language=en">Let My Dataset Change Your Mindset</a>? Does statistics include argumentation and persuasion?<br />
<br />
<b>Hadley Wickham and the Cognitive Interpretation of Data Analysis</b><br />
<br />
You have seen all of his <a href="http://www.r-bloggers.com/hadley-wickham-on-why-he-created-all-those-r-packages/">data manipulation packages in R</a>, but you may have missed the theoretical foundations in the paper "<a href="http://vita.had.co.nz/papers/sensemaking.pdf">A Cognitive Interpretation of Data Analysis</a>" by Grolemund and Wickham. Sensemaking is offered as an organizing force with data analysis as an external tool to aid understanding. We can make sensemaking less vague with an illustration.<br />
<br />
Perceptual maps are graphical displays of a data matrix such as the one below from <a href="http://joelcadwell.blogspot.com/2014/06/the-unavoidable-instability-of-brand.html">an earlier post showing the association between 14 European car models and 27 attributes</a>. Our familiarity with Euclidean spaces aids in the interpretation of the 14 x 27 association table. It summarizes the data using a picture and enables us to speak of repositioning car models. The joint plot can be seen as the competitive landscape, and soon the language of marketing warfare brings this simple 14 x 27 table to life. Where is the high ground or an opening for a new entry? How can we guard against an attack from below? This is sensemaking, but is it statistics?<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirILPURF1xf_qu8qmnsWdNu7CWgxTayaK17TxQL_ywetfQ42toydsuVBjxRvRHPfU7gne92BXiIGIgSjA0IyynEv7T_oTwbIeYSi9TUfxj4q8UNvAetPDIaQJ4bPqvl84eONAumiHi2Xc/s1600/Joint+Plot+for+Car.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirILPURF1xf_qu8qmnsWdNu7CWgxTayaK17TxQL_ywetfQ42toydsuVBjxRvRHPfU7gne92BXiIGIgSjA0IyynEv7T_oTwbIeYSi9TUfxj4q8UNvAetPDIaQJ4bPqvl84eONAumiHi2Xc/s640/Joint+Plot+for+Car.jpeg" width="608" /></a></div>
<br /></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I consider myself a marketing researcher, though with a PhD I get more work calling myself a marketing scientist. I am a data analyst and not a statistician, yet in casual conversation I might say that I am a statistician in the hope that the label provides some information. It seldom does.<br />
<br />
I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing. My audience has no training in research or mathematics. Statistics plays a role and R helps, but I never wanted to be a statistician. <a href="https://en.wikipedia.org/wiki/The_Outing_(Seinfeld)">Not that there is anything wrong with that.</a></div>
Unknownnoreply@blogger.com4tag:blogger.com,1999:blog-590043897961646114.post-83570996699760857232015-07-27T15:49:00.000-07:002015-07-27T15:49:11.945-07:00Statistical Models of Judgment and Choice: Deciding What Matters Guided by Attention and IntentionPreference begins with attention, a form of <a href="http://www.vernon.eu/ACS/ACS_09.pdf">intention-guided perception</a>. You enter the store thirsty on a hot summer day, and all you can see is the beverage cooler at the far end of the aisle with your focus drawn toward the cold beverages that you immediately recognize and desire. Focal attention is such a common experience that we seldom appreciate the important role that it plays in almost every activity. For instance, how are you able to read this post? Automatically and without awareness, you see words and phrases by <a href="https://en.wikipedia.org/wiki/Attention">blurring everything else in your perceptual field</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx5QAqy1v0ugivTFTW9J8zrag_2e0UWYwpH5tqQVrPcpRGDLe3o-bOkFsw4Shp7FVwa9pIL3h6X3NFqCsBHSnZhNwFZD5AlqI4KQEdcAAnGAM9gpQTL_XsheiiFW8Ck1ySois285aS2xM/s1600/Wikipedia-spotlight.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx5QAqy1v0ugivTFTW9J8zrag_2e0UWYwpH5tqQVrPcpRGDLe3o-bOkFsw4Shp7FVwa9pIL3h6X3NFqCsBHSnZhNwFZD5AlqI4KQEdcAAnGAM9gpQTL_XsheiiFW8Ck1ySois285aS2xM/s320/Wikipedia-spotlight.jpg" width="320" /></a></div>
<br />
Similarly, when comparing products and deciding what to buy, you construct a simplified model of the options available and ignore all but the most important features. Selective attention simultaneously moves some aspects into the foreground and pushes everything else into the background, such is the <a href="https://en.wikipedia.org/wiki/Figure%E2%80%93ground_(perception)">nature of perception and cognition</a>.<br />
<br />
[Note: see "<a href="http://pages.stern.nyu.edu/~xgabaix/papers/sparsebr.pdf">A Sparsity-Based Model of Bounded Rationality</a>" for an economic perspective.]<br />
<br />
Given that seeing and thinking are sparse by design, why not extend that sparsity to the statistical models used to describe human judgment and decision making? The cooler mentioned in the introductory paragraph is filled with beverages that fall into the goal-derived category of "things to drink on a hot summer day," and each has its own list of distinguishing features. The statistical modeling task begins with many options and even more distinguishing features, so the number of potential predictors is large. However, any particular individual selectively attends to only a small subset of products and features. This is what we mean by sparse predictive models: many variables in the equation, but only a few with nonzero coefficients.<br />
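A minimal Python sketch of that definition, with scikit-learn's Lasso standing in for glmnet (the simulated data, the three "attended" features, and the penalty value are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# 200 simulated consumers and 25 candidate features, but only three
# features -- the ones attended to -- actually drive the response.
X = rng.normal(size=(200, 25))
true_beta = np.zeros(25)
true_beta[[0, 5, 12]] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.normal(scale=0.5, size=200)

fit = Lasso(alpha=0.1).fit(X, y)
n_nonzero = int(np.sum(fit.coef_ != 0))
print(n_nonzero)  # far fewer than 25 coefficients survive the L1 penalty
```

The fitted equation keeps every variable but zeroes out the coefficients of the features no one attends to, which is exactly the sparsity described above.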
<br />
[Note: To keep the two terminologies straight, be careful not to confuse sparse models, in which most parameters equal zero, with <a href="https://en.wikipedia.org/wiki/Sparse_matrix">sparse matrices</a>, which concern the storage and manipulation of large data sets with many cells equal to zero.]<br />
<br />
<a href="https://www.crcpress.com/product/isbn/9781498712163"><b>Statistical Learning with Sparsity</b></a><br />
<br />
A direct approach might "<a href="http://andrewgelman.com/2013/12/16/whither-the-bet-on-sparsity-principle-in-a-nonsparse-world/">bet on sparsity</a>" and argue that only a few coefficients can be nonzero given the limitations of human attention and cognition. The <a href="http://www.r-bloggers.com/trevor-hastie-presents-glmnet-lasso-and-elastic-net-regularization-in-r/">R package glmnet</a> will impose a budget on the total costs incurred from paying attention to many features when making a judgment or choice. Thus, with a limited span of attention, we would expect to be able to predict individual responses with only the most important features in the model. The modeler varies a tuning parameter controlling the limits of one's attention and watches predictors enter and leave the equation.<br />
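That sweep over the attention budget can be sketched with scikit-learn's lasso_path as a Python analog to glmnet (the simulated data and the grid of penalty values are assumptions for illustration, not glmnet defaults):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)

# Same sparse setup: 25 candidate features, three with real effects.
X = rng.normal(size=(200, 25))
beta = np.zeros(25)
beta[[0, 5, 12]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=200)

# Sweep the penalty from tight to loose and count active predictors.
alphas, coefs, _ = lasso_path(X, y, alphas=[1.0, 0.3, 0.1, 0.03, 0.01])
active = (coefs != 0).sum(axis=0)
for a, k in zip(alphas, active):
    print(f"alpha={a:.2f}  active predictors={k}")
```

Loosening the budget (a smaller penalty) lets more predictors into the equation, mirroring a wider span of attention.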
<br />
If everyone adopted the same purchase strategy, we could observe the purchase behavior of a group of customers and estimate a single set of parameters using glmnet. Instead of uniformity, however, we are more likely to find considerable heterogeneity, with a mixture of different segments and substantial variation within each segment. All that is necessary to violate homogeneity is for the product category to have a high end and a low end, which is certainly the case with cold beverages. Now, the luxury consumer and the price-sensitive will attend to different portions of the retail shelf and require that we be open to the possibility that our data are a mixture of different preference equations. Willingness to spend, of course, is but one of many possible ways of dividing up the product category. We could easily continue our differentiation of the cold beverage market by adding dimensions partitioning the cooler on the basis of calories, carbonation, coffees and teas, designer waters, alcohol content, and more.<br />
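A hedged sketch of why pooling such a mixture is risky, again in Python with purely simulated segments (two hypothetical groups weighting different features; nothing here comes from real beverage data): the pooled sparse fit dilutes each segment's weights, while per-segment fits recover each group's own equation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Two simulated segments: "luxury" buyers respond to feature 0,
# "price-sensitive" buyers respond to feature 1.
X = rng.normal(size=(400, 10))
segment = np.repeat([0, 1], 200)
betas = np.zeros((2, 10))
betas[0, 0] = 3.0
betas[1, 1] = -3.0
y = np.einsum("ij,ij->i", X, betas[segment]) + rng.normal(scale=0.5, size=400)

pooled = Lasso(alpha=0.1).fit(X, y)
per_segment = [Lasso(alpha=0.1).fit(X[segment == s], y[segment == s])
               for s in (0, 1)]

print(pooled.coef_[:2])          # each effect diluted by the other segment
print(per_segment[0].coef_[:2])  # near-full weight on feature 0
print(per_segment[1].coef_[:2])  # near-full weight on feature 1
```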
<br />
<a href="http://joelcadwell.blogspot.com/2014/09/the-new-consumer-requires-updated.html">Fragmented markets create problems</a> for all statistical models assuming homogeneity and not just glmnet. Attention, the product of goal-directed intention, generates separated communities of consumers with awareness and knowledge of different brands and features within the same product category. The high-dimensional feature space resulting from the <a href="http://joelcadwell.blogspot.com/2015/07/the-nature-of-heterogeneity-in.html">coevolving network of customer wants and product offerings</a> forces us to identify a homogeneous consumer segment before fitting glmnet or any other predictive model. What matters in judgment and choice depends on where we focus our attention, which follows from our intentions, and intentions vary between individuals and across contexts (see <a href="http://joelcadwell.blogspot.com/2014/01/context-matters-when-modeling-human.html">Context Matters When Modeling Human Judgment and Choice</a>).<br />
<br />
Preference is constructed by the individual within a specific context as an intention to achieve some desired end state. Yet, the preference construction process tends to produce a somewhat limited result. A security camera placed by the rear beverage cooler would record a quick scan, followed by a narrowing search, perhaps a reset and a second search, and finally either a purchase or an exit without one. The beverage company has spent considerable money "teaching" you about their brand and the product category. You know what to look for before you enter the store because, as a knowledgeable consumer, you have <a href="http://joelcadwell.blogspot.com/2015/03/brand-and-product-category.html">learned a brand and product category representation</a> and you have decided on an ideal positioning for yourself within this representation. For example, you know that beverages can be purchased in different size glass or plastic bottles, and you prefer the 12-ounce plastic bottle with a twist top.<br />
<br />
Container preferences are but one of the building blocks acquired in order to complete the beverage purchase process. We can identify all the <a href="http://joelcadwell.blogspot.com/2014/11/building-blocks-compelling-image-for.html">building blocks using nonnegative matrix factorization</a> (NMF) and use that information to cluster consumers and features simultaneously. This is how we discover which consumers quickly find the regular colas and decide among the brands, types, flavors, sizes, and containers available within this subcategory. Finally, we have our relatively homogeneous dataset of regular cola drinkers for the glmnet R package to analyze. More accurately, we will have separate datasets for each joint customer-feature block and will need to derive different equations with different variables for different consumer segments.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-590043897961646114.post-87269195146576695562015-07-21T10:09:00.000-07:002015-07-21T10:09:49.374-07:00"Models, Models Everywhere!" Brought to You by RStatistical software packages sell solutions. If you go to the <a href="http://www.sas.com/en_us/home.html">home page for SAS</a>, they will tell you upfront that they sell products and solutions. They link both together under the first tab just below "The Power to Know" mantra. SPSS separates product and solution into separate tabs, but places both <a href="http://www-01.ibm.com/software/analytics/spss/solutions.html">next to each other on its home page</a> as the first and second clicks. Obviously, both companies are in the solutions business; you have a problem, they have a solution. It's a good positioning to attract customers who are overworked and in over their heads. To be clear, no one is questioning the analytics. SPSS and SAS are not selling snake oil, but they are selling something that is designed to appeal to potential customers with more money than time to spend.<br />
<br />
R, on the other hand, appeals to the analyst looking outside the traditional box filled with a limited set of statistical models that keep us collecting the same data year after year and running the same analysis each time. My example comes from marketing research, where we are repeatedly asked to do "something multivariate" with ratings of idealized features (e.g., cost without price points, quality lacking any specifications, and customer service stripped of context). Before you propose to <a href="http://joelcadwell.blogspot.com/2015/05/respecting-real-world-decision-making.html">replace the rating with some ranking task</a> (e.g., MaxDiff), let me remind you that the problem is not the rating but the abstract feature without a referent.<br />
<br />
The <a href="http://joelcadwell.blogspot.com/2014/10/beware-graphical-networks-from-rating.html">solution is to get concrete</a>, if only our analytic tools would stop lagging behind our data collection capabilities. With decontextualized features, we could pretend that we were all on the same page and speaking of the same thing. The details, however, reveal the heterogeneity of product usage and experience. The global space defined by price, quality, and service becomes parallel worlds with concentrations of customers paying different amounts for product versions of varying quality with diverging expectations and needs for service. I have many more variables and even more missing data. More importantly, I have non-overlapping customer-feature blocks accompanying each community held together by common usage occasions.<br />
<br />
This characterization of the data as local places within a global space came, not from marketing research, but from <a href="http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf">matrix factorization techniques for recommender systems</a>. Modeling preferences for movies and songs has <a href="http://joelcadwell.blogspot.com/2014/08/exploiting-heterogeneity-to-reveal.html">altered the way we look at all consumption</a>. Everything has become more complex. The traditional clustering models started with feature selection and one set of variables for everyone. Similarly, although factorial invariance across distinct populations might require some preliminary examination, we believed that ultimately we would be able to identify a common group of respondents with which we could perform all dimension reduction. After Netflix and Spotify, all we can see are niche-genre pairings of customers and product features.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JGp3urCiIKQ9SHM6HZzrFLQ3g4R6lmlW2wVnfRkGa1eJ4VDzm87v8JOkbbLuMQuLPVac25jHhNK8lapLJjuknjYuxf6rxFyXFkSU82oX1pNVHS4C8Ym3dPXYko0ug6wrH8EI1QTUJJw/s1600/eye+sees.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0JGp3urCiIKQ9SHM6HZzrFLQ3g4R6lmlW2wVnfRkGa1eJ4VDzm87v8JOkbbLuMQuLPVac25jHhNK8lapLJjuknjYuxf6rxFyXFkSU82oX1pNVHS4C8Ym3dPXYko0ug6wrH8EI1QTUJJw/s400/eye+sees.jpg" width="400" /></a></div>
<br />
Of course, all of this is brought to you by R. SAS and SPSS need a business model before they incorporate the latest procedures. R, on the other hand, provides a platform for innovation by others, academics and entrepreneurs, willing to share and promote their best work. The result is a continuous stream of new ways of seeing and thinking embedded in a diverse collection of models and algorithms, which we call R packages. You can find a listing of all the innovative approaches for jointly blocking the rows and columns of a data matrix under the heading "Simultaneous Clustering in R" in my post <a href="http://joelcadwell.blogspot.com/2014/09/the-ecology-of-data-matrices-metaphor.html">The Ecology of Data Matrices</a>.<br />
<br />
Models are everywhere and from everywhere. R provides the interface enabling us to lift our heads out of our box and peek into the box down the road in someone else's field.Unknownnoreply@blogger.com0