
Sunday, November 1, 2015

Clustering Customer Satisfaction Ratings

We run our cluster analysis with great expectations, hoping to uncover diverse segments with contrasting likes and dislikes for the brands they use. Instead, too often, our K-means analysis returns the above graph of parallel lines, indicating that the pattern of high and low ratings is the same for everyone but at different overall levels. The data come from the R package semPLS and look very much like what one sees with many customer satisfaction surveys.

I will not cover any specifics about the data, but instead refer you to earlier discussions of this dataset, first in a post showing its strong one-dimensional structure using biplots and later in an example of an undirected graph or Markov network displaying brand associations.

We will begin with the mean ratings for the four lines in the above graph and include a relatively small fifth segment in the last column with a different narrative. Ordering the 23 items from lowest to highest mean scores over the entire sample makes both the table below and the graph above easier to read.

Segment              not at all   a little    some    a lot   pricey
Size                      9%         27%       34%     21%      10%
FairPrice                4.1         5.1       6.4     8.2      5.6
BuyAgain                 3.0         6.9       8.7     9.7      3.8
Responsible              4.3         6.2       6.9     8.1      7.0
GoodValue                4.8         5.7       7.3     8.6      7.0
ComplaintHandling        4.4         6.1       7.2     8.9      7.8
Fulfilled                4.8         6.0       7.5     8.6      7.8
IsIdeal                  4.6         6.2       7.7     9.0      7.8
NetworkQuality           5.6         6.2       7.4     8.4      8.0
Recommend                3.8         6.7       8.4     9.6      7.2
ClearInfo                4.6         6.5       7.9     9.2      8.6
Concerned                5.3         6.4       8.1     9.0      8.1
QualityExp               6.0         7.1       7.6     8.6      7.9
CustomerService          4.8         6.7       8.0     9.3      8.5
MeetNeedsExp             6.1         7.1       7.3     8.7      8.4
GoWrongExp               7.0         6.2       7.5     8.5      8.5
Trusted                  6.1         6.6       7.8     9.1      8.4
Innovative               5.8         7.4       8.1     9.2      8.2
Reliability              6.1         6.8       7.9     9.2      8.7
RangeProdServ            6.2         7.1       8.0     9.2      8.3
Stable                   7.1         6.7       7.8     9.1      8.3
OverallQuality           6.3         7.0       8.2     9.2      8.5
ServiceQuality           6.3         7.1       7.9     9.4      8.6
OverallSat               6.4         7.3       8.2     8.9      8.7

You can pick any row in this table and see that the first four segments, holding 90% of the customers, are ordered the same. The first cluster is simply not at all happy with their mobile phone provider. They give the lowest Buy Again and Recommend ratings; in fact, with only two small exceptions, they uniformly give the lowest scores. For every row the second column is larger (allowing for the two discrepancies just mentioned), followed by an even bigger third column, and then the most favorable fourth column. Successful brands have loyal customers, and at least one out of five customers in these data has "a lot" of love, with a mean Buy Again rating of 9.7 on a 10-point scale.

You can see why I labeled these four segments with names suggesting differing levels of attraction. Each group has the same profile, as can be seen in the largely parallel lines on our graph. The good news for our providers is that only 9% are definitely at risk. The bad news is that another 10% like the product and the service but will not buy again, perhaps because the price is not perceived as fair (see their graph below with a dip for the second variable, Buy Again, and a much lower score than expected for the first variable, Fair Price, given the elevation of the rest of the curve).



Some might argue that what we are seeing is merely a measurement bias reflecting a propensity among raters to use different portions of the scale. Does this mean that 90% of the customers have identical experiences but give different ratings due to some scale-usage predisposition? If it is a personality trait, does this mean that they use the same range of scale values to rate every brand and every product? Would we have seen individuals using the same narrow range of scores had the items been more specific and more likely to show variation, for example, if they had asked about dropped calls and dead zones rather than network quality?
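
If scale usage were the whole story, then removing each respondent's overall elevation before clustering ought to break up the parallel lines. Below is a minimal sketch of that check, assuming mobi has been loaded and renamed as in the code at the end of this post; it is a probe, not a definitive test.

# row-center the ratings to remove each respondent's overall elevation
centered <- sweep(mobi[, -9], 1, rowMeans(mobi[, -9]))

# recluster; if the parallel lines were pure scale usage, these
# profiles should now differ in shape rather than in level
kcl_centered <- kmeans(centered, 5, nstart = 25)
round(t(kcl_centered$centers), 2)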

Given questions without any concrete referent, the uniform patterns of high and low ratings across the items are shaped by a network of interconnected perceptions resulting from a common technology and a shared usage of that technology. In addition, one overhears a good deal of discussion about the product category in the media and from word-of-mouth so that even a nonuser might be aware of the pros and cons. As a result, we tend to find a common ordering of ratings with some customers loving it all "a lot" and others "not at all." Unless customers can provide a narrative (e.g., "I like the product and service, but it costs too much"), they will all reproduce the same profile of strengths and weaknesses at varying levels of overall happiness. That is, satisfied or not, almost everyone seems to rate value and price fairness lower than they score overall quality and satisfaction.

Finally, my two prior posts cited earlier may seem to paint a somewhat contradictory picture of customer satisfaction ratings. On the one hand, we are likely to find a strong first principal component indicating the presence of a single dimension underlying all the ratings. Customer satisfaction tends to be one-dimensional, so we might expect to observe the four clusters with parallel lines of ratings. Satisfaction falls for everyone as features and services become more difficult for any brand to deliver. On the other hand, the graph of the partial correlations suggests a network of interconnected pairs of ratings after controlling for all the remaining items. One can identify regions with stronger relationships among items measuring quality, product offering, corporate citizenship, and loyalty.

Both appear to be true. Ratings with the highest partial intercorrelations form local neighborhoods with thicker edges in our undirected graph. Although some nodes are more closely related, all the variables are still connected either directly with a pairwise edge or indirectly through a separating node. Everything is correlated, but some are more correlated than others.
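
The one-dimensional claim is easy to verify with a quick principal component analysis, sketched below under the assumption that mobi has been loaded and renamed as in the code that follows.

# how much variation does the first principal component capture?
pca <- prcomp(mobi[, -9], scale. = TRUE)
round(summary(pca)$importance[2, 1:3], 2)  # proportion of variance for PCs 1-3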

R code needed to reproduce these tables and plots.

library("semPLS")
data(mobi)
 
# descriptive names for graph nodes
names(mobi)<-c("QualityExp",
               "MeetNeedsExp",
               "GoWrongExp",
               "OverallSat",
               "Fulfilled",
               "IsIdeal",
               "ComplaintHandling",
               "BuyAgain",
               "SwitchForPrice",
               "Recommend",
               "Trusted",
               "Stable",
               "Responsible",
               "Concerned",
               "Innovative",
               "OverallQuality",
               "NetworkQuality",
               "CustomerService",
               "ServiceQuality",
               "RangeProdServ",
               "Reliability",
               "ClearInfo",
               "FairPrice",
               "GoodValue")
 
# kmeans with 5 clusters and 25 random starts
kcl5<-kmeans(mobi[,-9], 5, nstart=25)
 
# cluster profiles and sizes
cluster_profile<-t(kcl5$centers)
cluster_size<-kcl5$size
 
# row and column means
row_mean<-apply(cluster_profile, 1, mean)
col_mean<-apply(cluster_profile, 2, mean)
 
# Cluster profiles ordered by row means
# columns sorted so that 1-4 are increasing means
# and the last column has low only for buyagain & fairprice
# Warning: random start values likely to yield different order
sorted_profile<-cluster_profile[order(row_mean),c(4,3,1,2,5)]
 
# reordered cluster sizes and profiles
cluster_size[c(4,3,1,2,5)]/250
round(sorted_profile,2)
 
# plots for first 4 clusters
matplot(sorted_profile[,-5], type = c("b"), pch="*", lwd=3,
        xlab="23 Brand Ratings Ordered by Average for Total Sample",
        ylab="Average Ratings for Each Cluster")
title("Loves Me Little, Some, A Lot, Not At All")
 
# plot of last cluster
matplot(sorted_profile[,5], type = c("b"), pch="*", lwd=3,
        xlab="23 Brand Ratings Ordered by Average for Total Sample",
        ylab="Average Ratings for Last Cluster")
title("Got to Switch, Costs Too Much")

Wednesday, October 28, 2015

Graphical Modeling Mimics Selective Attention: Customer Satisfaction Ratings


As shown by the eye tracking lines and circles, there is more on the above screenshot than we can process simultaneously. Visual perception takes time, and we must track where the eye focuses by recording sequence and duration. The "50% off" and the menu items seem to draw the most attention, suggesting that the viewers were not men.

But what if the screen contained a correlation matrix?

The 23 mobile phone customer satisfaction ratings from an earlier post will serve as an illustration. The R code to access the data, calculate the correlation matrix and produce the graph can be found at the end of this blog entry.


All the correlations are positive, so we might tend to focus on the highest correlated pairs and then search for triplets with uniformly larger intercorrelations. Although 23x23 is not a particularly big matrix, it contains enough entries that uncovering a pattern is a difficult task.

Factor analysis is an option, yet it might impose more structure than desired. What if we believe that "overall quality" is an abstraction created from perceptions of the reliability and stability of the network and supportive services and not indicators reflecting the hidden presence of a latent quality dimension? That is, we want to maintain all those individual ratings and their separate pairwise connections as shown in the correlation matrix. Well, a graph might assist those of us with a short span of attention, for example, an undirected graph whose nodes are the individual ratings and whose edges represent the correlations between pairs of ratings.


The green edges indicate positive values, and the largest correlations have the thickest paths. Though everything is interconnected, the graph aligns neighbors with the strongest connections. One could track your eye movements as you attempt to discover some spatial organization: overall satisfaction centered amidst quality and service with retention and recommendation pulled toward good value and fair price. Alternatively, you might have noted four regions: quality toward the right, innovative and range of products/services near the top, service and support on the top-right side, and the final word on value and loyalty in the bottom-right corner.

Hopefully, the eye tracking analogy clarifies that the interpretative process involved in making sense of the graph mimics the graphical modeling that factors or decomposes a complex network into groupings of local relationships. There are just too many pairwise relationships for the graph or the person to assimilate in a single glance. Selective attention deconstructs the perceptual field, in this case, mobile phone customer experiences with their cellular providers.

Of course, not all decompositions are equally helpful for customers deciding whether or not to continue with their current product and provider. We must remember that our consumer is not alone and product purchase is not a solitary quest. In order to understand product reviews, marketing communications and user comments, the consumer tends to adopt the prevailing factorization shared by most in the market and shown in the above graph.

Finally, we are passing on a factor analysis model because we do not believe that hidden latent constructs generate the satisfaction ratings. To be clear, we could have run a factor analysis and identified factors. The factors, however, would be derivative and not generative. The undirected graph preserves the primacy of the separate ratings and represents the factors in the edges as local regions of higher connectivity.


R code to read the data, print the correlation matrix, and plot the correlation network map.

library("semPLS")
data(mobi)
 
# descriptive names for graph nodes
names(mobi)<-c("QualityExp",
               "MeetNeedsExp",
               "GoWrongExp",
               "OverallSat",
               "Fulfilled",
               "IsIdeal",
               "ComplaintHandling",
               "BuyAgain",
               "SwitchForPrice",
               "Recommend",
               "Trusted",
               "Stable",
               "Responsible",
               "Concerned",
               "Innovative",
               "OverallQuality",
               "NetworkQuality",
               "CustomerService",
               "ServiceQuality",
               "RangeProdServ",
               "Reliability",
               "ClearInfo",
               "FairPrice",
               "GoodValue")
 
# prints the correlation matrix
round(cor(mobi[,-9]),2)
 
# plots the correlation network map
library("qgraph")
qgraph(cor(mobi[,-9]), layout="spring",
       labels=names(mobi[-9]), label.scale=FALSE,
       label.cex=1, node.width=.5, minimum=.3)


Tuesday, October 13, 2015

The Network Underlying Consumer Perceptions of the European Car Market


The nodes have been assigned a color by the author so that the underlying distinctions are more pronounced. Cars that are perceived as Economical (in aquamarine) are not seen as Sporty or Powerful (in cyan). The red edges connecting these attributes indicate negative relationships. Similarly, a Practical car (in light goldenrod) is not Technically Advanced (in light pink). This network of feature associations replicates both the economical to luxury and the practical to advanced differentiations so commonly found in the car market. North Americans living in the suburbs may need to be reminded that Europe has many older cities with less parking and narrower streets, which explains the inclusion of the city focus feature.

The data come from the R package plfm, as I explained in an earlier post where I ran a correspondence analysis using the same dataset and where I described the study in more detail. The input to the correspondence analysis was a cross tabulation of the number of respondents checking which of the 27 features (the nodes in the above graph) were associated with each of 14 different car models (e.g., Is the VW Golf Sporty, Green, Comfortable, and so on?).

I will not repeat those details, except to note that the above graph was not generated from a car-by-feature table with 14 car rows and 27 feature columns. Instead, as you can see from the R code at the end of this post, I reformatted the original long vector with 29,484 binary entries and created a data frame with 1092 rows, a stacking of the 14 cars rated by each of the 78 respondents. The 27 columns, on the other hand, remain binary yes/no associations of each feature with each car. One can question the independence of the 1092 rows given that respondent and car are grouping factors with nested observations. However, we will assume, in order to illustrate the technique, that cars were rated independently and that there is one common structure for the 14-car European market. Now that we have the data matrix, we can move on to the analysis.

As in the last post, we will model the associative net underlying these ratings using the IsingFit R package. I would argue that it is difficult to assert any causal ordering among the car features. Which comes first in consumer perception, Workmanship or High Trade-In Value? Although objectively trade-in value depends on workmanship, it may be more likely that the consumer learns first that the car maintains its value and then infers high quality. A possible resolution is to treat each of the 27 nodes as a dependent variable in their own regression equation with the remaining nodes as predictors. In order to keep the model sparse, IsingFit fits the logistic regressions with the R package glmnet.

For instance, when Economical is the outcome, we estimate the impact of the other 26 nodes including Powerful. Then, when Powerful is the outcome, we fit the same type of model with coefficients for the remaining 26 features, one of which is Economical. There is nothing guaranteeing that the two effects will be the same (i.e., Powerful's effect on Economical = Economical's effect on Powerful, controlling for all the other features). Since an undirected graph needs a symmetric affinity matrix as input, IsingFit checks to determine if both coefficients are nonzero (remember that sparse modeling yields lots of zero weights) and then averages the coefficients when Economical is in the Powerful model and Powerful is in the Economical model (called the AND rule).
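
A single node-wise fit is easy to sketch directly with glmnet. Note that IsingFit selects the lasso penalty with an extended BIC, while the sketch below substitutes cross-validation for simplicity, so the selected neighborhood may differ; it also assumes the attribute is labeled "Economical" in colnames(car$freq1).

library(glmnet)

# one node-wise logistic regression: Economical as the outcome and
# the remaining 26 features as predictors
y <- rating[["Economical"]]
X <- as.matrix(rating[, names(rating) != "Economical"])
cvfit <- cv.glmnet(X, y, family = "binomial")

# the nonzero coefficients approximate the neighborhood of Economical
coef(cvfit, s = "lambda.min")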

Hastie, Tibshirani and Wainwright refer to this approach as "neighborhood-based" in their chapter on graph and model selection. Two nodes are in the same neighborhood when mutual relationships remain after controlling for everything else in the model. The red edge between Economical and Powerful indicates that each was in the other's equation and that their average was negative. IsingFit outputs the asymmetric weights in a data matrix called asymm.weights (Res$weiadj is symmetric after averaging). It is always a good idea to check this matrix and determine if we are justified in averaging the upper and lower triangles.
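
That check takes only a few lines. A sketch, using the Res object fit in the code at the end of this post:

# pair each coefficient with its mirror image before averaging
aw <- Res$asymm.weights
plot(aw[upper.tri(aw)], t(aw)[upper.tri(aw)],
     xlab = "coefficient with row node as outcome",
     ylab = "coefficient with column node as outcome")
abline(0, 1)  # points far from this line argue against averaging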

It should be noted that the undirected graph is not a correlation network because the weighted edges represent conditional independence relationships and not correlations. You need only go back to the qgraph() function and replace Res$weiadj with cor(rating) or cor_auto(rating) in order to plot the correlation network. The qgraph documentation explains how cor_auto() checks to determine if a Pearson correlation is appropriate and substitutes a polychoric when all the variables are binary.
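
In other words, a sketch of that swap:

library("qgraph")

# same plotting call as at the end of this post, but with correlations
# in place of the conditional-independence weights; cor_auto() picks an
# appropriate (here polychoric) correlation for the binary ratings
qgraph(cor_auto(rating), layout = "spring",
       labels = names(rating), label.scale = FALSE,
       label.cex = 1, node.width = .5)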

Sacha Epskamp provides a good introduction to the different types of network maps in his post on Network Model Selection Using qgraph. Larry Wasserman covers similar topics at an advanced level in his course on Statistical Machine Learning. There is a handout on Undirected Graphical Models along with two YouTube video lectures (#14 and #15). Wasserman raises some concerns about our ability to estimate conditional independence graphs when the data do not have just the right dependence structure (not too much and not too little), which is an interesting point-of-view given that he co-teaches the class with Ryan Tibshirani, whose name is associated with the lasso and sparse modeling.

# R code needed to reproduce the undirected graph
library(plfm)
data(car)
 
# car$data$rating is length 29,484
# 78 respondents x  14 cars x 27 attributes
# restructure as a 1092 row data frame with 27 columns
rating<-data.frame(t(matrix(car$data$rating, nrow=27, ncol=1092)))
names(rating)<-colnames(car$freq1)
 
# fits conditional independence model
library(IsingFit)
Res <- IsingFit(rating, family='binomial', plot=FALSE)
 
# Plot results:
library("qgraph")
# creates grouping of variables to be assigned different colors.
gr<-list(c(1,3,8,20,25), c(2,5,7,23,26), c(4,10,16,17,21,27), 
         c(9,11,12,14,15,18,19,22))
node_color<-c("aquamarine","lightgoldenrod","lightpink","cyan")
qgraph(Res$weiadj, fade = FALSE, layout="spring", groups=gr, 
       color=node_color, labels=names(rating), label.scale=FALSE, 
       label.cex=1, node.width=.5)

Thursday, October 8, 2015

The Graphical Network Associated with Customer Churn

The node representing "Will Not Stay" draws our focus toward the left side of the following undirected graph. Customers of a health care insurance provider were asked about their intentions to renew at the next sign-up period. We focus on those indicating the greatest potential for defection by creating a binary indicator separating those who say they will not stay from everyone else. In addition, before telling us whether or not they intended to switch health care providers, these customers were given a checklist and instructed to check all the events that recently occurred (e.g., price increases, higher prescription costs, provider not covering all expenses, hospital and doctor visits, and customer service contacts).

We should note that all we have are customer perceptions. There is no electronic record of price increases, claim rejections, direct billings by MDs or hospitals, customer service contacts, or doctor and hospital visits. That is, we do not have measures of the event occurrences that are independent of defection intention. Consequently, we have no justification for drawing an arrow from Premium Increases to Will Not Stay because the decision to churn impacts the willingness to check the Premium Up box. For example, everyone in the United States is likely to see some increase in their premiums, yet your willingness to check "yes" may depend on what else has occurred in your relationship with the insurance provider. Those wanting to remain dismiss the price increase as inflation or reframe it as essentially the same price, while those thinking of flight are more likely to take notice and take offense. It might help to think of this as a form of cognitive dissonance or simply selective attention. Regardless of the specifics of the cognitive and affective processes, the result is an undirected graph in which every node is both an outcome and a predictor.

The thickness of the lines indicates the strength of the connections. These edges represent the relationship between nodes controlling for all the other nodes in the graph. A checklist was provided, so all we have is a data matrix with either yes (=1) or no (=0). As I explained above, the only rating scale was dichotomized into Will Not Stay versus any other response. The data are proprietary, so all I can tell you is that there were more than a thousand customers, and each row was a profile of 11 binary variables coded zero or one. On the other hand, I can share the four lines of R code needed to run the analysis using the IsingFit R package and a data frame called "events2" with 11 columns and lots of rows containing only zeros and ones (see the end of this post). In addition, I can provide the link to a comprehensive overview of the methodology, A New Method for Constructing Networks from Binary Data. Those seeking more will find the notes from Sacha Epskamp's workshop very helpful.

Getting back to our network, it seems that when Premiums go up, so do Deductibles and Co-pays. Cost increases form a clique near the bottom of the graph with edges suggesting that anticipated defection co-varies with price increases. A similar effect can be seen for prescription costs near the top. However, nothing seems to encourage exit more than a provider's failure to pay. Or, at least those who will not stay checked the box associated with the provider not paying. Moreover, we can observe some separation and independence in this undirected graph. Visits to the doctor, a specialist, or the hospital have positive connections to customer churn only through the receipt of a bill or a customer service contact.

Hopefully, this example demonstrates that a lot can be learned from an undirected graphical representation of dichotomous survey data. Bayesian networks, more correctly called directed graphs, seem to attract a good deal of attention in marketing (e.g., BayesiaLab), as do structural equation models (see my previous post on Undirected Graphs When the Causality Is Mutual). In fact, my first post in this blog, Network Visualization of Key Driver Analysis, demonstrates how much can be summarized quickly and clearly in an undirected graph. Another post, Metaphors Matters, compares factor analysis and correlation network maps.

To be clear, a graph displays an adjacency matrix that can contain any measure, often an index of association, affiliation or affinity. Any similarity or distance matrix can be graphed. Thus, we need to be careful when we interpret the resulting graphs. In this case, the adjacency matrix contained the averaged coefficients from a sparse logistic regression with each node as the dependent variable and all the remaining nodes as predictors. This means that our graph is not a correlation network because the adjacency matrix does not contain correlations. It is more like a partial correlation network, except that the adjacency matrix does not contain partial correlations but something that can be interpreted like a partial correlation. Fortunately, you can work with the graph as representing the relationship between two nodes controlling for the rest while you learn the details of Ising discrete data graphing.



### Fit using IsingFit ###
library(IsingFit)
Res <- IsingFit(events2, family='binomial', plot=FALSE)
 
# Plot results:
library("qgraph")
qgraph(Res$weiadj, fade = FALSE, layout="spring", 
       labels=names(events2), label.scale=FALSE, 
       label.cex=1, node.width=.5)


Monday, October 5, 2015

Undirected Graphs When the Causality Is Mutual

Structural equation models impose causal order on a set of observations. We start with a measurement model: a list of theoretical constructs and a table assigning what is observed (manifest) to what is hidden (latent). Although it is possible to think of this assignment as formative rather than reflective, the default is a causal connection with the latent variables responsible for the observed scores. Next, we draw arrows specifying the cause and effect relationships among the latent variables. All of this is shown in great detail with a customer satisfaction example in the very well-written vignette for the R package semPLS, which uses partial least squares (PLS) to fit structural equations models (sem).

Your focus should be on the causal model and not the estimation technique. PLS is optional, and all the parameters can be estimated using maximum likelihood with the lavaan R package. However, you can get access to the dataset through the semPLS package, and you will not find a better description of this particular example or the steps involved in specifying and testing a SEM.

As always, there are issues. An earlier post raises a number of concerns with this tale of causal links, suggesting that we might be asked to assume too much when we impose a directionality on mutually interacting components. For example, when it requires effort to change product or service providers, it might be easier to believe that all competitors are the same and that it is futile to seek a better deal elsewhere. Here, the decision to Buy Again encourages us to rethink our dissatisfaction and raise the ratings over what would have been given had switching been easier. Such mutual dependencies are represented by undirected graphs, and for social scientists, the R package qgraph provides an introduction.

My goal in this post is a modest one: to demonstrate that one can learn a great deal from a series of customer ratings without needing to force the data into a causal model. This is achieved by examining the following partial correlation network.

You should recall that a graph is a visual display of some adjacency matrix. In this case we define adjacency as the partial correlation between two nodes after controlling for all the other nodes in the graph. Actually, our adjacency matrix is a bit more complicated because we applied the graphical lasso to obtain our estimates. The details are important, yet one can learn a great deal from the graph knowing little more than that the edges show us conditional association after removing the other nodes and that we have made some effort to eliminate as many edges as possible (a sparse undirected graph).

All the R code needed to replicate this analysis appears at the end of this post. One of the original 24 items, #9 SwitchForPrice, was removed because it had no edge to any of the other nodes in this partial correlation network (the semPLS documentation reveals that the question had a unique format).

One way to start is to identify the thickest edges connecting the remaining 23 customer perception, satisfaction and loyalty ratings. Unsurprisingly, good value and fair price "hang together" since endorsing one and rejecting the other would seem to be a contradiction. Similarly, stability is a key component of network quality, reliability defines service quality, and we do not recommend that which we are unwilling to buy again. These single edges connecting two ratings with common meanings may not be that informative.

What is interesting, however, is that we can read "the customer's mind" from the structure of the undirected graph.  First, all the quality measures form a grouping toward the left of the graph: stable, network quality, reliability, service quality, and overall quality. As we move toward the right, we encounter overall satisfaction along with its companion positive perceptions of trusted and fulfilled. In the region just above fall the product and service attributes with range of products and services, innovative, and customer service. Corporate responsibility is more toward the left with the loyalty measures below (e.g., buy again and recommend).

In general, expectations (go wrong, quality, and meet needs) are toward the top and behaviors near the bottom (complaint handling, recommend, and buy again). The most basic quality indicators are found on the left with the extras, such as good citizenship, appearing on the right (concerned, responsible, fair price, and good value).

Over time, customers form impressions and reach conclusions about the companies providing them goods and services. These attributions are mutually supportive and create a system of interdependencies that seeks an equilibrium. Disturbing that equilibrium anywhere within the system will have its consequences. A company that provides small incentives to current customers in order to encourage them to recruit new customers gets both the new customers and recommending customers with higher satisfaction and improved impressions. Recommendation is more than the result of a sequential causal process with satisfaction as an input. The incentive is an intervention with satisfaction as the outcome. The causality is mutual.


library("semPLS")
data(mobi)
 
# descriptive names for graph nodes
names(mobi)<-c("QualityExp",
              "MeetNeedsExp",
              "GoWrongExp",
              "OverallSat",
              "Fulfilled",
              "IsIdeal",
              "ComplaintHandling",
              "BuyAgain",
              "SwitchForPrice",
              "Recommend",
              "Trusted",
              "Stable",
              "Responsible",
              "Concerned",
              "Innovative",
              "OverallQuality",
              "NetworkQuality",
              "CustomerService",
              "ServiceQuality",
              "RangeProdServ",
              "Reliability",
              "ClearInfo",
              "FairPrice",
              "GoodValue")
 
library("qgraph")
 
# Calculates Sparse Partial Correlation Matrix
sparse_matrix<-EBICglasso(cor(mobi[,-9]), n=250)
 
# Plots results
ug<-qgraph(sparse_matrix, layout="spring", 
           labels=names(mobi[-9]), label.scale=FALSE,
           label.cex=1, node.width=.5)


Thursday, August 6, 2015

Matrix Factorization Comes in Many Flavors: Components, Clusters, Building Blocks and Ideals

Unsupervised learning is covered in Chapter 14 of The Elements of Statistical Learning. Here we learn about several data reduction techniques including principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA). Although on the surface they seem so different, each is a data approximation technique using matrix factorization with different constraints. We can learn a great deal if we compare and contrast these four major forms of matrix factorization.

Robert Tibshirani outlines some of these interconnections in a group of slides from one of his lectures. If there are still questions, Christian Thurau's YouTube video should provide the answers. His talk is titled "Low-Rank Matrix Approximations in Python," yet the only Python you will see is a couple of function calls that look very familiar. R, of course, has many ways of doing K-means and principal component analysis. In addition, I have posts showing how to run nonnegative matrix factorization and archetypal analysis in R.

As a reminder, supervised learning also attempts to approximate the data, in this case the Ys given the Xs. In multivariate multiple regression, we have many dependent variables so that both Y and B are matrices instead of vectors. The usual equation remains Y = XB + E, except that Y and E are matrices with as many rows as the number of observations and as many columns as the number of outcome variables, while B holds a column of coefficients for each outcome. The error is made as small as possible as we try to reproduce the set of dependent variables from the observed Xs.
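
Since lm() accepts a matrix of outcomes, the whole fit is a one-liner. A small simulated sketch; the data are purely illustrative.

set.seed(42)
# simulate a multivariate multiple regression: Y = XB + E
X <- matrix(rnorm(100 * 3), 100, 3)
B <- matrix(runif(3 * 2), 3, 2)
Y <- X %*% B + matrix(rnorm(100 * 2, sd = 0.1), 100, 2)

fit <- lm(Y ~ X)   # one call fits both outcome columns
coef(fit)          # recovers B, plus an intercept row near zero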


K-means and PCA

Without predictors we lose our supervision and are left to search for redundancies or patterns in our Ys without any Xs. We are free to test alternative data generation processes. For example, can variation be explained by the presence of clusters? As shown in the YouTube video and the accompanying slides from the presentation, the data matrix (V) can be reproduced by the product of a cluster membership matrix (W) and a matrix of cluster centroids (H). Each row of W contains all zeros except for a single one that stamps out that cluster profile. With K-means, for instance, cluster membership is all-or-none with each cluster represented by a complete profile of averages calculated across every object in the cluster. The error is the extent to which the observations in each grouping differ from their cluster profile.
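
Here is a minimal sketch of that identity using R's own kmeans() function; the iris measurements stand in for any numeric data matrix.

set.seed(1)
V <- scale(as.matrix(iris[, 1:4]))   # any numeric data matrix

km <- kmeans(V, centers = 3, nstart = 25)
W <- diag(3)[km$cluster, ]           # membership: one 1 per row
H <- km$centers                      # cluster centroid profiles

# each observation is approximated by its cluster's profile
approx_V <- W %*% H
sum((V - approx_V)^2)                # equals km$tot.withinss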


Principal component analysis works in a similar fashion, but now the rows of W are principal component scores and H holds the principal component loadings. In both PCA and K-means, V = WH but with different constraints on W and H. W is no longer all zeros except for a single one, and H is not a collection of cluster profiles. Instead, H contains the coefficients defining an orthogonal basis for the data cloud with each successive dimension accounting for a decreasing proportion of the total variation, and W tells us how much each dimension contributes to the observed data for every observation.
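
The same identity holds for PCA, continuing the sketch above with the same V.

pc <- prcomp(V)          # V as centered in the K-means sketch
W <- pc$x                # principal component scores
H <- t(pc$rotation)      # rows of H: the orthogonal basis

max(abs(V - W %*% H))    # ~0: the full-rank factorization is exact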

An early application to intelligence testing serves as a good illustration. Test scores tend to be correlated positively so that all the coefficients in H for the first principal component will be positive. If the tests include more highly intercorrelated verbal or reading scores along with more highly intercorrelated quantitative or math scores, then the second principal component will be bipolar with positive coefficients for verbal variables and negative coefficients for quantitative variables. You should note that the signs can be reversed for any row of H, since such a reversal only changes direction. Finally, W tells us the impact of each principal component on the observed test scores in data matrix V.

Smart test takers have higher scores on the first principal component, which uniformly increases all their test scores. Those with higher verbal than quantitative skills will also have higher positive values for their second principal component. Given its bipolar coefficients, this will raise the scores on the verbal tests and lower the scores on the quantitative tests. And that is how PCA reproduces the observed data matrix.

We can use the R package FactoMineR to plot the features (columns) and objects (rows) in the same space. The same analysis can be performed using the biplot function in R, but FactoMineR offers much more and supports it all with documentation. I have borrowed these two plots from an earlier post, Using Biplots to Map Cluster Solutions.


FactoMineR separates the variables and the individuals in order not to overcrowd the maps. As you can see from the percent contributions of the two dimensions, this is the same space so that you can overlay the two plots (e.g., the red data points are those with the highest projection onto the Floral and Sweetness vectors). One should remember that vector spaces are shown with arrows, and scores on those variables are reproduced as orthogonal projections onto each vector.

The prior post attempted to show the relationship between a cluster and a principal component solution. PCA relies on a "new" dimensional space obtained through linear combinations of the original variables. On the other hand, clusters are a discrete representation. The red points in the above individual factor map are similar because they are of the same type with any differences among these red dots due to error. For example, sweet and sour (medicinal on the plot) are taste types with their own taste buds. However, sweet and sour are perceived as opposites so that the two clusters can be connected using a line with sweet-and-sour tastes located between the extremes. Dimensions always can be reframed as convex combinations of discrete categories, rendering the qualitative-quantitative distinction somewhat less meaningful.


NMF and AA

It may come as no surprise to learn that nonnegative matrix factorization, given it is nonnegative, has the same form with all the elements of V, W, and H constrained to be zero or positive. The result is that W becomes a composition matrix with nonzero values in a row picking the elements of H as parts of the whole being composed. Unlike PCA where H may represent contrasts of positive and negative variable weights, H can only be zero or positive in NMF. As a result, H bundles together variables to form weighted composites.

The columns of W and the rows of H represent the latent feature bundles that are believed to be responsible for the observed data in V. The building blocks are not individual features but weighted bundles of features that serve a common purpose. One might think of the latent bundles using a "tools in the toolbox" metaphor. You can find a detailed description showing each step in the process in a previous post and many examples with the needed R code throughout this blog.
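
A parallel sketch with the NMF package, one of several R implementations; the only requirement on the data is nonnegativity.

library(NMF)

V_pos <- as.matrix(iris[, 1:4])   # raw scale: all entries nonnegative
fit <- nmf(V_pos, rank = 2)

W <- basis(fit)   # nonnegative composition weights for each observation
H <- coef(fit)    # nonnegative rows: the latent feature bundles

max(abs(V_pos - W %*% H))   # a low-rank approximation, not an exact fit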

Archetypal analysis is another variation on the matrix factorization theme with the observed data formed as convex combinations of extremes on the hull that surrounds the point cloud of observations. Therefore, the profiles of these extremes or ideals are the rows of H and can be interpreted as representing opposites at the edge of the data cloud. Interpretation seems to come naturally since we tend to think in terms of contrasting ideals (e.g., sweet-sour and liberal-conservative).

This is the picture used in my original archetypal analysis post to illustrate the point cloud, the variables projected as vectors onto the same space, and the locations of the 3 archetypes (A1, A2, A3) compared with the placement of the 3 K-means centroids (K1, K2, K3). The archetypes are positioned as vertices of a triangle spanning the two-dimensional space with every point lying within this simplex. In contrast, the K-means centroids are pulled more toward the center and away from the periphery.
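
The archetypes package used in that earlier post will fit the model; a brief sketch with the same illustrative data as above:

library(archetypes)

set.seed(2)
aa <- archetypes(as.matrix(iris[, 1:4]), k = 3)

H <- parameters(aa)    # 3 archetypal profiles at the edge of the cloud
W <- coef(aa)          # convex weights: nonnegative rows summing to 1
round(rowSums(W)[1:5], 2)
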
Why So Many Flavors of Matrix Factorization?

We try to make sense of our data by understanding the underlying process that generated that data. Matrix factorization serves us well as a general framework. If every variable were mutually independent of all the rest, we would not require a matrix H to extract latent variables. Moreover, if every latent variable had the same impact for every observation, we would not require a matrix W holding differential contributions. The equation V = WH represents that the observed data arise from two sources: W that can be interpreted as if it were a matrix of latent scores and H that serves as a matrix of latent loadings. H defines the relationship between observed and latent variables. W represents the contributions of the latent variables for every observation. We call this process matrix factorization or matrix decomposition for obvious reasons.

Each of the four matrix factorizations adds some type of constraint in order to obtain a W and H. Each constraint provides a different view of the data matrix. PCA is a variance maximizer yielding a set of components, each accounting for the most variation independent of all preceding components. K-means gives us boxes with minimum variation within each box. We get building blocks and individualized rules of assembly from NMF. Finally, AA frames observations as compromises among ideals or archetypes. The data analyst must decide which story best fits their data.

Monday, August 3, 2015

Sensemaking in R: A Plenitude of Models Makes for Good Storytelling

"Sensemaking is a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively."
Making Sense of Sensemaking 1 (2006)


Story #1: A Tale of Causal Links

A causal model can serve as a sensemaking tool. I have reproduced below a path diagram from an earlier post organizing a set of customer ratings based on their hypothesized causes and effects. As shown on the right side of the graph, satisfaction comes first and loyalty follows with input from image and complaints. Value and Quality perceptions are positioned as drivers of satisfaction. Image seems to be separated from product experience and causally prior. Of course, you are free to disagree with the proposed causal structure. All I ask is that you "see" how such a path diagram can be imposed on observed data in order to connect the components and predict the impact of marketing interventions.


Actually, the nodes are latent variables, and I have not drawn in the measurement model. The typical customer satisfaction questionnaire has many items tapping each construct. In my previous post referenced above, I borrowed the mobile phone dataset from the R package semPLS, where loyalty was assessed with three ratings: continued usage, switching to a lower price competitor, and likelihood to recommend. These items are seen as indicators of commitment and attachment, and their intercorrelations are due to their common cause, which we have labeled Loyalty.

Where Do Causal Models Come From? The data were collected at one point in time, but it is difficult not to impose a learning sequence on the ratings. That is, the analyst overlays the formation process onto the data as if the measurements were made as learning occurred. Brand image is believed to be acquired first, and expectations are thought to be formed before the purchase is made. Product experience is understood to come next in the sequence, followed by an evaluation and finally the loyalty decisions to continue using and recommend to others.

As I argued in the prior post, causation is not in the data because the ratings were not gathered over time. By the time the questionnaire is seen, dissonance has already worked its way backward, creating consistencies in the ratings. For instance, when switching is a chore, satisfaction and product perceptions are all higher than they would have been had changing providers been an easier task. In a similar manner, reluctantly recommending only when pressed for your opinion may reverse the direction of the arrows and at least temporarily raise all ratings. We shall see in the next section how ratings are interconnected by a network of consumer inferences reflecting not observed covariation but belief and semantics.


Story #2: Living on a One-Dimensional Love-Hate Manifold (Halo Effects)

Our first sensemaking tool, structural equation modeling, was shaped by an intricate plot with many characters playing fixed causal roles. Few believe that this is the only way to make sense of the connections among the different ratings. For some, including myself, the causal model seems a bit too rational. What happened to affect? Halo effects are thought of as a cognitive bias, but all summaries introduce bias measured by the variation about the centroid. In the case of customer satisfaction and loyalty, a pointer on a single evaluative dimension can reproduce all the ratings. You tell me that you are very satisfied with your mobile phone provider, and I can predict that you are not dropping a lot of calls.

The halo effect functions as a form of data comprehension. We learn what constitutes a "good" product or service before we buy. These are the well-formed performance expectations that serve as the tests for deciding satisfaction. We are upset when the basic functions that are must-haves are not delivered (e.g., failure of our mobile phone to pair with the car's Bluetooth), and we are delighted when extras are included that we did not expect (e.g., responsive customer support). Most of these expectations lie just below awareness until experienced (e.g., breakage and repair costs when the phone is dropped a short distance or onto a relatively soft surface).

This representation orders features and services as milestones along a single dimension so that one can read one's overall satisfaction from one's position along this path. You may be familiar with the usage of such sensemaking tools in measuring achievement (e.g., spelling ability is assessed by the difficulty of the words one can spell) or political ideology (e.g., a legislator's position along the liberal-conservative continuum depends on the bills voted for and against). In the same way, I evaluate brands and their products by the features and services they are able to provide. We simply reanalyze the same customer satisfaction rating data. The graded response model from the ltm R package will order both customers and the rating items along the same latent satisfaction dimension, as shown in my post Item Response Modeling of Customer Satisfaction.
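
A hedged sketch of that item response analysis, not the exact code from the earlier post: the 1-to-10 ratings are collapsed into three ordered categories so that grm() from ltm receives ordinal manifest variables.

library(ltm)
library(semPLS)
data(mobi)

# collapse the 1-10 ratings into three ordered categories
ordinal <- data.frame(lapply(mobi[, -9], cut, breaks = c(0, 5, 8, 10)))
fit <- grm(ordinal)

# customers and items now share one latent satisfaction dimension
factor.scores(fit)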

Perhaps you noticed that we have changed our perspective or shifted to a new paradigm. Feature ratings are no longer drivers of satisfaction; instead, they have become indicators of satisfaction. In Story #1, a Tale of Causal Links, the arrows go from the features to satisfaction and loyalty. Driver analysis accumulates satisfaction feature by feature with each adding a component to the overall reservoir of goodwill. However, in Story #2 all the ratings (features, satisfaction, and loyalty) fall along the same evaluative continuum from rage to praise. We can still display the interrelationship with a diagram, though we need to drop the arrows, for everything is interconnected in this network.

The manifold from Story #2 makes sense of the data by ranking features based on performance expectations. Some features and services are basic, and everyone scores well. The premium features and services, on the other hand, are those not provided by every product. Customers decide what they want and are willing to pay, and then they assess the ability of the purchased product to deliver. This is not a driver analysis, for the assessment of each component is not independent of the other components.

Those of us willing to live with the imperfections of our current product tend to rate the product higher in a backward adjustment from loyalty to feature performance. You do something similar when you determine that switching is useless because all the competitors are the same. Can I alter your perceptions by tempting you with a $100 bonus or a free month of service to recommend a friend? It's a network of jointly determined nodes with a directionality represented by the love-hate manifold. The ability to generate satisfaction or engender loyalty is but another node, different from product quality perceptions, yet still part of the network.

How else can you explain how randomly attaching a higher price to a bottle of wine yields higher ratings for taste? Price changes consumer perceptions of quality because consumers make inferences about uncertain features based on what they know about more familiar features. When asked about customer support, you can answer even if you have never contacted or used customer support. You simply fill in a rating with an inference from other features with which you are more familiar, or you assume it must be good or bad because you are happy or unhappy overall. Such a network analysis can be done with R, as can the driver analysis from our first story.