Thursday, May 15, 2014

The Mind Is Flat! So Stop Overfitting Choice Models


Conjoint analysis and choice modeling rely on repeated observations from the same individuals across many different scenarios where the features have been systematically manipulated in order to estimate the impact of varying each feature. We believe that what we are measuring has substance and existence independent of the measurement process. Nick Chater, my source for this eerie figure depicting the nature of self-perception, lays to rest this "illusion of depth" in a short video called "The Mind is Flat." We do not possess the cognitive machinery demanded by utility theory. When we "make up our mind," we are literally making it up. Features do not possess value independent of the decision context. Features acquire value as we attempt to choose one option from many alternatives. Consequently, whatever consistency we observe results from recurring situations that constrain preference construction, not from some seemingly endless store of utilities buried deep in our memories.

Although it is convenient for the product manager to think of their offerings as bundles of features and services, the consumer finds such a representation to be overwhelming. As a result, the choice modeler is forced to limit how much each respondent is shown. The conflict in choice modeling is between the product manager who wants to add more and more features to the bundle and the analyst who needs to reduce task complexity so that respondents will participate in the research. At times, fractional factorial designs cannot reduce the number of choice sets sufficiently, so we turn to optimal designs with acceptable confounding (see design of experiments in R). Still, even our reduced number of choice scenarios may be too many for any one individual, so we show only a few scenarios to each respondent, make a few restrictive assumptions about homogeneity (e.g., introduce hyperparameters specifying the relationships between individual- and group-level parameters), and then proceed with hierarchical Bayes to compute separate estimates for every person in the study.
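As a concrete illustration of that design step, here is a minimal sketch of generating a D-optimal subset of profiles with the AlgDesign R package. The four features and their levels are hypothetical, chosen only to show the mechanics.

#full factorial: 3 x 3 x 2 x 2 = 36 possible profiles
library(AlgDesign)
full <- gen.factorial(levels=c(3,3,2,2), factors="all",
                      varNames=c("Price","Brand","Warranty","Support"))
 
#D-optimal subset of 12 profiles with acceptable confounding
set.seed(123)
des <- optFederov(~., data=full, nTrials=12)
des$design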

We justify such data collection by arguing that it is an "as-if" measurement model. Of course, people cannot retain in memory the utilities associated with every possible feature or service level. Clearly, no one is capable of the mental arithmetic necessary to do the computation in their heads. Yet, we rationalize the introduction of such unrealistic assumptions by claiming that they allow us to learn what drives choice and decision making. Thus, by asking a consumer to decide among feature bundles using only the information provided by the experimenter, one can fit a model and estimate parameters that will predict behavior in this specific setting. But our findings will not generalize to the marketplace because we are overfitting. The estimated utilities work only for this one particular task. What we have learned from behavioral economics over the last 30 years is that what is valued depends on the details of the decision context.

For those of you wishing a more complete discussion of these issues, I will refer you to my previous posts on Context Matters When Modeling Human Judgment and Choice, Got Data from People?, and Incorporating Preference Construction into the Choice Modeling Process.

Ecological Data Collection and Latent Variable Modeling

I am not suggesting that we abandon choice modeling or hierarchical Bayes estimation. A well-designed choice study that carefully mimics the actual purchase context can reveal a good deal about the impact of varying a small number of features and services. However, if our concern is learning what will happen in the marketplace when the product is sold, we ought to be cautious. Order and context effects will introduce noise and limit generalizability. Multinomial logistic models, such as those in the bayesm R package, teach us that feature importance depends on the range of feature levels and the configuration of all the other features varied across the choice scenarios. We are no longer in the linear world of rating-based conjoint via multiple regression with its pie charts indicating the proportional contribution of each feature.
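To make the hierarchical Bayes step concrete, below is a minimal sketch using rhierMnlRwMixture from bayesm on simulated data. Everything here (the number of respondents, choice sets, attributes, and part-worths) is invented for illustration; it is not the analysis behind any particular study.

#simulate choices: 100 respondents, 8 sets of 3 alternatives, 2 attributes
library(bayesm)
set.seed(42)
nresp <- 100; nsets <- 8; nalts <- 3; nattr <- 2
lgtdata <- vector("list", nresp)
for (i in 1:nresp) {
  beta_i <- c(1, -2) + rnorm(nattr, sd=0.5)  #individual part-worths
  X <- matrix(runif(nsets*nalts*nattr), ncol=nattr)
  u <- matrix(X %*% beta_i, ncol=nalts, byrow=TRUE)
  eps <- matrix(-log(-log(runif(nsets*nalts))), ncol=nalts)  #Gumbel errors
  lgtdata[[i]] <- list(y=apply(u + eps, 1, which.max), X=X)
}
 
#hierarchical Bayes multinomial logit with one mixture component
out <- rhierMnlRwMixture(Data=list(p=nalts, lgtdata=lgtdata),
                         Prior=list(ncomp=1), Mcmc=list(R=2000, keep=10))
beta_hat <- apply(out$betadraw, 1:2, mean)  #posterior means per respondent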

A good rule of thumb might be to include no more features than the number that would be shown on the product package or in a print ad. Our client's desire to estimate every possible aspect will only introduce noise and result in overfitting. On the other hand, simply restricting the number of features will not eliminate order effects. Whenever we present more than one choice scenario, we need to question whether our experimental arrangements have induced selection strategies that would not be present in the marketplace. Does varying price focus attention on price? Does the inclusion of one preferred feature level create a contrast effect and lower the appeal of the other feature levels? These effects are what we mean when we say the preference is not retrieved from a stable store kept in long-term memory.

It is unnecessary for consumers to store utilities because they can generate them on the fly given the choice task. "What do you feel like eating?" becomes a much easier question when you have a menu in your hands. We use the choice structure to simplify our task. I read down the menu imagining how each item might taste and select the most appealing one. I switch products or providers by comparing what I am using with the new offer. The important features are the ones that differentiate the two alternatives. If the task is difficult or I am not sure, then I keep what I have and preserve the status quo. In both cases context comes to our rescue.

The flexibility that characterizes human judgment and decision making flows from our willingness to adapt to the situation. That willingness, however, is not a free choice. We are not capable of storing, retrieving and integrating feature level utilities. You might remember the telephone game where one person writes down a message and whispers it to a second person, who whispers the message they heard to a third, and so on. Everyone laughs at the end of the telephone line when the last person repeats what they think they heard and it is compared to what was written. Such is the nature of human memory.

We can avoid overfitting by reducing error and simplifying our statistical models. These are the two goals of statistical learning theory. We keep the choice task realistic and avoid order effects. Occam's razor will trim our latent variables down to a single continuous dimension or a handful of latent classes. For example, the offerings within a product category are structured along a continuum from basic to premium. The consumer learns what is available and decides where they personally fall along this same continuum. Do they get everything they want and need from the lower end, or is it worth it to them to pay more for the extras? The consumer exploits the structure of the purchase context in order to simplify their purchase decision. If our choice modeling removes those supports, it no longer reflects the marketplace.

Choice remains complex, but now the complexity lies in the recognition phase. That is, choice begins with problem recognition (e.g., I need to retrieve email away from my desktop or I want to watch movies on the go or both at the same time). Framing of the choice problem determines the ad hoc or goal-derived category, which in turn shapes the consideration set (e.g., smartphones only, tablets only, laptops only, or some combination of the three product categories) and determines the evaluative criteria to be used in this particular situation. This is why I called this section ecological data collection. It is the approach that Donald Norman promotes when designing products for people. For the choice modeler, it means a shift in our statistical modeling from estimating feature-level utilities to pattern recognition and unsupervised learning.


Friday, May 9, 2014

Customer Satisfaction and Loyalty: Structural Equation Model or One-Dimensional Dissonance

Causal thinking is seductive. Product experience comes first, then feelings of satisfaction, and finally intentions to continue as a customer. Although customer satisfaction and loyalty data tend to be collected all at one time within the same questionnaire, who does not see the work of the invisible hand of causation? We call product and service ratings "drivers" of satisfaction because it is so easy to imagine experience impacting affect and intention. Thus, no one will be surprised to see the R package semPLS (Structural Equation Modeling using Partial Least Squares) using the customer satisfaction model as its example.

However, what if we were to ignore the causal model and look only at the data? The mobi dataset from the semPLS package is a data matrix with 250 rows containing ratings on a scale from 1 to 10 across 24 items measuring mobile phone customer satisfaction. All you need to know in order to run the SEM and interpret the results can be found in the above link to the well-written Journal of Statistical Software article. I, by contrast, will pretend that I have no causal model and treat the data set as any other battery of ratings. Let us start by exploring the intercorrelations among these variables.

A principal component analysis for this 250 x 24 data matrix yields a first principal component accounting for 39.5% of the total variation and a second principal component that is only one-sixth the size of the first with 6.6%. The size of the first principal component tells us a great deal about the amount of redundancy in the data matrix. The biplot will provide the visualization.

A biplot is a graphic display showing the 24 variables as vectors and the 250 observations as points projected onto the two-dimensional principal component space. That is, we can create a two-dimensional map with the first principal component as the x-axis and the second principal component as the y-axis. We calculate the two principal component scores for every respondent and use points to represent each row. We know the correlation between each variable and the principal components (i.e., the factor loadings), and we can use that knowledge to project the variables as lines or vectors onto the principal component space. As you might recall, the higher the correlation between two variables, the smaller the angle between the lines representing these variables.
This plot was created using the BiplotGUI R package. It shows the distribution of the 250 rows across the first two principal components as red boxes. The lines are the 24 ratings with tick marks indicating the scores from 1 to 10. Only abbreviations are provided, but you can find the actual questions in the semPLS documentation. With the exception of one variable, all the ratings point in the same direction toward the right. Given the small angles between all these lines, one would expect the correlation matrix to be filled with positive correlations. The "L" in CUSL# indicates loyalty. The second loyalty measure (would you switch providers for a price discount) seems to go its own way.

You may have noticed an arc of non-red circles. I picked one of the points toward the right side to show the arc of predicted values for one respondent with a high first principal component score. The same way that you would drop a perpendicular from a point to the x- and y-axes in order to determine the point's location on those dimensions, one can read out the predicted rating by the perpendicular projection of a point onto that variable's line. These are the dotted lines shown by BiplotGUI. One can clearly see that a high first principal component score results in uniformly high ratings across all 24 items. This predicted rating would have been the actual rating had the first two principal components accounted for 100% of the variation.
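If you would like to verify this geometry without the GUI, here is a minimal sketch using prcomp on the standardized mobi ratings: the rank-2 reconstruction (scores times transposed loadings) is the same prediction read off the biplot by perpendicular projection, just on the standardized rather than the 1-to-10 scale.

library(semPLS)
data(mobi)
pc <- prcomp(scale(mobi))
scores <- pc$x[, 1:2]  #250 points in the plane
loadings <- pc$rotation[, 1:2]  #24 variable directions
predicted <- scores %*% t(loadings)  #rank-2 approximation of the ratings
cor(predicted[, 1], scale(mobi)[, 1])  #recovery of the first item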

The "GUI" in BiplotGUI indicates that the function opens its own window that allows you to interact with the biplot. This interaction will enable you to experience the relationship between a point's position in the two-dimensional principal component space and its scores on the 24 variables. The "arc" of varying colored circle tells us that those respondents toward the positive end of the first dimensions tended to rate everything higher.
The above biplot is identical to the first biplot except for the location of the arc, which is moved toward the mean. It illustrates the effect of a decreasing first principal component score. The first two principal components account for less than half of the total variation (39.5% + 6.6%), so there will be some discrepancy between our predicted and the actual ratings. Although it is not shown here, the BiplotGUI window provides a frame where you can see the actual and predicted ratings as you move the cursor across the biplot.

Let me show one more arc from the other side of the first principal component. Now the arc is located toward the lower end of the x-axis and shows how these respondents give uniformly lower ratings.
I would encourage you to install BiplotGUI and semPLS, copy the few lines of R code at the end of this post, and interact with the biplot by moving your cursor across the space. As I note in the R code, you will need to activate "Predict points closest to cursor position" by right-clicking on the biplot display. By selecting the Prediction tab in the adjacent frame, you will be able to see the actual and predicted ratings for each respondent. Seeing the consistency with which all the ratings move together as different respondents are selected might just change your mindset.

Do I really need all these separate latent constructs with their causal connections? Does Occam's Razor shred the structural equation model? True, the network display of a causal model, as shown below, does provide an organization for the 24 ratings. Conceptually, we can distinguish between performance, satisfaction, and loyalty. Obviously, over time, experience comes first, followed by feelings of satisfaction and then loyalty intentions. The directed graph makes causation a compelling inference.
But the ratings are all that I have, and those ratings were all collected on a single occasion. A longitudinal study may be able to separate performance, satisfaction, and loyalty. By the time I make my measurements, the directed arrows have become feedback loops. Thus, my evaluation of my brand's performance depends on whether or not a better alternative is available or how much effort is needed to switch providers. Sometimes it is convenient to believe that all companies are the same. On the other hand, once I decide to switch, all my ratings fall, that is, until I investigate further and discover problems with my "new" provider. I am not arguing that all the ratings will be identical. Every provider will have strengths and weaknesses. But mean-level item differences are not separate latent variables. As long as the ratings move together as a cohesive unit, we have one latent dimension.

To be clear, I am not claiming that unresolved problems will not impact satisfaction and retention. However, the key word is "unresolved" and the frequency of such problems tends to be low among current customers. The unhappy churn unless there are barriers to exit. Our data matrix is a mixture of respondents: a few "hostages" waiting to be freed, some "shoppers" looking for a better deal, a plurality of "inerts" who prefer not to think about it, and a brand-specific percentage of "advocates" making recommendations and active on social media. I have ordered these four components of our mixture model as they might appear along the first principal component. They are not well-separated, which is why a dimensional representation works so well (see this previous post for a more complete discussion of this issue).

In the end, perhaps it is cognitive dissonance, and not cause-and-effect, that binds the ratings together. Attitudes serve behavior. Do I switch, continue using, or simply not think about it at all? Do I become an advocate for the brand? Each of these alternative courses of action results from a complex interplay of product experience with each customer's usage situation and personal needs. We cannot simply assume that recommending the brand to others depends solely on the brand; there are personal gains and losses associated with the recommendation process that have nothing to do with the brand. Complex systems of thought and action can be described, but not by causal models.

R code to run analysis in this post:

#load semPLS and datasets
library(semPLS)
data(mobi)
data(ECSImobi)
 
#runs PLS SEM
ecsi <- sempls(model=ECSImobi, data=mobi, E="C")
ecsi
 
#proportion of total variation for each principal component (24 standardized items)
(prcomp(scale(mobi))$sdev^2)/24
 
#load and open BiPlotGUI
library(BiplotGUI)
Biplots(mobi, PointLabels=NULL)
#right click on biplot
#select "Predict points closest to cursor positions"
#Select Prediction Tab in top-right frame

Sunday, March 23, 2014

Warning: Clusters May Appear More Separated in Textbooks than in Practice

Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.

I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like "Shotguns and Pickups" and "Upper Crust"? Yet, are "Big City Blues" and "Low-Rise Living" really separate clusters or simply variations on a common set of dwelling constraints?

Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences are not clumping or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.

Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.

This is the Old Faithful geyser data from the "datasets" R package showing the waiting time in minutes between successive eruptions on the y-axis and the duration of the eruptions along the x-axis. It is worth your time to get familiar with Old Faithful because it is one of those datasets that gets analyzed over and over again using many different programs. There seem to be two concentrations of points: shorter eruptions with shorter waiting times and longer eruptions with longer waiting times. If we told the Mclust function from the mclust package that the scatterplot contains observations from G=2 groups, the function would produce a classification plot that looked something like this:

The red and the blue with their respective ellipses are the two normal densities that are getting mixed. It is such a straightforward example of finite mixture or latent class models (as these models are also called by analysts in other fields of research). If we discovered that there were two pools feeding the geyser, we could write a compelling narrative tying all the data together.

The mclust vignette or manual is comprehensive but not overly difficult. If you prefer a lecture, there is no better introduction to finite mixture than the MathematicalMonk YouTube video. The key to understanding finite mixture models is recognizing that the underlying latent variable responsible for the observed data is categorical, a latent class which we do not observe, but which explains the location and shape of the data points. Do you have a cold or the flu? Without medical tests, all we can observe are your symptoms. If we filled a room with coughing, sneezing, achy and feverish people, we would find a mixture of cold and flu with differing numbers of each type.

This appears straightforward, but in the general case, how does one decide on the number of categories, their proportions, and their means and covariance matrices? That is, those two ellipses in the above figure are drawn using a vector with means for the x- and y-axes plus a 2x2 covariance matrix. The means move the ellipse over the space, and the covariance matrix changes the shape and orientation of the ellipse. A good summary of the possible forms is given in Table 1 of the mclust vignette.

Unfortunately, the mathematics of the EM algorithm used to solve this problem gets complicated quickly. Fortunately, Chris Bishop provides an intuitive introduction in a 2004 video lecture. Starting at 44:14 of Part 4, you will find a step-by-step description of how the EM algorithm works with the Old Faithful data. Moreover, in Chapter 9 of his book, Bishop cleverly compares the workings of the EM and k-means algorithms, leaving us with a better understanding of both techniques.
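To see the E- and M-steps in action, here is a bare-bones sketch of EM for a two-component univariate normal mixture fit to the Old Faithful waiting times. The starting values are eyeballed from the histogram; mclust does all of this, with smarter initialization, for you.

#EM for a two-component normal mixture of waiting times
data(faithful)
y <- faithful$waiting
mu <- c(50, 80); sigma <- c(5, 5); p1 <- 0.5  #starting values
for (iter in 1:100) {
  #E-step: responsibility of component 1 for each observation
  d1 <- p1 * dnorm(y, mu[1], sigma[1])
  d2 <- (1 - p1) * dnorm(y, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)
  #M-step: update mixing proportion, means, and standard deviations
  p1 <- mean(r1)
  mu <- c(sum(r1 * y) / sum(r1), sum((1 - r1) * y) / sum(1 - r1))
  sigma <- sqrt(c(sum(r1 * (y - mu[1])^2) / sum(r1),
                  sum((1 - r1) * (y - mu[2])^2) / sum(1 - r1)))
}
round(c(p1, mu, sigma), 2)  #estimated mixing proportion, means, and sds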

If only my data showed such clear discontinuity and could be tied to such a convincing narrative.

Product Categories Are Structured around the Good, the Better, and the Best

Almost every product category offers a range of alternatives that can be described as good, better, and best. Such offerings reflect the trade-offs that customers are willing to make between the quality of the products they want and the amount that they are ready to spend. High-end customers demand the best and will pay for it. On the other end, one finds customers with fewer needs and smaller budgets accepting less and paying less. Clearly, we have heterogeneity, but are those differences continuous or discrete? Can we tell by looking at the data?

Unlike the Old Faithful data with well-separated grouping of data points, product quality-price trade-offs look more like the following plot of 300 consumers indicating how much they are willing to spend and what they expect to get for their money (i.e., product quality is a composite index combining desired features and services).

There is a strong positive relationship between demand for quality and willingness to pay, so a product manager might well decide that there was opportunity for at least a high-end and a low-end option. However, there are no natural breaks in the scatterplot. Thus, if this data cloud is a mixture of distinct distributions, then these distributions must be overlapping.

Another example might help. As shown by John Cook, the distribution of heights among adults is a mixture of two overlapping normal distributions, one for men and another for women. Yet, as you can observe from Cook's plots, the mixture of men's and women's heights does not appear bimodal because the separation between the two distributions is not large enough. If you follow the links in Cook's post, eventually you will find the paper "Is Human Height Bimodal?", which clearly demonstrates that many mixtures of distributions appear to be homogeneous. We simply cannot tell that they are mixtures by looking only at the shape of the distribution for the combined data. The Old Faithful data with its well-separated bimodal curve provides a nice contrast, especially when we focus only on waiting time as a single dimension (Fitting Mixture Models with the R Package mixtools).
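Cook's point is easy to reproduce. Using illustrative numbers (means roughly 13 cm apart with a common 7 cm standard deviation; not the exact figures from the paper), the 50/50 mixture density shows no hint of bimodality:

#a mixture of two normals that looks unimodal
x <- seq(140, 210, length.out=500)
men <- dnorm(x, mean=178, sd=7)
women <- dnorm(x, mean=165, sd=7)
plot(x, 0.5*men + 0.5*women, type="l",
     xlab="Height (cm)", ylab="Density")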

Perhaps then, segmentation does not require gaps in the data cloud. As Wedel and Kamakura note, "Segments are not homogeneous groupings of customers naturally occurring in the marketplace, but are determined by the marketing manager's strategic view of the market." That is, one could look at the above scatterplot, see the presence of a strong first principal component from the closeness of the data points to the principal axis of the ellipse, and argue that customer heterogeneity is a single continuous dimension running from the low- to the high-end of the product spectrum. Or, one could look at the same scatterplot and see three overlapping segments seeking basic, value, and premium products (good, better, best). Let's run mclust and see whether we can find our three segments.


When instructed to look for three clusters, the Mclust function returned the above result. The 300 observed points are represented as a mixture of three distributions falling along the principal axis of the larger ellipse formed by all the observations. However, if I had not specified three clusters and asked Mclust to use its default BIC criterion to select the number of segments, I would have been told that there was no compelling evidence for more than one homogeneous group. Without any prior specification, Mclust would have returned a single homogeneous distribution, although as you can see from the R code below, my 300 observations were a mixture of three equal-sized distributions falling along the principal axis and separated by one standard deviation.

Number of Segments = Number Yielding Value to Product Management

Market segmentation lies somewhere between mass marketing and individual customization. When mass marketing fails because customers have different preferences or needs and customization is too costly or difficult, the compromise is segmentation. We do not need "natural" grouping, but just enough coherence for customers to be satisfied by the same offering. Feet come in many shapes and sizes. The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager's perspective.

My warning still holds. Reification can be dangerous. These segments are meaningful only within the context of the marketing problem created by trying to satisfy everyone with products and services that yield maximum profit. Some segmentations may return clusters that are well-separated and represent groups with qualitatively different needs and purchase processes. Many of these are obvious and define different markets. If you don't have a dog, you don't buy dog food. However, when segmentation seeks to identify those who feel that their dog is a member of the family, we will find overlapping clusters that we treat differently not because we have revealed the true underlying typology, but because it is in our interest. Don't be fooled into believing that our creative segment names reveal the true workings of the consumer mind.

Finally, what is true for marketing segmentation is true for all of cluster analysis. "Clustering: Science or Art?" (a 2009 NIPS workshop) raises many of these same issues for cluster analysis in general. Videos of this workshop are available at Videolectures. Unlike supervised learning with its clear criterion for success and failure, clustering depends on users of the findings to tell us if the solution is good or bad, helpful or not. On the one hand, this seems to make everything more difficult. On the other hand, it frees us to be more open to alternative methods for describing heterogeneity as it is now and how it evolves over time.  

We seek to understand the dynamic structure of diversity, which only sometimes takes the form of cohesive clusters separated by gaps. Other times, a model with only continuous latent variables seems to be the best choice (e.g., brand perceptions). And, not unexpectedly, there are situations where heterogeneity cannot be explained without both categorical and continuous latent variables (e.g., two or more segments seeking alternative benefit profiles with varying intensities).

Yet, even these three combinations cannot adequately account for all the forms of diversity we find in consumer data. Innovation might generate a structure appearing more like long arrays or streams of points seemingly pulled toward the periphery by an archetypal ideal or aspirational goal. And if the coordinate space of k-means and mixture models becomes too limiting, we can replace it with pairwise dissimilarity and graphical clustering techniques, such as affinity propagation or spectral clustering (a minimal sketch follows below). Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter their structure. Our models ought to be as diverse as the objects we are studying.
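For instance, here is a minimal sketch of affinity propagation using the apcluster package, applied to the simulated quality/price data (mydata) created in the code below. Unlike Mclust with G=3, no number of clusters is specified in advance.

#affinity propagation on the simulated segmentation data
library(apcluster)
ap <- apcluster(negDistMat(r=2), mydata)  #similarity = negative squared distance
length(ap@clusters)  #number of exemplars found
plot(ap, mydata)  #clusters drawn over the scatterplot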

R code for all figures and analysis

#attach faithful data set
data(faithful)
plot(faithful, pch="+")
 
#run mclust on faithful data
require(mclust)
faithfulMclust<-Mclust(faithful, G=2)
summary(faithfulMclust, parameters=TRUE)
plot(faithfulMclust)
 
#create 3 segment data set
require(MASS)
sigma <- matrix(c(1.0,.6,.6,1.0),2,2)
mean1<-c(-1,-1)
mean2<-c(0,0)
mean3<-c(1,1)
set.seed(3202014)
mydata1<-mvrnorm(n=100, mean1, sigma)
mydata2<-mvrnorm(n=100, mean2, sigma)
mydata3<-mvrnorm(n=100, mean3, sigma)
mydata<-rbind(mydata1,mydata2,mydata3)
colnames(mydata)<-c("Desired Level of Quality",
                    "Willingness to Pay")
plot(mydata, pch="+")
 
#run Mclust with 3 segments
mydataClust<-Mclust(mydata, G=3)
summary(mydataClust, parameters=TRUE)
plot(mydataClust)
 
#let Mclust decide on number of segments
mydataClust<-Mclust(mydata)
summary(mydataClust, parameters=TRUE)


Tuesday, January 28, 2014

Context Matters When Modeling Human Judgment and Choice

Herbert Simon was succinct when he argued that judgment and choice "is shaped by a scissors whose two blades are the structure of the task environment and the computational capabilities of the actor" (Simon, 1990, p.7). As a marketing researcher, I take Simon seriously and will not write a survey item without specifying the respondent's task and what cognitive processes will be involved in the task resolution.

Thus, when a client asks for an estimate of the proportion of European car rentals made in the United States that will be requests for automatic transmissions, I do not ask "On a scale from 1=poor to 10=excellent, how would you rate your ability to drive a car with a manual transmission?" Estimating one's ability, which involves an implicit comparison with others, does not come close to mimicking the car rental task structure. Nor would I ask for the likelihood of ordering an automatic transmission because probability estimation is not choice. Likelihood will tend to be more sensitive to factors that will never be considered when the task is choice. In addition, I need a context and a price. It probably makes a difference if the rental is for business or personal use, for driving in the mountains or in city traffic, the size of the vehicle, and much more. Lastly, the proportion of drivers capable of using a stick shift increases along with the additional cost for an automatic transmission. Given a large enough incremental price for automatic transmissions, many of us will discover our hidden abilities to shift manually.

The task structure and the cognitive processing necessary to complete the task determine what data need to be collected. In marketing research, the task is often the making of a purchase, that is, the selection of a single option from many available alternatives. Response substitution is not allowed. A ranking or a rating alters the task structure so that we are now measuring something other than what type of transmission will be requested. Different features become relevant when we choose, when we rate, and when we rank the same product configurations. Moreover, the divergence between choice and rating is only increased by repeated measures. The respondent will select the same alternative when minor features are varied, but that respondent will feel compelled to make minor adjustments in their ratings under the same conditions. Task structure elicits different cognitive processing as the respondent solves different problems. Rating, ranking, and choice are three different tasks. Each measures preference as constructed in order to answer the specific question.

Context matters when the goal is generalization, and one cannot generalize from the survey to the marketplace unless the essential task structure has been maintained. For example, I might wish to determine not only what type of car you intend to rent in your next purchase but what you might do over your next ten rentals. Now, we have changed the task structure because car rentals take place over time. We do not reserve our next ten rentals on a single occasion, nor can we anticipate how circumstances will change over time. The "next ten purchases question" seems to be more a measure of intensity than anticipated marketplace behavior.

Nor can one present a subset of available alternatives and ask for the most and least preferred from this reduced list without modifying the task structure and the cognitive processing used to solve the tasks. The alternatives that are available frame the choice task. I prefer A to B until you show me C, and then I decide not to buy anything. Or, adding a more expensive alternative to an existing product configuration increases the purchase of medium priced options by making them seem less expensive. Context matters. When we ignore it, we lose the ability to generalize our research to the marketplace. Finally, self-reports of the importance or contribution of features are not context-free; they simply lack any explicit context so that respondents can supply whatever context comes to mind or they can just chit-chat.

The implications for statistical modeling in R are clear. We begin with a description of the marketplace task. This determines our data collection procedures and places some difficult demands on the statistical model. For example, purchase requires a categorical dependent variable and a considerable amount of data to yield individual estimates. Yet, we cannot simply increase the number of choice sets given to each respondent because repeated measures from the same individual alters that individual's preferences (e.g., price sensitivity tends to increase over repeated exposures to price variation). Bayesian modeling within R allows us to exploit the hierarchical structure within a data set so that we can use data from all the respondents to compensate for our inability to collect much information from any one person. However, borrowing data from others in hierarchical Bayes is not unlike borrowing clothes from others; the sharing works only when the others are exchangeable and come from the same segment with a common distribution of estimates.

None of this seems to be traditional preference elicitation, where we assume that preference is established and well-formed, requiring only some means for expression. Preference or value is the latent variable responsible for all observed indicators. Different elicitation methods may introduce some unique measurement effects, but they all tap the same latent construct. Simon, on the other hand, sees judgment and decision making as a form of problem solving. Preferences can still be measured, but preferences are constructed as solutions to specific problems within specific task structures. Although preference elicitation is clearly not dead, we can expect to see increasing movement toward context awareness in both computing and marketing.




Friday, January 17, 2014

Metaphors Matter: Factor Structure vs. Correlation Network Maps

The psych R package includes a data set called "bfi" with self-report ratings on 25 personality items along a 6-point agreement scale. All the details are provided in the documentation accompanying the package. My focus is how to represent the correlations among these ratings: factor analysis or network graphics?

Let's start with the correlation network map produced by the R package qgraph. As always, all the R code can be found at the end of this post.


First, we need to discover the underlying pattern, so we will begin by looking for the nodes with the highest correlations, which are interconnected with the thickest lines. Red lines indicate negative correlations (e.g., those who claim that they are "indifferent to others" are unlikely to tell us that they "inquire about others" or "comfort others"). Positive correlations are shown in green (e.g., several nodes toward the bottom of the network suggest that those who report "mood swings" and "panic easily" also said that they are easy to anger and irritate). The node "reflect on things" seems to be misplaced, but it is not. The thin red and green lines suggest that it has uniformly low correlations with all the other items, which explains why it is positioned at the periphery but closest to the four items with which it is most correlated.

Using this approach, we can identify several regions that are placed near each other because of their interconnections. For instance, the personal problems mentioned previously and located toward the bottom of the graph are separated from but linked to the measures of introversion ("difficult approach others" and "don't talk"), which in turn have strong negative correlations with extroversion ("makes friends"). As we continue up the graph on the left side, we find an active openness to others that becomes take-charge and conscientious. If we continue back down the right side, respondents note what might be called work-related problems. Now, we have our story, and we can see the two-dimensional structure defining the correlation network: internal vs. external and in-control vs. out-of-control.

Next, we can compare this network representation with the more traditional factor model. Why do we observe correlations among observed variables? Correlations are the results of latent variables. We see this in the factor model diagram created using the same data. For example, individuals possess some degree of neuroticism (labeled RC2); therefore, the five personal problem items are intercorrelated. The path coefficient associated with each arrow indicates the correlation between the factor and the observed variable, and the product of the path coefficients for any two observed variables is our estimate of the correlation between those two observed variables.

One should recognize that the two diagrams seek to account for the same correlation matrix. The factor model does so by postulating the presence of unseen forces or latent variables. However, we never observe neuroticism, and we understand that all we have is a pattern of higher correlations among those five self-reports. Without compelling evidence for the independent existence of such a latent variable, we might try to avoid committing the reification fallacy and look for a different explanation.

The network model provides an alternative account. Perhaps the best overview of this approach can be found at the PsychoSystems Project. From a network perspective, correlations are observed because the nodes mutually interact. This is not a directed graph attempting to separate cause and effect. It is not a causal model. Perhaps in the beginning, there was a causal connection with one node occurring first and impacting the other nodes. But over time, these nodes have come to mutually support one another so that the unique effects of the self-report ratings can no longer be untangled.

Which of these two representations is better? If the observed variables are true reflections of an underlying trait that can be independently established, then the factor model offers a convenient hierarchical model. We think that we are observing five different things, but in fact we are measuring five different manifestations of one underlying construct. On the other hand, a network of mutually supportive observations cannot be represented using a factor model. There are no factors, and asserting so ends the discussion prematurely. What are the relationships among the separate nodes? How can one intervene to break the cycle? Are there multiple leverage points? In previous posts, I showed how much can be gained using a network visualization of a key driver analysis and how much can be lost relying solely on an input-output regression model. Besides, why would you not generate the map when, as shown below, R makes it so easy to do?


R code to create the two plots:

library(psych)
data(bfi)
 
ratings<-bfi[,1:25]
names(ratings)<-c(
  "Indifferent of others",
  "Inquire about others",
  "Comfort others",
  "Love children",
  "Make people at ease",
  "Exacting in my work",
  "Until perfect",
  "Do by plan",
  "Do halfway",
  "Waste time",
  "Don't talk",
  "Difficult approach others",
  "Know how to captivate people",
  "Make friends",
  "Take charge",
  "Angry easily",
  "Irritated easily",
  "Mood swings",
  "Feel blue",
  "Panic easily",
  "Full of ideas",
  "Avoid difficult reading",
  "Carry conversation higher",
  "Reflect on things",
  "Not probe deeply"
)
 
fa.diagram(principal(ratings, nfactors=5), main="")
 
library(qgraph)
qgraph(cor(ratings, use="pairwise"), layout="spring",
       label.cex=0.9, labels=names(ratings), 
       label.scale=FALSE)


Friday, January 10, 2014

Finding the R community a barrier to entry, Python looks elsewhere for lunch

Tal Yarkoni's post on "The homogenization of scientific computing, or why Python is steadily eating other languages' lunch" is an enjoyable read of his transition from R to Python. He makes a good case, and I have no argument with his reasoning or the importance of Python in his work. But my experience has not been the same. I am a methodologist working in marketing. I could have called myself a data analyst in the sense that John Tukey used that term back in his 1962 paper on The Future of Data Analysis. Bill Venables speaks of R in a similar manner and quotes Tukey in his keynote at UseR! 2012, "Statistics work is detective work!" I like that description.

So when I turn to R, I am looking for more than code. "The game is afoot!" I require all the usual tools and perhaps something new or from another field of research. As an example, marketing is concerned with heterogeneity because "one size does not fit all." But every field is concerned with heterogeneity. It's the second moment of a distribution. We refer to it as heterogeneity in marketing, but you might call it variability, variation, dispersion, spread, diversity, or individual differences. There are even more words for the attempt to summarize and explain the second moment: density estimation, finite mixtures, seriation, sorting, clustering, grouping, segmenting, graph cutting, partitioning, and tessellation. R has a package for every term, from many differing points of view, and with more on the way every day.

Detective work borrows whatever assists in the hunt. As a marketing scientist trying to understand customer heterogeneity, R provides everything I need for clustering and finite mixture modeling. Moreover, R contributors provide more than a program, writing some of the best and most insightful papers in the field. However, why restrict myself to traditional approaches to understanding heterogeneity when R includes access to archetypal analysis, item response theory, and latent variable mixture models? These are three very different approaches that I can borrow only because they share a common R language.  It is extremely difficult to learn from fields with a different vocabulary. Even if the underlying math is the same, everything is called by a different name. R imposes constraints on the presentation of the material so that comprehension is still difficult but no longer impossible.

Of course, Python also has a mixture package, and perhaps at some point in the future we will see a Python community that will compete with R. Until then, Python will have to skip lunch.


Monday, January 6, 2014

An Introduction to Statistical Learning with Applications in R

Statistical learning theory offers an opportunity for those of us trained as social science methodologists to look at everything we have learned from a different perspective. For example, missing value imputation can be seen as matrix completion, and recommender systems can be used to fill in empty questionnaire items that were never shown to more than a few respondents by design. It is not difficult to show how to run the R package softImpute that makes all this happen. But it can be overwhelming trying to learn about the underlying mechanism in enough detail that you have some confidence that you know what you are doing. One does not want to spend the time necessary to become a statistician, yet we need to be aware of when and how to use specific models, what can go wrong, and what to do when something goes wrong. At least with R, one can run analyses on data sets and work through concrete examples.
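Here is a minimal sketch of that matrix completion idea with softImpute, using simulated ratings where 70% of the cells are missing by design. The data-generating numbers are invented for illustration.

#matrix completion with softImpute on simulated ratings
library(softImpute)
set.seed(1001)
n <- 200; p <- 20
signal <- matrix(rnorm(n), n, 1) %*% matrix(runif(p, 0.5, 1.5), 1, p)
ratings <- signal + matrix(rnorm(n*p, sd=0.5), n, p)
holes <- matrix(runif(n*p) < 0.7, n, p)  #70% of items never shown
incomplete <- ratings
incomplete[holes] <- NA
 
fit <- softImpute(incomplete, rank.max=2, lambda=1)
filled <- complete(incomplete, fit)  #fill in the empty cells
cor(filled[holes], ratings[holes])  #recovery of the unseen entries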

The publication of An Introduction to Statistical Learning with Applications in R (download the book pdf) provides a gentle introduction with lots of R code. The book achieves a nice balance and is well worth looking at, both for the beginner and for the more experienced analyst who needs to explain these methods to others with less training. As a bonus, Stanford's OpenEdX has scheduled a MOOC by Hastie and Tibshirani beginning January 21 using this textbook.