Monday, August 3, 2015

Sensemaking in R: A Plenitude of Models Makes for Good Storytelling

"Sensemaking is a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively."
Making Sense of Sensemaking 1 (2006)


Story #1: A Tale of Causal Links

A causal model can serve as a sensemaking tool. I have reproduced below a path diagram from an earlier post organizing a set of customer ratings based on their hypothesized causes and effects. As shown on the right side of the graph, satisfaction comes first and loyalty follows with input from image and complaints. Value and Quality perceptions are positioned as drivers of satisfaction. Image seems to be separated from product experience and causally prior. Of course, you are free to disagree with the proposed causal structure. All I ask is that you "see" how such a path diagram can be imposed on observed data in order to connect the components and predict the impact of marketing interventions.


Actually, the nodes are latent variables, and I have not drawn in the measurement model. The typical customer satisfaction questionnaire has many items tapping each construct. In my previous post referenced above, I borrowed the mobile phone dataset from the R package semPLS, where loyalty was assessed with three ratings: continued usage, switching to lower price competitor, and likelihood to recommend. These items are seen as indicators of commitment and attachment, and their intercorrelations are due to their common cause, which we have labeled as Loyalty.

Where Do Causal Models Come From? The data were collected at one point in time, but it is difficult not to impose a learning sequence on the ratings. That is, the analyst overlays the formation process onto the data as if the measurements were made as learning occurred. Brand image is believed to be acquired first, and expectations are thought to be formed before the purchase is made. Product experience is understood to come next in the sequence, followed by an evaluation and finally the loyalty decisions to continue using and recommend to others.

As I argued in the prior post, causation is not in the data because the ratings were not gathered over time. By the time the questionnaire is seen, dissonance has already worked its way backward creating consistencies in the ratings. For instance, when switching is a chore, satisfaction and product perceptions are all higher than they would have been had changing providers been an easier task. In a similar manner, reluctantly recommending only when pressed for your opinion may reverse the direction of the arrows and at least temporarily raise all ratings. We shall see in the next section how ratings are interconnected by a network of consumer inferences reflecting not observed covariation but belief and semantics.


Story #2: Living on a One-Dimensional Love-Hate Manifold (Halo Effects)

Our first sensemaking tool, structural equation modeling, was shaped by an intricate plot with many characters playing fixed causal roles. Few believe that this is the only way to make sense of the connections among the different ratings. For some, including myself, the causal model seems a bit too rational. What happened to affect? Halo effects are thought of as a cognitive bias, but all summaries introduce bias measured by the variation about the centroid. In the case of customer satisfaction and loyalty, a pointer on a single evaluative dimension can reproduce all the ratings. You tell me that you are very satisfied with your mobile phone provider, and I can predict that you are not dropping a lot of calls.

The halo effect functions as a form of data comprehension. We learn what constitutes a "good" product or service before we buy. These are the well-formed performance expectations that serve as the tests for deciding satisfaction. We are upset when the basic functions that are must-haves are not delivered (e.g., failure of our mobile phone to pair with the car's Bluetooth), and we are delighted when extras are included that we did not expect (e.g., responsive customer support). Most of these expectations lie just below awareness until experienced (e.g., breakage and repair costs when dropped a short distance or onto a relatively soft surface).

This representation orders features and services as milestones along a single dimension so that one can read one's overall satisfaction from their position along this path. You may be familiar with the usage of such sensemaking tools in measuring achievement (e.g., spelling ability is assessed by the difficulty of words that one can spell) or political ideology (e.g., a legislator's position along the liberal-conservative continuum depends on the bills voted for and against). Thus, I assess your spelling ability by the difficulty of the words you can spell. I determine how liberal or conservative you are by the issues you support or oppose. And I evaluate brands and their products by the features and services they are able to provide. We simply reanalyze the same customer satisfaction rating data. The graded response model from the ltm R package will order both customers and the rating items along the same latent satisfaction dimension, as shown in my post Item Response Modeling of Customer Satisfaction.
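To make the idea of ordering customers and rating items along one dimension concrete, here is a minimal sketch of the category probabilities under Samejima's graded response model, written in plain Python rather than R so that it runs without any packages. The function name, the discrimination value, and the thresholds are all hypothetical choices for illustration, and the parameterization is the textbook form, which may differ from the one ltm reports.

```python
import math

def grm_category_probs(theta, discrimination, thresholds):
    """Rating-category probabilities for one item at satisfaction level theta.

    thresholds must be increasing; with K-1 thresholds there are K categories.
    """
    # Cumulative curves: probability of responding in category k or higher.
    cumulative = [1.0 / (1.0 + math.exp(-discrimination * (theta - b)))
                  for b in thresholds]
    # Pad with 1 (lowest category or higher) and 0 (beyond the top category).
    bounds = [1.0] + cumulative + [0.0]
    # Each category probability is the gap between adjacent cumulative curves.
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

# Hypothetical 5-point rating item with four ordered thresholds.
thresholds = [-2.0, -1.0, 0.5, 1.5]
for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, 1.5, thresholds)
    print(theta, [round(p, 3) for p in probs])
```

As theta moves from the hate end to the love end of the continuum, the most likely rating category shifts upward, which is the sense in which a single pointer can reproduce the ratings.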

Perhaps you noticed that we have changed our perspective or shifted to a new paradigm. Feature ratings are no longer drivers of satisfaction; instead, they have become indicators of satisfaction. In Story #1, a Tale of Causal Links, the arrows go from the features to satisfaction and loyalty. Driver analysis accumulates satisfaction feature by feature with each adding a component to the overall reservoir of goodwill. However, in Story #2 all the ratings (features, satisfaction, and loyalty) fall along the same evaluative continuum from rage to praise. We can still display the interrelationship with a diagram, though we need to drop the arrows, for everything is interconnected in this network.

The manifold from Story #2 makes sense of the data by ranking features based on performance expectations. Some features and services are basic and everyone scores well. The premium features and services, on the other hand, are those not provided by every product. Customers decide what they want and are willing to pay, and then they assess the ability of the purchased product to deliver. This is not a driver analysis for the assessment of each component is not independent of the other components.

Those of us willing to live with the imperfections of our current product tend to rate the product higher in a backward adjustment from loyalty to feature performance. You do something similar when you determine that switching is useless because all the competitors are the same. Can I alter your perceptions by tempting you with a $100 bonus or a free month of service to recommend a friend? It's a network of jointly determined nodes with a directionality represented by the love-hate manifold. The ability to generate satisfaction or engender loyalty is but another node, different from product quality perceptions, yet still part of the network.

How else can you explain how randomly attaching a higher price to a bottle of wine yields higher ratings for taste? Price changes consumer perceptions of quality because consumers make inferences about uncertain features based on what they know about more familiar features. When asked about customer support, you can answer even if you have never contacted or used customer support. You simply fill in a rating with an inference from other features with which you are more familiar or you simply assume it must be good or bad because you are happy or unhappy overall. Such a network analysis can be done with R, as can the driver analysis from our first story.

Wednesday, July 29, 2015

But I Don't Want to Be a Statistician!

"For a long time I have thought I was a statistician.... But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.... All in all, I have come to feel that my central interest is in data analysis...."

Opening paragraph from John Tukey "The Future of Data Analysis" (1962)


To begin, we must acknowledge that these labels are largely administrative based on who signs your paycheck. Still, I prefer the name "data analysis" with its active connotation. I understand the desire to rebrand data analysis as "data science" given the availability of so much digital information. As data has become big, it has become the star and the center of attention.

We can borrow from Breiman's two cultures of statistical modeling to clarify the changing focus. If our data collection is directed by a generative model, we are members of an established data modeling community and might call ourselves statisticians. On the other hand, the algorithmic modeler (originally considered a deviant but now rich and sexy) takes whatever data is available and makes black box predictions. If you need a guide to applied predictive modeling in R, Max Kuhn might be a good place to start.

Nevertheless, causation keeps sneaking in through the back door in the form of causal networks. As an example, choice modeling can be justified as "as if" predictive modeling, but then it cannot be used for product design or pricing. As Judea Pearl notes, most data analysis is "not associational but causal in nature."

Does an inductive bias or schema predispose us to see the world as divided into causes and effects with features creating preference and preference impacting choice? Technically, the hierarchical Bayes choice model does not require the experimental manipulation of feature levels, for example, reporting the likelihood of bus ridership for individuals with differing demographics. Even here, it is difficult not to see causation at work with demographics becoming stereotypes. We want to be able to turn the dial, or at least select different individuals, and watch choices change. Are such cognitive tendencies part of statistics?

Moreover, data visualization has always been an integral component in the R statistical programming language. Is data visualization statistics? And what of presentations like Hans Rosling's Let My Dataset Change Your Mindset? Does statistics include argumentation and persuasion?

Hadley Wickham and the Cognitive Interpretation of Data Analysis

You have seen all of his data manipulation packages in R, but you may have missed the theoretical foundations in the paper "A Cognitive Interpretation of Data Analysis" by Grolemund and Wickham. Sensemaking is offered as an organizing force with data analysis as an external tool to aid understanding. We can make sensemaking less vague with an illustration.

Perceptual maps are graphical displays of a data matrix such as the one below from an earlier post showing the association between 14 European car models and 27 attributes. Our familiarity with Euclidean spaces aids in the interpretation of the 14 x 27 association table. It summarizes the data using a picture and enables us to speak of repositioning car models. The joint plot can be seen as the competitive landscape and soon the language of marketing warfare brings this simple 14 x 27 table to life. Where is the high ground or an opening for a new entry? How can we guard against an attack from below? This is sensemaking, but is it statistics?



I consider myself to be a marketing researcher, though with a PhD, I get more work calling myself a marketing scientist. I am a data analyst and not a statistician, yet in casual conversation I might say that I am a statistician in the hope that the label provides some information. It seldom does.

I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing. My audience has no training in research or mathematics. Statistics plays a role and R helps, but I never wanted to be a statistician. Not that there is anything wrong with that.

Monday, July 27, 2015

Statistical Models of Judgment and Choice: Deciding What Matters Guided by Attention and Intention

Preference begins with attention, a form of intention-guided perception. You enter the store thirsty on a hot summer day, and all you can see is the beverage cooler at the far end of the aisle with your focus drawn toward the cold beverages that you immediately recognize and desire. Focal attention is such a common experience that we seldom appreciate the important role that it plays in almost every activity. For instance, how are you able to read this post? Automatically and without awareness, you see words and phrases by blurring everything else in your perceptual field.


Similarly, when comparing products and deciding what to buy, you construct a simplified model of the options available and ignore all but the most important features. Selective attention simultaneously moves some aspects into the foreground and pushes everything else into the background; such is the nature of perception and cognition.

[Note: see "A Sparsity-Based Model of Bounded Rationality" for an economic perspective.]

Given that seeing and thinking are sparse by design, why not extend that sparsity to the statistical models used to describe human judgment and decision making? That cooler mentioned in the introductory paragraph is filled with beverages that fall into the goal-derived category of "things to drink on a hot summer day" and each has its own list of distinguishing features. The statistical modeling task begins with many options and even more distinguishing features so that the number of potential predictors is large. However, any particular individual selectively attends to only a small subset of products and features. This is what we mean by sparse predictive models - many variables in the equation but only a few with nonzero coefficients.

[Note: To avoid getting lost in two different terminologies, be careful not to confuse sparse models, in which most parameters equal zero, with sparse matrices, which concern storing and manipulating large data sets with many cells equal to zero.]

Statistical Learning with Sparsity

A direct approach might "bet on sparsity" and argue that only a few coefficients can be nonzero given the limitations of human attention and cognition. The R package glmnet will impose a budget on the total costs incurred from paying attention to many features when making a judgment or choice. Thus, with a limited span of attention, we would expect to be able to predict individual responses with only the most important features in the model. The modeler varies a tuning parameter controlling the limits of one's attention and watches predictors enter and leave the equation.
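The attention budget can be sketched with a bare-bones lasso fit by coordinate descent, the same family of penalty that glmnet applies by default (alpha = 1). This is an illustration in plain Python rather than glmnet's actual implementation: it skips the intercept and the standardization that glmnet performs, and the simulated data, function names, and penalty values are assumptions made up for the example.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + penalty * sum(|b_j|) one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(v * v for v in col) / n
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(c * r for c, r in zip(col, residual)) / n + scale * beta[j]
            updated = soft_threshold(rho, penalty) / scale
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                beta[j] = updated
    return beta

# Simulated respondent: six features available, but only the first two matter.
rng = random.Random(2015)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]

# Loosening the budget (smaller penalty) lets more predictors into the equation.
for penalty in (5.0, 1.0, 0.5, 0.01):
    beta = lasso_coordinate_descent(X, y, penalty)
    print(penalty, [j for j, b in enumerate(beta) if b != 0.0])
```

The penalty plays the role of the tuning parameter in the text: a tight budget zeroes out every coefficient, and relaxing it admits first the features that carry the most signal.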

If everyone adopted the same purchase strategy, we could observe the purchase behavior of a group of customers and estimate a single set of parameters using glmnet. Instead of uniformity, however, we are more likely to find considerable heterogeneity with a mixture of different segments and substantial variation within each segment. All that is necessary to violate homogeneity is for the product category to have a high and low end, which is certainly the case with cold beverages. Now, the luxury consumer and the price sensitive will attend to different portions of the retail shelf and require that we be open to the possibility that our data are a mixture of different preference equations. Willingness to spend, of course, is but one of many possible ways of dividing up the product category. We could easily continue our differentiation of the cold beverage market by adding dimensions partitioning the cooler on the basis of calories, carbonation, coffees and teas, designer waters, alcohol content, and more.

Fragmented markets create problems for all statistical models assuming homogeneity and not just glmnet. Attention, the product of goal-directed intention, generates separated communities of consumers with awareness and knowledge of different brands and features within the same product category. The high-dimensional feature space resulting from the coevolving network of customer wants and product offerings forces us to identify a homogeneous consumer segment before fitting glmnet or any other predictive model. What matters in judgment and choice depends on where we focus our attention, which follows from our intentions, and intentions vary between individuals and across contexts (see Context Matters When Modeling Human Judgment and Choice).

Preference is constructed by the individual within a specific context as an intention to achieve some desired end state. Yet, the preference construction process tends to produce a somewhat limited result. A security camera placed by the rear beverage cooler would record a quick scan, followed by more activity as the search narrowed with a possible reset and another search begun or terminated without purchase. The beverage company has spent considerable money "teaching" you about their brand and the product category. You know what to look for before you enter the store because, as a knowledgeable consumer, you have learned a brand and product category representation and you have decided on an ideal positioning for yourself within this representation. For example, you know that beverages can be purchased in different size glass or plastic bottles, and you prefer the 12-ounce plastic bottle with a twist top.

Container preferences are but one of the building blocks acquired in order to complete the beverage purchase process. We can identify all the building blocks using nonnegative matrix factorization (NMF) and use that information to cluster consumers and features simultaneously. This is how we discover which consumers quickly find the regular colas and decide among the brands, types, flavors, sizes, and containers available within this subcategory. Finally, we have our relatively homogeneous dataset of regular cola drinkers for the glmnet R package to analyze. More accurately, we will have separate datasets for each joint customer-feature block and will need to derive different equations with different variables for different consumer segments.
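As a rough sketch of how NMF can block consumers and features at the same time, the following plain-Python example uses the Lee and Seung multiplicative updates on a tiny made-up purchase matrix and then assigns each row and column to its largest factor. It is not the algorithm or interface of the NMF R package; the data, the function names, the rank of 2, and the argmax assignment rule are all assumptions for illustration.

```python
import random

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    approx = matmul(W, H)
    return sum((v - a) ** 2 for vr, ar in zip(V, approx) for v, a in zip(vr, ar))

def nmf(V, rank, iterations=500, seed=1):
    """Multiplicative updates for V ~ W H with all entries kept nonnegative."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in V]
    H = [[rng.uniform(0.1, 1.0) for _ in V[0]] for _ in range(rank)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Hypothetical purchases: rows are consumers, columns are beverage features.
V = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

W, H = nmf(V, rank=2)
consumer_block = [row.index(max(row)) for row in W]
feature_block = [max(range(len(H)), key=lambda k: H[k][j]) for j in range(len(V[0]))]
print(consumer_block, feature_block, round(squared_error(V, W, H), 6))
```

Consumers and features that share a block label would form one customer-feature block, and each such block could then be handed to a separate predictive model.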

Tuesday, July 21, 2015

"Models, Models Everywhere!" Brought to You by R

Statistical software packages sell solutions. If you go to the home page for SAS, they will tell you upfront that they sell products and solutions. They link both together under the first tab just below "The Power to Know" mantra. SPSS separates product and solution into separate tabs, but places both next to each other on its home page as the first and second clicks. Obviously, both companies are in the solutions business; you have a problem, they have a solution. It's a good positioning to attract customers who are overworked and over their heads. To be clear, no one is questioning the analytics. SPSS and SAS are not selling snake oil, but they are selling something that is designed to appeal to potential customers with more money than time to spend.

R, on the other hand, appeals to the analyst looking outside the traditional box filled with a limited set of statistical models that keep us collecting the same data year after year and running the same analysis each time. My example comes from marketing research where we are repeatedly asked to do "something multivariate" with ratings of idealized features (e.g., cost without price points, quality lacking any specifications, and customer service stripped of context). Before you propose to replace the rating with some ranking task (e.g., MaxDiff), let me remind you that the problem is not the rating but the abstract feature without referent.

The solution is to get concrete, if only our analytic tools did not lag behind our data collection capabilities. With decontextualized features we could pretend that we were all on the same page and speaking of the same thing. The details, however, reveal the heterogeneity of product usage and experience. The global space defined by price, quality and service becomes parallel worlds with concentrations of customers paying different amounts for product versions of varying quality with diverging expectations and needs for service. I have many more variables and even more missing data. More importantly, I have non-overlapping customer-feature blocks accompanying each community held together by common usage occasions.

This characterization of the data as local places within a global space came, not from marketing research, but from matrix factorization techniques for recommender systems. Modeling preferences for movies and songs has altered the way we look at all consumption. Everything has become more complex. The traditional clustering models started with feature selection and one set of variables for everyone. Similarly, although factorial invariance across distinct populations might require some preliminary examination, we believed that ultimately we would be able to identify a common group of respondents with which we could perform all dimension reduction. After Netflix and Spotify, all we can see are niche-genre pairings of customers and product features.


Of course, all of this is brought to you by R. SAS and SPSS need a business model before they incorporate the latest procedures. R, on the other hand, provides a platform for innovation by others, academics and entrepreneurs, willing to share and promote their best work. The result is a continuous stream of new ways of seeing and thinking embedded in a diverse collection of models and algorithms, which we call R packages. You can find a listing of all the innovative approaches for jointly blocking the rows and columns of a data matrix under the heading "Simultaneous Clustering in R" in my post The Ecology of Data Matrices.

Models are everywhere and from everywhere. R provides the interface enabling us to lift our heads out of our box and peek into the box down the road in someone else's field.

Wednesday, July 15, 2015

Seeing Data as the Product of Underlying Structural Forms

Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects arrayed along the top in columns. Neither the objects nor the measurements need to be preserved in their original form.

It is helpful to remember that the entries in our data matrix are the result of choices made earlier, for not everything that can be recorded is tallied. We must decide on the unit of analysis (What objects go in the rows?) and the level of detail in our measurements (What variables go in the columns?). For example, multilevel data can be aggregated as we deem appropriate, so that our objects can be classroom means rather than individual students nested within classrooms and our measures can be total number correct rather than separate right and wrong for each item on the test.

Even without a prior structure, one could separately cluster the rows and the columns using two different distance matrices, that is, a clustering of the rows/columns with distances calculated using the columns/rows. As an example, we can retrieve a hierarchical cluster heat map from an earlier post created using the R heatmap function.


Here, yellow represents a higher rating, and lower ratings are in red. Objects with similar column patterns would be grouped together into a cluster and treated as if they constitute a single aggregate object. Thus, when asked about technology adoption, there appears to be a segment toward the bottom of the heatmap who foresee a positive return to investment and are not concerned with potential problems. The flip side appears toward the top with the pattern of higher yellows and lower reds suggesting more worries than anticipated joys. The middle rows seem to contain individuals falling somewhere between these two extremes.

A similar clustering could be performed for the columns. Any two columns with similar patterns down the rows can be combined and an average score calculated. We started with 8 columns and could end with four separate 2-column clusters or two separate 4-column clusters, depending on where we want to draw the line. A cutting point for the number of row clusters seems less obvious, but it is clear that some aggregation is possible. As a result, we have reordered the rows and columns, one at a time, to reveal an underlying structure: technology adoption is viewed in terms of potential gains and losses with individuals arrayed along a dimension anchored by gain and loss endpoints. 
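The two separate clusterings can be sketched in a few lines: compute distances among the rows using the columns, then transpose and compute distances among the columns using the rows. The example below, in plain Python with a made-up ratings table, only finds the closest pair in each direction, which is the first merge an agglomerative clustering would make; it is not a reproduction of what the R heatmap function does.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_pair(vectors):
    """Indices of the two vectors separated by the smallest Euclidean distance."""
    pairs = [(euclidean(vectors[i], vectors[j]), i, j)
             for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    _, i, j = min(pairs)
    return i, j

# Hypothetical ratings: 4 respondents by 4 items (two gain items, then two loss items).
ratings = [[9, 8, 2, 1],
           [8, 9, 1, 2],
           [2, 1, 9, 8],
           [3, 3, 8, 9]]

# Row clustering uses the columns to measure distance.
row_pair = closest_pair(ratings)

# Column clustering uses the rows, so transpose first.
columns = [list(col) for col in zip(*ratings)]
col_pair = closest_pair(columns)

print("first rows to merge:", row_pair)
print("first columns to merge:", col_pair)
```

Note that the two passes never consult each other, which is why this approach reorders rows and columns one at a time rather than jointly.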

Before we reach any conclusion concerning the usefulness of this type of "dual" clustering, we might wish to recall that the data come from a study of attitudes toward technology acceptance with the wording sufficiently general that every participant could provide ratings for all the items. If we had, instead, asked about concrete steps and concerns toward actual implementation, we might have found small and large companies living in different worlds and focusing on different details. I referred to this as local subspaces in a prior post, and it applies here because larger companies have in-house IT departments and IT changes a company's point-of-view.

To be clear, conversation is filled with likes and dislikes supported by accompanying feelings and beliefs. This level of generality permits us to communicate quickly and easily. The advertising tells you that the product reduces costs, and you fill in the details only to learn later that you set your cost-savings expectations too high.

We need to invent jargon in order to get specific, but jargon requires considerable time and effort to acquire, a task achieved by only a few with specialized expertise (e.g., the chief financial officer in the case of cost cutting). The head of the company and the head of information technology may well be talking about two different domains when each speaks of reliability concerns. If we want the rows of our data matrix to include the entire range of diverse players in technological decision making while keeping the variables concrete, then we will need a substitute for the above "dual" scaling.

We may wish to restate our dilemma. Causal models, such as the following technology acceptance model (TAM), often guide our data collection and analysis with all the inputs measured as attitude ratings.


Using a data set called technology from the R package plspm, one can estimate and test the entire partial least squares path model. Only the latent variables have been shown in the above diagram (here is a link to a more complete graphic), but it should be clear from the description of the variables in the technology data set that Perceived Usefulness (U) is inferred from ratings of useful, accomplish quickly, increase productivity and enhance effectiveness. However, this is not a model of real-world adoption that depends on so many specific factors concerning quality, cost, availability, service, selection and technical support. The devil is in the details, but causal modeling struggles with high dimensional and sparse data. Consequently, we end up with a model of how people talk about technology acceptance and not a model of the adoption process pushed forward by its internal advocates and slowed by its sticking points, detours and dead ends.

Yet, the causal model is so appealing. First comes perception and then follows intention. The model does all the heavy lifting with causation asserted by the path diagram and not discovered in the data. All I need is a rating scale and some general attitude statements that everyone can answer because they avoid any specifics that might result in a DK (don't know) or NA (not applicable). Although ultimately the details are where technology is accepted or rejected, they just do not fit into the path model. We would need to look elsewhere in R for a solution.

As I have been arguing for some time in this blog, the data that we wish to collect ought to have concrete referents. Although we begin with detailed measures, it is our hope that the data matrix can be simplified and expressed as the product of underlying structural forms. At least with technology adoption and other evolving product categories, there seems to be a linkage between product differentiation and consumer fragmentation. Perhaps surprisingly, matrix factorization with its roots as a computational routine from linear algebra seems to be able to untangle the factors responsible for such coevolving networks.

Thursday, July 9, 2015

The Ecology of Local Subspaces: Mixtures of Parochial Views

No matter where you live, your view of the world is biased and limited, which is the beauty of this magazine cover.


As a marketer, of course, all my maps depict, not place, but consumption. For example, in an earlier post I asked, "What apps are on your Smartphone?" It seemed like a reasonable question for a marketing researcher interested in cross-selling or just learning more about product usage. Nothing special about apps, the same question could have been raised about movies, restaurants, places you drive your car, or stuff that you buy using your credit card. In all these cases, there is more available than any one consumer will use or know. Regardless of what or how we consume, our awareness and familiarity will have the parochial feel of the New Yorker cartoon.

Not Really Big Data But Too Much for a Single Bite

Because we are surveying consumers, we are unlikely to collect data from more than a few thousand respondents, and there are limits to how much information we can obtain before they stop responding. This is not Netflix ratings or Facebook friending or Google searches. Still, the "world of apps" for the young teen in school and the business person using their phone for work are too distant to analyze in a single bite. Moreover, these worldviews cannot be disentangled by examining marginal row or column effects alone for they are a mixture of local subspaces defined by the joint clustering of rows and columns together.

I have titled this post "The Ecology of Local Subspaces" in the hope that our knowledge of ecological systems will help us understand consumption. Animals are mobile, yet the features that enable the movement of fish, birds, and mammals tend to be distinct since they inhabit different subspaces. Hoofed animals are not more or less similar because they do not have wings or fins. Two individuals are similar if they have the same apps on their Smartphones only after we have identified the local subspace of relevant apps that define their similarity. Same and different have meaning only within a given frame of reference. Different apps will enter the similarity computations for those with differing worldviews determined by their situation and usage patterns. An earlier post provides a more detailed account and suggests several R packages for simultaneous clustering.

Matrix Factorization Yields a Single Set of Joint Latent Factors 

I have already rejected the traditional approach of marginal row and column clustering. That is, one could cluster the rows using all the columns to calculate distances, either a distance matrix for all the rows or distances to proposed cluster centroids as in k-means clustering. Nothing forces us to make these distances Euclidean, and there are other distance measures in R. The same clustering could be attempted for the columns, although some prefer a factor analysis if they can obtain a reasonable correlation matrix given the columns are binary and sparse.

Our focus, however, is on the matrix whose block pattern is determined by the joint action of the rows and columns together. We seek a single set of joint latent factors with row contributions and column weights that reproduce the data matrix when their product is formed. What forces are at work in the Smartphone apps market that might generate such local subspaces? One could see our joint latent factors as the result of the need to accomplish desired tasks. Then, the apps co-reside on the same Smartphone because together they accomplish those desired tasks, and users who want to perform those tasks receive high scores on the corresponding joint latent factor.

Suppose that we start with 1000 respondents and 100 apps to form a 1000 x 100 data matrix with 100,000 data points (1000 x 100 = 100,000). Finding 10 joint latent factors would mean that we could approximate the 100,000 data points by multiplying a 1000 x 10 user factor score matrix times a 10 x 100 apps factor loading matrix. The matrix multiplication would look like the following diagram (except for more rows and columns for all three matrices) with W containing the 10 factor scores for the 1000 users and H holding the 10 sets of factor loadings for the 100 apps.
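The arithmetic of that factorization can be checked with a small R sketch. The matrices are filled with random nonnegative numbers purely to show the shapes involved; nothing here is estimated from data, and the object names are assumptions.

```r
set.seed(1)
n_users <- 1000
n_apps  <- 100
k       <- 10

# Hypothetical factors, random values only to illustrate dimensions
W <- matrix(runif(n_users * k), nrow = n_users, ncol = k)  # user factor scores
H <- matrix(runif(k * n_apps), nrow = k, ncol = n_apps)    # app factor loadings

V_hat <- W %*% H  # approximation to the 1000 x 100 data matrix
dim(V_hat)        # 1000 100

n_data   <- n_users * n_apps       # 100,000 data points
n_params <- length(W) + length(H)  # 10,000 scores + 1,000 loadings = 11,000
c(data_points = n_data, factor_values = n_params)
```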


The data reduction is obvious: before, I needed 100 app indicators to describe a user; now I need only 10 joint latent factor scores. Clearly, in order to reproduce the original data matrix, we will also require the factor loadings telling us which apps load on which joint latent factors.

More importantly, we will have made the representation simpler by naming the joint latent factors responsible for the data reduction. Apps are added for a purpose. In fact, multiple apps may be installed together or over time to achieve a common goal. The coefficient matrix H displays those purposes as the joint latent factors and the apps that serve those purposes as the factor loadings in each row. The simplification is complete with W, which describes the users in terms of those purposes or joint latent factors and not by a listing of the original 100 apps.
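To make W and H concrete, here is a minimal base R sketch of a nonnegative factorization on simulated data. It uses the Lee and Seung multiplicative updates for squared error, which is an assumption chosen for brevity and not necessarily the algorithm a dedicated R package would run by default. The block structure, group sizes, and names are all hypothetical.

```r
set.seed(123)

# Hypothetical usage data: 3 user groups, each favoring its own set of apps
n_users <- 90
n_apps  <- 30
k       <- 3
user_group <- rep(1:k, each = n_users / k)
app_group  <- rep(1:k, each = n_apps / k)

# Install probability is high inside a block and low outside it
p <- ifelse(outer(user_group, app_group, "=="), 0.8, 0.05)
V <- matrix(rbinom(n_users * n_apps, size = 1, prob = p), n_users, n_apps)

# Random nonnegative starting values
W <- matrix(runif(n_users * k), n_users, k)
H <- matrix(runif(k * n_apps), k, n_apps)

eps <- 1e-9          # guards against division by zero
err <- numeric(200)  # squared reconstruction error per iteration
for (i in seq_along(err)) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  err[i] <- sum((V - W %*% H)^2)
}

# For each app, the row of H (joint latent factor) with the largest loading
dominant_factor <- apply(H, 2, which.max)
table(app_group, dominant_factor)
```

Reading the rows of H tells us which apps define each factor, and the rows of W tell us how strongly each user draws on it. The factor labels are arbitrary, so the table should be read up to a permutation of the columns.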

Our hope is to discover an underlying structure revealing the hidden forces generating the observed usage patterns across diverse user communities.

Monday, July 6, 2015

Regression with Multicollinearity Yields Multiple Sets of Equally Good Coefficients

The multiple regression equation represents the linear combination of the predictors with the smallest mean-squared error. That linear combination is a factorization of the predictors with the factors equal to the regression weights. You may see the words "factorization" and "decomposition" interchanged, but do not be fooled: they refer to the same thing. The QR decomposition or factorization is the default computational method for the linear model function lm() in R. We start our linear modeling by attempting to minimize the least squares error, and we find that a matrix computation accomplishes this task quickly and accurately. Regression is not unique in this respect, for matrix factorization is a computational approach that you will rediscover over and over again as you add R packages to your library (see Two Purposes for Matrix Factorization).
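A short sketch shows the connection between lm() and the QR factorization. The data are simulated and the variable names are hypothetical; qr(), qr.coef(), qr.Q(), qr.R(), and backsolve() are all base R functions.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Build the design matrix by hand and factor it as X = QR
X   <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)
qrX <- qr(X)

# Least squares coefficients straight from the factorization
b_qr <- qr.coef(qrX, y)

# The same thing spelled out: solve the triangular system R b = Q'y
b_manual <- backsolve(qr.R(qrX), crossprod(qr.Q(qrX), y))

cbind(lm = coef(fit), qr = b_qr, manual = as.numeric(b_manual))
```

All three columns agree, which is the sense in which the regression weights fall out of a matrix factorization.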

Now, multicollinearity may make more sense. Multicollinearity occurs when I have included in the regression equation several predictors that share common variation so that I can predict any one of those predictors from some linear combination of the other predictors (see tolerance in this link). In such a case, it no longer matters what weights I give individual predictors for I get approximately the same results regardless. That is, there are many predictor factorizations yielding approximately the same predictive accuracy. The simplest illustration is two highly correlated predictors for which we obtain equally good predictions using any one predictor alone or any weighted average of the two predictors together. "It don't make no nevermind" for the best solution with the least squares coefficients is not much better than the second best solution or possibly even the 100th best solution. Here, the "best" solution is defined only for this particular dataset before we ever begin to talk about cross-validation.
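The two-predictor illustration is easy to simulate. In this sketch the predictors share a common source z and correlate at roughly 0.99; the sample size, noise levels, and weight sets are arbitrary assumptions chosen for illustration.

```r
set.seed(2015)
n  <- 500
z  <- rnorm(n)                # shared source of variation
x1 <- z + rnorm(n, sd = 0.1)  # two predictors that are nearly interchangeable
x2 <- z + rnorm(n, sd = 0.1)
y  <- z + rnorm(n)

cor(x1, x2)

# R-squared from the least squares weights
r2_ols <- summary(lm(y ~ x1 + x2))$r.squared

# R-squared from fixed, arbitrary weightings of the same two predictors
weights <- rbind("x1 only" = c(1, 0),
                 "x2 only" = c(0, 1),
                 "50/50"   = c(0.5, 0.5),
                 "80/20"   = c(0.8, 0.2),
                 "20/80"   = c(0.2, 0.8))
r2_alt <- apply(weights, 1, function(w) cor(y, w[1] * x1 + w[2] * x2)^2)

round(c(ols = r2_ols, r2_alt), 4)
```

The least squares solution is the best by construction, yet the other weightings trail it by very little in this sample.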

On the other hand, when all the predictors are mutually independent, we can speak unambiguously about the partitioning of R-squared. Each independent variable makes its unique contribution, and we can simply add their impacts for the total is truly the sum of the parts. This is the case with orthogonal experimental designs where one calculates the relative contribution of each factor, as one does in rating-based conjoint analysis where the effects are linear and additive. However, one needs to be careful when generalizing from rating-based to choice-based conjoint models. Choice is not a linear function of utility so that the impact on share from changing any predictor depends on the values of all the predictors, including the predictor being manipulated. Said differently, the slope of the logistic function is not constant but varies with the values of the predictors.
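Both points can be illustrated with a small sketch. The first part builds a balanced, effect-coded 2 x 2 design in which the two factors are exactly uncorrelated, so R-squared splits cleanly. The second part evaluates the slope of the logistic function, beta * p * (1 - p), at several utility values. The design, coefficients, and utilities are hypothetical.

```r
set.seed(11)

# Balanced 2 x 2 design with 25 replicates: x1 and x2 are exactly uncorrelated
design <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))
d <- design[rep(1:4, times = 25), ]
d$y <- 3 + 1.0 * d$x1 + 0.5 * d$x2 + rnorm(nrow(d))

r2_full <- summary(lm(y ~ x1 + x2, data = d))$r.squared
r2_sum  <- cor(d$y, d$x1)^2 + cor(d$y, d$x2)^2
c(full_model = r2_full, sum_of_parts = r2_sum)

# Choice share is not linear in utility: the slope depends on where you start
beta    <- 1
utility <- c(-2, 0, 2)
share   <- plogis(utility)
slope   <- beta * share * (1 - share)
round(data.frame(utility, share, slope), 3)
```

The first comparison shows the total equal to the sum of the parts, while the second shows that the same one-unit change in utility moves share the most near 50% and far less in the tails.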

We will ignore nonlinearity in this post and concentrate on non-additivity. Our concern will be the ambiguity that enters when the predictors are correlated (see my earlier post on The Relative Importance of Predictors for a more complete presentation).

The effects of collinearity are obvious from the formula calculating R-squared from the cells of the correlation matrix between y and the separate x variables. With two predictors, as shown below by the subscripts 1 and 2, we see that R-squared is a complex interplay of the separate correlations of each predictor with y and the interrelationships among the predictors. Of course, everything simplifies when the predictors are independent with r(1,2)=0 and the numerator reducing to the sum of the squared correlations of each predictor with y divided by a denominator equal to one.
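The original figure is not reproduced here, so the notation below is an assumption: it is the standard textbook expression for two predictors, with r_{y1} and r_{y2} the correlations of each predictor with y and r_{12} the correlation written as r(1,2) in the text.

```latex
R^2 = \frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}
```

Setting r_{12} = 0 removes the cross-product term and leaves a denominator of one, so R-squared becomes r_{y1}^2 + r_{y2}^2.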


The formulas for the regression coefficients mirror the same "adjustment" process. If the correlation between the first predictor and y represents the total effect of the first variable on y, then the beta weight shows the direct effect of the first variable after removing its indirect path through the second predictor. Again, when the predictors are uncorrelated, the beta weight equals the correlation with y.
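As with R-squared, the original figure is missing, so what follows is the standard two-predictor result for standardized variables in assumed notation (r_{y1}, r_{y2} for the correlations with y, and r_{12} for the correlation between the predictors).

```latex
\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2},
\qquad
\beta_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2}
```

The subtracted term r_{y2} r_{12} is the indirect path through the other predictor. With r_{12} = 0 it vanishes and each beta weight reduces to the corresponding correlation with y.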


We speak of this adjustment as controlling for the other variables in the regression equation. Since we have only two independent variables, we can talk of the effect of variable 1 on y controlling for variable 2. Such a practice seems to imply that the meaning of variable 1 has not been altered by controlling for variable 2. We can be more specific by letting variable 1 be a person's weight, variable 2 be a person's height and the dependent variable be some measure of health. What is the contribution of weight to health controlling for height? Wait a second, weight controlling for height is not the same variable as weight. We have a term for that new variable; we call it obesity. Simply stated, the meaning of a term changes as we move from the marginal (correlations) to conditional (partial correlations) representations.
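The weight and height example can be made concrete with simulated data. All numbers below are invented for illustration. The sketch relies on a standard least squares result: the coefficient for weight in the two-predictor model equals the coefficient obtained by regressing health on the residual of weight after height has been removed.

```r
set.seed(3)
n <- 300

# Hypothetical data: height in cm, weight in kg, and an arbitrary health score
height <- rnorm(n, mean = 170, sd = 10)
weight <- -60 + 0.8 * height + rnorm(n, sd = 8)
health <- 50 - 0.5 * weight + 0.4 * height + rnorm(n, sd = 5)

# Marginal effect: weight alone
b_marginal <- coef(lm(health ~ weight))["weight"]

# Conditional effect: weight controlling for height
b_conditional <- coef(lm(health ~ weight + height))["weight"]

# "Weight controlling for height" as its own variable:
# the part of weight that height does not predict
weight_given_height <- resid(lm(weight ~ height))
b_residualized <- coef(lm(health ~ weight_given_height))["weight_given_height"]

c(marginal     = unname(b_marginal),
  conditional  = unname(b_conditional),
  residualized = unname(b_residualized))

# The residualized variable is related to weight but is not the same variable
cor(weight, weight_given_height)
```

The conditional and residualized coefficients match, which confirms that the regression weight describes the residualized variable and not weight as originally measured.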

None of this is an issue when our goal is solely prediction. Yet, the human desire to control and explain is great, and it is difficult to resist the temptation to jump from association to causal inference. The key is not to accept the data as given but to search for a representation that enables us to estimate additive effects. One alternative treats observed variables as the bases for latent variable regression in structural equation modeling. Another approach, nonnegative matrix factorization (NMF), yields a representation in terms of building blocks that can additively be combined to form relatively complex structures. The model does not need to be formulated as a matrix factorization problem in order for these computational procedures to yield solutions.