Monday, September 29, 2014

TURF Analysis: A Bad Answer to the Wrong Question

Now that R has a package performing Total Unduplicated Reach and Frequency (TURF) Analysis, it might be a good time to issue a warning to all R users. DON'T DO IT!

The technique itself is straight out of media buying from the 1950s. Given n alternative advertising options (e.g., magazines), which subset of size k will reach the most readers and be seen most often? Unduplicated reach is the primary goal because we want everyone in the target audience to see the ad. In addition, it was believed that seeing the ad more than once would make it more effective (that is, until wearout), which is why frequency is a component. When TURF is used to create product lines (e.g., flavors of ice cream to carry given limited freezer space), frequency tends to be downplayed and the focus placed on reaching the largest percentage of potential customers. All this seems simple enough until one looks carefully at the details and realizes that we are interpreting random variation.
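To make the calculation concrete, here is a minimal sketch of what TURF scores for a single candidate set. The purchase matrix, the item names and the chosen combination below are all invented for illustration: reach is simply the share of respondents who would take at least one item in the set, and frequency is the average number they would take.

# hypothetical 0/1 purchase-intent matrix: 180 respondents by 10 items
set.seed(1)
buy <- matrix(rbinom(180 * 10, 1, 0.3), nrow = 180,
              dimnames = list(NULL, paste0("item", 1:10)))
combo <- c("item8", "item9", "item10")       # one candidate set of size 3
reach <- mean(rowSums(buy[, combo]) > 0)     # bought at least one of the three
freq  <- mean(rowSums(buy[, combo]))         # average number bought from the set
c(reach = reach, frequency = freq)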

The R package turfR includes an example showing how to use its turf() function by setting n to 10 and letting k range from 3 to 6.

library(turfR)
data(turf_ex_data)
# evaluate all item sets of size 3 through 6 drawn from the 10 items
ex1 <- turf(turf_ex_data, 10, 3:6)
ex1

This code produces a considerable amount of output. I will show only the 10 best triplets from the 120 possible sets of three that can be formed from 10 alternatives. The rchX column gives the weighted proportion of the 180 individuals in the dataset who would buy at least one of the items in the combination, with the columns labeled 1 through 10 indicating which items are included in that set. Thus, according to the first row, 99.9% would buy something if Items 8, 9, and 10 were offered for sale.

     combo     rchX     frqX  1  2  3  4  5  6  7  8  9  10
1      120 0.998673 2.448993  0  0  0  0  0  0  0  1  1   1
2      119 0.998673 2.431064  0  0  0  0  0  0  1  0  1   1
3       99 0.995773 1.984364  0  0  0  1  0  0  0  1  0   1
4      110 0.992894 2.185398  0  0  0  0  1  0  0  0  1   1
5       64 0.991567 1.898693  0  1  0  0  0  0  0  0  1   1
6      109 0.990983 2.106944  0  0  0  0  1  0  0  1  0   1
7       97 0.990850 1.966436  0  0  0  1  0  0  1  0  0   1
8      116 0.989552 2.341179  0  0  0  0  0  1  0  0  1   1
9       85 0.989552 2.042792  0  0  1  0  0  0  0  0  1   1
10      36 0.989552 1.800407  1  0  0  0  0  0  0  0  1   1

The sales pitch for TURF depends on showing only the "best" solution for each set size from 3 through 6. Once we look down the list, we find many equally good combinations built from different products (e.g., the combination in the 7th position yields 99.1% reach with products 4, 7 and 10). With a sample size of 180, I do not need to run a bootstrap to know that the drop from 99.9% to 99.1% reflects random variation or error.

Of course, the data from turfR are simulated, but I have worked with many clients and many different datasets across a range of categories, and I have never found anything but random differences among the top solutions. I have seen solutions where the top several hundred combinations cannot be distinguished based on reach, which is reasonable given that the number of combinations increases rapidly with n and k (e.g., the R function choose(30,5) indicates that there are 142,506 possible combinations of 30 things in sets of 5). You can find an example of what I see over and over again by visiting the TURF website for XLSTAT software.
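The counting itself is easy to check in R:

choose(10, 3)   # 120 candidate triplets, as in the turfR example above
choose(30, 5)   # 142506 candidate sets of 5 from 30 items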

Obviously, there is no single best item combination that dominates all others. It could have been otherwise. For example, it is possible that the market consists of distinct segments with each wanting one and only one item.

With no overlap in this Venn diagram, it is clear that vanilla is the best single item, followed by vanilla and chocolate as the best pair, and so on had there been more flavors separated in this manner.

However, consumer segments are seldom defined by individual offerings in the market. You do not stop buying toothpaste because your brand has been discontinued. TURF asks the wrong question because consumer segmentation is not item-based.

As a quick example, we can think about credit card reward programs with their categories covering airlines, cash back, gas rebates, hotels, points, shopping and travel. Each category could contain multiple reward offers. A TURF analysis would seek the best individual rewards while ignoring the categories. Yet, comparison websites use categories to organize searches because consumer segments are structured around the benefits offered by each category.

The TURF Analysis procedure from XLSTAT allows you to download an Excel file with purchase intention ratings for 27 items from 185 respondents. A TURF analysis would require that we set a cutoff score to transform the 1-through-5 ratings into a 0/1 binary measure. I prefer to maintain the 5-point scale and treat purchase intent as an intensity score after subtracting one, so that the scale now ranges from 0=not at all to 4=quite sure. A nonnegative matrix factorization (NMF) reveals that the 27 items in the columns fall into 8 separable row categories, with red indicating a high probability of membership and yellow (values close to zero) marking the categories where a product does not belong.

The above heatmap displays the coefficients for each of the 27 products, as the original Excel file names them. Unfortunately, we have only the numbers and no description of the 27 products. Still, it is clear that interest has an underlying structure and that perhaps we ought to consider grouping the products based on shared features, benefits or usages. For example, what do Products 5, 6 and 17, clustered together at the end of this heatmap, have in common? To be clear, we are looking for stable effects that can be found both in the data and in the market where purchases are actually made.

The right question asks about consumer heterogeneity and whether it supports product differentiation. Different product offerings are only needed when the market contains segments seeking different benefits. Those advocating TURF analysis often use ice cream flavors as their example, as I did in the above Venn diagram. What if the benefit driving sales of less common flavors was not the flavor itself but the variety associated with a new flavor or a special occasion when one wants to deviate from the norm? A segmentation, whether NMF or another clustering procedure, would uncover a group interested in less typical flavors (probably many such flavors). This is what I found in the purchase history of whiskey drinkers: a number of segments each buying one of the major brands, plus a special-occasion or variety-seeking segment buying many niche brands. All of this is missed by a TURF analysis that gives us instead a bad answer to the wrong question.

Appendix with R Code needed to generate the heatmap:

First, download the Excel file, convert it to csv format, and set the working directory to the location of the data file.

# read the purchase intent ratings (the first column is a respondent identifier)
test<-read.csv("demoTurf.csv")
library(NMF)
# shift the 1-5 ratings to 0-4 and fit a rank-8 factorization
fit<-nmf(test[,-1]-1, 8, method="lee", nrun=20)
# heatmap of the coefficient matrix used to group the 27 items
coefmap(fit)


Saturday, September 27, 2014

Recognizing Patterns in the Purchase Process by Following the Pathways Marked By Others

Herbert Simon's "ant on the beach" does not search for food in a straight line because the environment is not uniform with pebbles, pools and rough terrain. At least the ant's decision making is confined to the 3-dimensional space defining the beach. Consumers, on the other hand, roam around a much higher dimensional space in their search for just the right product to buy.

Do you search online or shop retail? Do you go directly to the manufacturer's website, or do you seek out professional reviews or user ratings? Does YouTube or social media hold the key? Similar decisions must be made for physical searches of local retailers and superstores. Of course, embedded within each of these decision points are more choices concerning features, servicing and price.

Yet, we do not observe all possible paths in the consumer purchase journey. Like the terrain of the beach, the marketplace makes some types of searches easier than others. In addition, like the ant, the first consumers leave trails that later consumers can follow. This can be direct word of mouth or indirect effects such as internet searches where the websites shown first depend on the number of previous visits. But it can also be marketing messaging and expert reviews, that is, markers along the trail telling us what to look for and where to look. We are social creatures, and it is fascinating to see how quickly all the possible paths through the product offerings are narrowed down to several well-worn trails that we all follow. Culture impacts what and how we buy, and statistical modeling that incorporates what others are doing may be our best hope of discovering those pathways.

In order to capture everyone in the product market and all possible sources of information, we require a wide net with fine webbing. Our data matrix will contain heterogeneous rows of consumers with distinctive needs who are seeking very different benefits. Moreover, our columns must be equally diverse to span everywhere that a consumer can search for product information. As a result, we can expect our data matrix to be sparse for we have included many more columns of information sources than any one consumer would access.

To make sense of such a data matrix, we will require a statistical model or algorithm that reflects this construction process, by which I mean the social and cultural grouping of consumers who share a common understanding of what is important to know and where one should seek such information. For example, someone looking for a new credit card could search and apply solely online, but not every consumer will, for some do not shop on the internet or feel insecure without a physical building close to home. Those wanting to apply in person may wait for a credit card offer to be inserted in their monthly bank statement, or they may see an advertisement in the local newspaper.

Modeling the Joint Separation of Consumers and Their Information Sources

Nonnegative matrix factorization (NMF) decomposes the nonnegative data matrix into the product of two other nonnegative matrices, one for consumers and the other for information sources. The goal is dimension reduction. Before NMF, we needed all p columns of the data matrix to describe the consumer. Now, we can get by with only the r latent features, where r is much smaller than p. What are these latent features? They are defined in the same manner as the factors in factor analysis. Our second matrix from the nonnegative factorization contains coefficients that can be interpreted as one would factor loadings. We look for the information sources with the largest weights to name the latent feature.
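As a minimal sketch (assuming a fitted NMF object named fit, like the one produced by the code at the end of this post), the coefficient matrix can be pulled out and scanned the way one would scan a table of factor loadings; the object name and the choice of "top three" are only placeholders.

library(NMF)
H <- coef(fit)                  # r x p coefficient matrix, one row per latent feature
round(H, 2)
# for each latent feature, the column indices (information sources) with the largest weights
apply(H, 1, function(w) order(w, decreasing = TRUE)[1:3])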

Returning to our credit card example, the data matrix includes rows for consumers banking online and in person plus columns for online search along with columns for direct mail and newspaper ads. Online banking customers use online information sources, while in-person banking customers can be found looking for information in a different cluster of columns. We have separation, with online rows and columns forming one block and in-person rows and columns coming together in a separate block.

The nonnegativity of the two product matrices enables such a "parts-based" representation with the simultaneous clustering of both rows and columns. We start with the observed data matrix. It is nonnegative, so that zero indicates none and a larger positive value suggests more of whatever is being measured. Counts or frequencies of occurrence would work. Actually, the data matrix can contain any intensity measure. Hopefully, you can visualize that the data matrix will be sparser (more zeros) with greater separation between the row-column blocks, and in turn, this sparsity will be associated with corresponding sparsity in the two product matrices.

A toy example might help with this explanation.

   V1 V2 V3 V4
S1  6  3  0  0
S2  4  2  0  0
S3  2  1  0  0
S4  0  0  6  3
S5  0  0  4  2
S6  0  0  2  1

The above data matrix shows the intensity of search scores from 0 (no search) to 6 (intense search) for six consumers across four different information sources. What might have produced such a pattern? The following could be responsible:
  • Online sources in the first two columns with V1 more popular than V2,
  • Offline sources in the last two columns with V3 more popular than V4,
  • Online customers in the first three rows with individual search intensity S1 > S2 > S3, and
  • Offline customers in the last three rows with individual search intensity S4 > S5 > S6.
The pattern might seem familiar as row and column effects from an analysis of variance. The columns form a two-level repeated measures factor with V1 and V2 nested in the first level (online) and V3 and V4 in the second level (offline). Similarly, the rows fall into two levels of a between-subject factor with the first three rows nested in level one (online) and the last three rows in level two (offline). Biclustering algorithms approach the problem in this manner (e.g., the R package biclust). Matrix factorization achieves a similar outcome by factoring the data matrix into the product of two new matrices with one representing row effects and the other column effects.

The NMF R package decomposes the data matrix into the two components that are believed to have generated the data in the first place. In fact, I created the data matrix as a matrix product and then used NMF to retrieve the generating matrices. The R code is given at the end of this post. The matrices W and H, below, reflect the above four bullet points. When these two matrices are multiplied, their product W x H is the above data matrix (e.g., the first entry in the data matrix is 3x2+0x0=6).

W   R1 R2        H    V1 V2 V3 V4
S1   3  0        R1    2  1  0  0
S2   2  0        R2    0  0  2  1
S3   1  0
S4   0  3
S5   0  2
S6   0  1

As expected, when we run the nmf() function with rank r=2 on this data matrix, we get these two matrices back again with W as the basis and H as the coefficient matrix. Actually, because W and H are multiplied, we might find that every element in W is divided by 2 and every element in H is multiplied by 2, which would yield the same product. Looking at the weights in H, one concludes that R1 taps online information sources, leaving R2 as the offline latent feature. If you wished to standardize the weights, all the coefficients in a row could be transformed to range from 0 to 1 by dividing by the maximum value in that row.
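If you want that 0-to-1 rescaling, one way to do it with the NMF accessor functions (a sketch, again assuming the fitted object is named fit) is:

H <- coef(fit)                               # coefficient matrix from the fit
H_std <- sweep(H, 1, apply(H, 1, max), "/")  # divide each row by its maximum
round(H_std, 2)                              # every row now runs from 0 to 1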

Decompositions such as NMF are common in statistical modeling. Regression analysis in R using the lm() function is performed as a QR decomposition. The singular value decomposition (SVD) underlies much of principal component analysis. Nothing unusual here, except for the ability of NMF to thrive when the data are sparse.

To be clear, sparsity is achieved when we ask about the details of consumer information search. Such details enable management to make precise changes in their marketing efforts. As important, detailed probes are more likely to retrieve episodic memories of specific experiences. It is better to ask about the details of price comparison (e.g., visit competitor website or side-by-side price comparison on Amazon or some similar site) than just inquire if they considered price during the purchase process.

Although we are not tracking ants, we have spread sensors out all over the beach, a wide network of fine mesh. Our beach, of course, is the high-dimensional space defined by all possible information sources. This space can be huge, over a billion combinations when we have only 30 information sources measured as yes or no. Still, as long as consumers confine their searches to low-dimensional subspaces, the data matrix will have the sparsity needed by the decompositional algorithm. That is, NMF will be successful as long as consumers adopt one of several established search pathways clearly marked by repeated consumer usage and marketing signage.

R code to create the V=WH data matrix and run the NMF package:

# construct the toy example: W (6 x 2 basis) and H (2 x 4 coefficients)
W=matrix(c(3,2,1,0,0,0,0,0,0,3,2,1), nrow=6)
H=matrix(c(2,0,1,0,0,2,0,1), nrow=2)
V=W%*%H                          # the observed data matrix is their product
W; H; V

library(NMF)
# rank-2 factorization with 20 random restarts
fit<-nmf(V, 2, method="lee", nrun=20)
fit
round(basis(fit),3)              # recovered W (possibly rescaled)
round(coef(fit),3)               # recovered H (possibly rescaled)
round(basis(fit)%*%coef(fit))    # reproduces the data matrix V


Monday, September 22, 2014

What is Cluster Analysis? A Projective Test

Supposedly, projective tests (e.g., the inkblots of psychoanalysis) contain sufficient ambiguity that "what you see" reveals some aspect of your thinking that has escaped your awareness. Although the following will provide no insight into your neurotic thoughts or feelings, it might help separate two different ways of performing and interpreting cluster analysis.

A light pollution map of the United States, a picture at night from a satellite orbiting the earth, is shown below.



Which of the following two representations more closely matches the way you think of this map?

Do you consider population density to be the mixture of distributions represented by the red spikes in the first option?




Or perhaps this mixture model is too passive for you, so that you prefer the air traffic representation in the second option showing separate airplane locations at some point in time.



The mclust package in R provides the more homeostatic first representation using density functions. Because mclust adjusts the shape of each normal distribution in the mixture, one can model the Northeast corridor from Boston to Philadelphia with a single cluster. Moreover, the documentation enables you to perform the analysis without excessive pain and to understand how finite mixture models work. If you need a video lecture on Gaussian mixtures, MathematicalMonk on YouTube is the place to start (aka Jeff Miller).
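A minimal sketch of this first representation, using simulated two-dimensional points rather than the population map (the data below are made up solely to show the function calls):

library(mclust)
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))   # two artificial concentrations
fit <- Mclust(x)                   # BIC chooses the number and shape of the components
summary(fit)
plot(fit, what = "classification")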

On the other hand, if airplanes can be considered as messages passed between nodes with greater concentrations (i.e., cities with airports), then the R package performing affinity propagation, apcluster, offers the more "self-organizing" model shown in the second option with many possible ways of defining similarity or affinity. Ease of use should not be a problem with a webinar, a comprehensive manual, and a link to the original Science article. However, the message propagation algorithm requires some work to comprehend the details. Fortunately, one can run the analysis, interpret the output, and know enough not to make any serious mistakes without all the computational intricacies.
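And a sketch of the second representation on the same sort of simulated points (the similarity measure and its exponent are choices, not recommendations):

library(apcluster)
set.seed(42)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
ap <- apcluster(negDistMat(r = 2), x)   # negative squared distance as the similarity
length(ap@exemplars)                    # number of clusters the algorithm settles on
plot(ap, x)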

And the true representation is? As a marketer, I see it as a dynamic process with concentrations supported by the seaports, rivers, railroad tracks, roads, and airports that served commerce over time. Population clusters continually evolve (e.g., imagine Las Vegas without air travel). They are not natural kinds revealed by carving nature at its joints. Diversity comes in many shapes and forms, each requiring its own model with its unique assumptions concerning the underlying structures. More importantly, cluster analysis serves many different purposes with each setting its own criteria. Haven't we learned that one size does not fit all?



Thursday, September 18, 2014

The New Consumer Requires an Updated Market Segmentation

The new consumer is the old consumer with more options and fewer prohibitions. Douglas Holt calls it the postmodern market defined by differentiation: "consumer identities are being fragmented, proliferated, recombined, and turned into salable goods." It is not simply that more choices are available for learning about products, for sharing information with others and for making purchases. All that is true thanks to the internet. In addition, however, we have seen what Grant McCracken names plenitude, "an ever-increasing variety of observable ways of living and being that are continually coming into existence." Much more is available, and much more is acceptable.

For instance, the new digital consumer is no longer forced to choose one of the three major networks. Not only do they have other channels, but now they "watch" while connected through other devices. The family can get together in front of the TV with everyone doing their own thing. Shouldn't such consumer empowerment have some impact on how we segment the market?

Although we believe that the market is becoming more fragmented, our segment solutions still look the same. In fact, the most common segmentation of the digital consumer remains lifestyle. Thus, Experian's Fast Track Couple is defined by age and income with kids or likely to start a family soon. Of course, one's life stage is important and empty nesters do not behave like unmarried youths. But where is the fragmentation? What digital devices are used when and where and for what purposes? Moreover, who else is involved? We get no answers, just more of the same. For example, IBM responds to increasing diversity with its two-dimensional map based on usage type and intensity with a segment in every quadrant.


The key is to return to our new digital consumer who is doing what they want with the resources available to them. Everything may be possible but the wanting and the means impose a structure. Everyone does not own every device, nor do they use every feature. Instead, we discover recurrent patterns of specific device usage at different occasions with a limited group of others. As we have seen, the new digital consumer may own a high-definition TV, an internet-connected computer or tablet, a smartphone, a handheld or gaming console, a DVD/Blu-Ray player or recorder, a digital-media receiver for streaming, and then there is music. These devices can be for individual use or shared with others, at home or somewhere else, one at a time or multitasking, for planned activities or spontaneously, every day or for special occasions, with an owned library or online content, and the list could go on.

What can we learn from usage intensity data across such an array of devices, occasions and contexts? After all, topic modeling and sentiment analysis can be done with a "bag of words" listing the frequencies with which words occur in a text. Both are generative models assuming that the writer or speaker has something to say and picks the words to express it. If all I had was a count of which words were used, could I infer the topic or the sentiment? If all I had was a measure of usage intensity across devices, occasions and contexts, could I infer something about consumer segments that would help me design or upsell products and services?

Replacing Similarity as the Basis for Clustering

Similarity, often expressed as distance, dominates cluster analysis, either pairwise distances between observations or between each observation and the segment centroids. Clusters are groupings such that observations in the same cluster are more similar to each other than they are to observations in other clusters. A few separated clouds of points on a two-dimensional plane display the concept. However, we need lots of dimensions to describe our new digital consumer, although any one individual is likely to be close to the origin of zero intensity on all but a small subset of the dimensions. Similarity or distance loses its appeal as the number of dimensions increases and the space becomes sparser (the curse of dimensionality).

Borrowing from topic modeling, we can use non-negative matrix factorization (NMF) without ever needing to calculate similarity. What are the topics or thematic structures underlying the usage patterns of our new digital consumer? What about personal versus shared experiences? Would we not expect a different pattern of usage behavior for those wanting their own space and those wanting to bring people together? Similarly, those seeking the "ultimate experience" within their budgets might be those with the high quality speakers or the home theater or latest gaming console and newest games. The social networker multitasks and always keeps in contact. The collector builds their library. Some need to be mobile and have access while in transit. I could continue, but hopefully it is clear that one expects to see recurring patterns in the data.

NMF uncovers those patterns by decomposing the data matrix with individuals as the rows and usage intensities as the columns. As I have shown before and show again below, the data matrix V is factored into a set of latent features forming the rows of H and individual scores on those same latent features in the rows of W. We can see the handiwork of the latent features in the repeating pattern of usage intensities. Who does A, B, C, and D with such frequency? It must be a person of this type engaging in this kind of behavior.

You can make this easy by thinking of H as a set of factor loadings for behaviors (turned on its side) and W as the corresponding individual factor scores. For example, it is reasonable to believe that at least some of our new digital consumers will be gamers, so we expect to see one row of H with high weights or loadings for all the game-related behaviors in the columns of H. Say that row is the first row; then the first column of W tells us how much each consumer engages in gaming activities. The higher the score in the first column of W, the more intense the gamer. People who never game get a score of zero.


In the above figure there are only two latent features. We are trying to reproduce the data matrix with as many latent features as we can interpret. To be clear, we are not trying to reproduce all the data as closely as possible because some of that data will be noise. Still, if I look at the rows of H and can quickly visualize and name all the latent features, I am a happy data analyst and will retain them all.

The number of latent features will depend on the underlying data structure and the diversity of the intensity measures. I have reported 22 latent features for a 218-item adjective rating scale. NMF, unlike the singular value decomposition (SVD) associated with factor analysis, does not attempt to capture as much variation as possible. Instead, NMF identifies additive components, and consequently we tend to see something more like micro-genres or micro-segments.

So far, I have only identified the latent features. Sometimes that is sufficient, and individuals can be classified by looking at their row in W and classifying them as belonging to the latent feature with the largest score. But what if a few of our gamers also watched live sports on TV? It is helpful to recall that latent features are shared patterns so that we would not extract a separate latent feature for gaming and for live TV sports if everyone who did one did the other, in which case there would be only one latent feature with both sets of intensity measures loading on it in H.

The latent feature scores in W can be treated like any other derived score and can enter into any other analysis as data points. Thus, we can cluster the rows of W, now that we have reduced the dimensionality from the columns of V to the columns of W and similarity has become a more meaningful metric (though care must be taken if W is sparse). The heat maps produced by the NMF package attach a dendrogram at the side displaying the results of a hierarchical cluster analysis. Given that we have the individual latent feature scores, we are free to create a biplot or cluster with any method we choose.
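A quick sketch of that last step (fit, the number of clusters and the linkage method are all placeholders to be chosen for your own data):

library(NMF)
W <- basis(fit)                               # respondents by latent features
km <- kmeans(W, centers = 5, nstart = 25)     # k-means in the reduced space
table(km$cluster)
hc <- hclust(dist(W), method = "ward.D2")     # or the hierarchical view the heatmap uses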

R Makes It So Easy with the NMF Package

Much of what you know about k-means and factor analysis generalizes to NMF. That is, as in factor analysis, one needs to specify the number of latent features (rank r) and interpret the factor loadings contained in H (after transposing or turning it sideways). You can find all the R code and all the output explained in a previous post. Just as factor analysis has the scree plot, NMF offers several plots that some believe will help solve the number-of-factors problem. The NMF vignette outlines the process under the heading "Estimating the factorization rank" in Section 2.6. Personally, I find such aids to be of limited value, relying instead on interpretability as the criterion for keeping or discarding latent features.
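The rank survey described in the vignette looks something like this (V stands for your own data matrix; the range of ranks and the number of runs are arbitrary here):

library(NMF)
estim <- nmf(V, 2:6, nrun = 10, seed = 123456)
plot(estim)     # cophenetic correlation, residuals and other quality measures by rank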

Finally, NMF runs into all the problems experienced using k-means, the most serious being local minima. Local minima are recognized when the solution seems odd or when you repeat the same analysis and get a very different solution. Similar to k-means, one can redo the analysis many times with different random starting values. If needed, one can specify the seeding method so that a different initialization starts the iterative process (see Section 2.3 of the vignette). Adjusting the number of different random starting values until consistent solutions are achieved seems to work quite well with marketing data that contain separable groupings of rows and columns. That is, factorization works best when there are actual factors generating the data matrix, in this case, types of consumers and kinds of activities that are distinguishable (e.g., some game and some do not, some only stream and others rent or own DVDs, some only shop online and others search online but buy locally).
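In practice that means increasing the number of restarts until repeated fits agree, or switching to a deterministic seeding method (a sketch; V is your data matrix and the rank of 4 is arbitrary):

fit  <- nmf(V, 4, nrun = 50)          # keep the best of 50 random restarts
fit2 <- nmf(V, 4, seed = "nndsvd")    # deterministic initialization (vignette, Section 2.3)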

Saturday, September 13, 2014

The Ecology of Data Matrices: A Metaphor for Simultaneous Clustering

"...a metaphor is an affair between a predicate with a past and an object that yields while protesting." 

It is as if data matrices were alive. The rows are species, and the columns are habitats. At least that seems to be the case with recommender systems. Viewers seek out and watch only those movies that they expect to like while remaining unaware of most of what is released or viewed by others. The same applies to music. There are simply too many songs to listen to them all. In fact, most of us have very limited knowledge of what is available. Music genre may have become so defining of who we are and whom we associate with that all of us have become specialists with little knowledge of what is on the market outside of our small circle.

Attention operates in a similar manner. How many ads are there on your web page or shown during your television program? Brand recognition is not random for we are drawn to the ones we know and prefer. We call it "affordance" when the columns of our data matrices are objects or activities: what to eat on a diet or what to do on a vacation. However, each of us can name only a small subset of all that can be eaten when dieting or all that can be done on a vacation. Preference is at work even before thinking and is what gets stuff noticed.

Such data create problems for statistical modeling that focuses solely on the rows or on the columns and treats the other mode as fixed. For example, cluster analysis takes the columns as given and calculates pairwise distances between rows (hierarchical clustering) or distances of rows from cluster centroids (kmeans). This has become a serious concern for the clustering of high dimensional data as we have seen with the proliferation of names for the simultaneous clustering of rows and columns: biclustering, co-clustering, subspace clustering, bidimensional clustering, block clustering, two-mode or two-way clustering and many more. The culprit is that the rows and columns are confounded sufficiently that it makes less and less sense to treat them as independent entities. High dimensionality only makes the confounding more pronounced.

As a marketing researcher, I work with consumers and companies who have intentions and act on them. Thus, I find the ecology metaphor compelling. It is not mandatory, and you are free to think of the data matrix as a complex system within which rows and columns are coupled. Moreover, the ecology metaphor does yield additional benefits since numerical ecology has a long history of trying to quantify dissimilarity given the double-zero problem. The Environmetrics Task View lists the R packages dealing with this problem under the subheading Dissimilarity Coefficients. Are two R programmers more or less similar if neither of them has any experience with numerical ecology in R? This is the double-zero problem. The presence of two species in a habitat does not mean the same as the absence of two species. One can see similar concerns raised in statistics and machine learning under the heading "the curse of dimensionality" (see Section 3 of this older but well-written explanation).
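For a concrete illustration of the double-zero problem, vegan (one of the packages covered by that Task View) computes dissimilarities that ignore shared absences; the tiny species-by-habitat matrix below is invented:

library(vegan)
site <- rbind(a = c(3, 0, 0, 2),
              b = c(0, 0, 0, 4),
              c = c(0, 5, 1, 0))
vegdist(site, method = "bray")   # Bray-Curtis: joint zeros add nothing to similarity
dist(site)                       # Euclidean for comparison, where 0-0 counts as agreement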

Simultaneous Clustering in R

In order to provide the lay of the land, I will name at least five different approaches to simultaneous clustering. Accompanying each approach is an illustrative R package. The heading, simultaneous clustering, is meant to convey that the rows and columns are linked in ways that ought to impact the analysis. However, the diversity of the proposed solutions makes any single heading unsatisfying.
  1. Matrix factorization (NMF),
  2. Biclustering (biclust),
  3. Variable selection (clustvarsel),
  4. Regularization (sparcl), and
  5. Subspace clustering (HDclassif).
Clearly, there is more than one R package in each category, but I was more interested in an example than a catalog. I am not legislating; you are free to sort these R packages into your own categories and provide more or different R packages. I have made some distinctions that I believe are important and selected the packages that illustrate my point. I intend to be brief.

(1) Nonnegative matrix factorization (NMF) is an algorithm from linear algebra that decomposes the data matrix into the product of a row and column representation. If one were to separate clustering approaches into generative models and summarizing techniques, all matrix factorizations would fall toward the summary side of the separation. My blog is full of recent posts illustrating how well NMF works with marketing data.

(2) Biclustering has the feel of a Rubik's Cube with rows and columns being relocated. Though the data matrix is not a cube, the analogy works because one gets the dynamics of moving entire rows and columns all at the same time. In spite of the fact that the following figure is actually from the BlockCluster R package, it illustrates the concept. Panel a contains the data matrix for 10 rows labeled A through J and 7 numbered columns. Panel b reorders the rows (note for instance that row B is moved down and row H is moved up). Panel c continues the process by reordering some columns so that they follow the pattern summarized schematically in Panel d. To see how this plays out in practice, I have added this link to a market segmentation study analyzed with the biclust R package.
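If you want to try the reordering yourself, a minimal biclust sketch might look like the following; the made-up binary matrix, the Bimax method and its size thresholds are illustrative choices rather than a recipe.

library(biclust)
set.seed(7)
m <- matrix(rbinom(200, 1, 0.4), nrow = 20)               # small made-up binary matrix
res <- biclust(m, method = BCBimax(), minr = 2, minc = 2, number = 5)
res                                                       # prints the biclusters found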


Now, we return to generative models. (3) Variable selection is a variation on the finite mixture model from the model-based clustering R package mclust. As the name implies, it selects those variables needed to separate the clusters. The goal is to improve performance by removing irrelevant variables. More is better only when the more does not add more noise, in which case, more blurs the distinctions among the clusters. (4) Following the same line of reasoning, sparse clustering relies on a lasso-type penalty (regularization) to select features by assigning zero or near-zero weights to the less differentiating variables. The term "sparse" refers to the variable weights and not the data matrix. Both variable selection and sparse clustering deal with high dimensionality by reducing the number of variables contributing to the cluster solution.

(5) This brings us to my last category of subspace clustering, which I will introduce using the pop-up toaster as my example. Yes, I am speaking of the electric kitchen appliance with a lever and slots for slices of bread and other toastables ("let go of my eggo"). If you have a number of small children in the family, you might care about safety features, number of slots and speed of heating. On the other hand, an upscale empty nester might be concerned about the brand or how it looks on the counter when they entertain. The two segments reside in different subspaces, each with low dimensionality, but defined by different dimensions. The caring parent must know if the unit automatically shuts off when the toaster falls off the counter. The empty nester never inquires and has no idea.

None of this would be much of an issue if it did not conceal the underlying cluster solution. All measurement adds noise, and noise makes the irrelevant appear to have some minor impact. The higher the data dimensionality, the greater the distortion. Consumers will respond to every question even when asked about the unattended, the inconsequential or the unknown. Random noise is bad; systematic bias is even worse (e.g., social desirability, acquiescence and all the other measurement biases). Sparse clustering pushes the minor effects toward zero. Subspace clustering allows the clusters to have their own factor structures with only a very few intrinsic dimensions (as HDclassif calls it). For one segment the toaster is an appliance that looks good and prepares food to taste. For the other segment it is a way to feed quickly and avoid complaints while not getting anyone injured in the process. These worlds are as incommensurable as ever imagined by Thomas Kuhn.






Monday, September 1, 2014

Attention Is Preference: A Foundation Derived from Brand Involvement Segmentation

"A wealth of information creates a poverty of attention."

We categorize our world so that we can ignore most of it. In order to see figure, everything else must become ground. Once learned, the process seems automatic, and we forget how hard and long it took to achieve automaticity. It is not easy learning how to ride a bicycle, but we never forget. The same can be said of becoming fluent in a foreign language or learning R or deciding what toothpaste to buy. The difficulty of the task varies, yet the process remains the same.

Our attention is selective, as is our exposure to media and marketing communications. Neither is passive, although our awareness is limited because the process is automatic. We do not notice advertising for products we do not buy or use. We walk past aisles in the grocery store and never see anything on the shelves until a recipe or changing circumstances require that we look. We are lost in conversation at the cocktail party until we hear someone call our name. Source separation requires a considerable amount of learning and active processing of which we are unaware until it is brought to our attention.

Attention is Preference and the First Step in Brand Involvement

To attend is to prefer, though that preference may be negative as in avoidance rather than approach. Attention initiates the purchase process, so this is where we should begin our statistical modeling. We are not asking the consumer for inference, "Which of these contributes most to your purchase choice?" We are merely taking stock or checking inventory. If you wish to have more than a simple checklist, one can inquire about awareness, familiarity and usage for all of these are stored in episodic memory. In a sense, we are measuring attentional intensity with a behaviorally anchored scale. Awareness, familiarity and usage are three hurdles that a brand must surpass in order to achieve success. Attention becomes a continuum measured with milestones as brand involvement progresses from awareness to habit.

Still, the purpose of selective attention is simplification, so that much of the market and its features will never pass the first hurdle. We recognize and attend to that which is already known and familiar, and in the process, all else becomes background. Take a moment the next time you are at the supermarket making your usual purchases. As you reach for your brand, look at the surrounding area for all the substitute products that you never noticed because you were focused on one object on the shelf. In order to focus on one product or brand or feature, we must inhibit our response to all the rest. As the number of alternatives grows, attention becomes more scarce.

The long tail illustrates the type of data that needs to be modeled. If you enter "long tail" into a search engine looking for images, you will discover that the phenomenon seems to be everywhere as a descriptive model of product purchase, feature usage, search results and more. We need to be careful and keep the model descriptive rather than a claim that the future is selling less of more. For some childlike reason, I personally prefer the following image describing search results, with the long tail represented by the dinosaur, rather than the more traditional product popularity of the new marketplace.



Unfortunately, this figure conceals the heterogeneity that produces the long tail. In the aggregate we appear to have homogeneity when the tail may be produced by many niche segments seeking distinct sets of products or features. Attention is selective and enables us to ignore most of the market, yet individual consumers attend to their own products and features. Though we may wish to see ourselves as unique individuals, there are always many others with similar needs and interests, so that each of us belongs to a community whether we know it or not. Consequently, we start our study of preference by identifying consumer types who live in disparate worlds created by selective exposure and attention to different products and features.

Building a Foundation with Brand Involvement Segmentation

Even without intent, our attention is directed by prior experience and selective exposure through our social network and the means by which we learn about and buy products and services. Sparsity is not accidental but shaped by wants and needs within a particular context. For instance, knowing the brand and type of makeup that you buy tells me a great deal about your age and occupation and social status (and perhaps even your sex). Even if we restrict our sample to those who regularly buy makeup, the variety among users, products and brands is sufficient to generate segments who will never buy the same makeup through the same channel.

Why not just ask about benefits they are seeking or the features that interest them and cluster on the ratings or some type of forced choice (e.g., best-worst scaling)? Such questions do not access episodic memory and do not demand that the respondent relive past events. Instead, the responses are relatively complex constructions controlled by conversational rules that govern how and what we say about ourselves when asked by strangers.

As I have tried to outline in two previous posts, consumers do not possess direct knowledge of their purchase processes. Instead, they observe themselves and infer why they seem to like this but not that. Moreover, unless the question asks for recall of a specific occurrence, the answer will reflect the gist of the memory and measure overall affect (e.g., a halo effect). Thus, let us not contaminate our responses by requesting inference but restrict ourselves to concrete questions that can be answered by more direct retrieval. While all remembering is a constructive process, episodic memory requires less assembly.

Nonnegative Matrix Factorization (NMF)

Do we rely so much on rating scales because our statistical models cannot deal easily with highly skewed variables where the predominant response is never or not applicable? If so, R provides an interface to nonnegative matrix factorization (NMF), an algorithm that thrives on such sparse data matrices. During the past six weeks my posts have presented the R code needed to perform a NMF and have tried to communicate an intuitive sense of how and why such matrix factorization works in practice. You need only look in the titles for the keywords "matrix factorization" to find additional details in those previous posts.

I will draw an analogy with topic modeling in an attempt to explain this approach. Topic modeling starts with a bag of words used in a collection of documents. The assumptions are that the documents cover different topics and that the words used reflect the topics discussed by each document. In our makeup example, we might present a long checklist of brands and products replacing the bag of words in topic modeling. Then, instead of word counts as our intensity measure, we might ask about familiarity using an ordinal intensity scale (e.g., 0=never heard, 1=heard but not familiar, 2=somewhat familiar but never used, 3=used but not regularly, and 4=use regularly). Just as the word "401K" implies that the document deals with a financial topic, regular purchasing of Clinique Repairwear Foundation from Nordstrom helps me locate you within a particular segment of the cosmetics market. Nordstrom is an upscale department store, Clinique is not a mass market brand, and you can probably guess who Repairwear Foundation is for by the name alone.

Unfortunately, it is very difficult in a blog post to present brand usage data at a level of specificity that demonstrates how NMF is capable of handling many variables with much of the data being zeros (no awareness). Therefore, I will only attempt to give a taste of what such an analysis would look like with actual data. I have aggregated the data into brand-level familiarity using the ordinal scale discussed in the previous paragraph. I will not present any R code because the data are proprietary, and instead refer you to previous posts where you can find everything you need to run the nmf function from the NMF package (e.g., continuous or discrete latent structure, learn from top rankings, and pathways in consumer decision journey).

The output can be summarized with two heatmaps: one indicating the "loadings" of the brands on the latent features so that we can name those hidden constructs and the second clustering individuals based on those latent features.
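Both heatmaps come straight from the NMF package (a sketch, with fit standing in for the proprietary model object):

library(NMF)
coefmap(fit)    # brands against latent features: used to name the hidden constructs
basismap(fit)   # respondents against latent features: the segmentation view with its dendrogram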

Like factor analysis, one can vary the number of latent variables until an acceptable solution is found. The NMF package offers a number of criteria, but interpretability must take precedence. In general, we want to see a lot of yellow indicating that we have achieved some degree of simple structure. It would be helpful if each latent feature were anchored, that is, a few rows or columns with values near one. This is a restatement of the varimax criteria in factor rotation (see Varimax, page 3). The variance of factor loadings is maximized when the distribution is bimodal, and this type of separation is what we are seeking from our NMF.

The dendrogram at the top of the following heatmap displays the results of a hierarchical clustering of the brands based on their association with the latent features. It is a good place to start. I am not going into much detail, but let me name the latent features from Rows 1-6:  1. Direct Sales, 2. Generics, 3. Style, 4. Mass Market, 5. Upscale, and 6. Beauty Tools. The segments were given names that would be accessible to those with no knowledge of the cosmetics market. That is, differentiated retail markets in general tend to have a lower end with generic brands, a mass market in the middle with the largest share, and a group of more upscale brands at the high end. The distribution channel also has its impact with direct sales adding differentiation to the usual separation between supermarkets, drug stores, department and specialty stores.


Now, let us look at the same latent features in the second heatmap below using the dendrogram on the left as our guide. You should recall that the rows are consumers, so that the hierarchical clustering displayed by the dendrogram can be considered a consumer segmentation. As we work our way down from the top, we see the mass market in Column 4 (looking for both reddish blocks and gaps in the dendrogram), direct sales in Column 1 (again based on darker color but also glancing at the dendrogram), and beauty tools in Column 6. All three of these clusters are shown to be joined by the dendrogram later in the hierarchical process. The upscale segment in Column 5 forms its own cluster according to the dendrogram, as does the generics segment in Column 2. Finally, Column 3 represents those consumers who are more familiar with artistic brands.


My claim is that segments live in disparate worlds, or at least segregated neighborhoods, defined in this case study by user imagery (e.g., age and social status) and place of purchase (e.g., direct selling, supermarkets and drug stores, and the more upscale department and specialty stores). These segments may use similar vocabulary but probably mean something different. Everyone speaks of product quality and price; however, each segment applies such terms relative to its own circumstances. The drugstore and the department store shoppers have a different price range in mind when they tell us that price is not an important consideration in their purchase.

Without knowing the segment or the context, we learn little from asking importance ratings or forced tradeoffs such as MaxDiff, which is why the word "foundation" describes the brand involvement segmentation. We now have a basis for the interpretation of all perceptual and importance data collected with questions that have no concrete referent. The resulting segments ought to be analyzed separately for they are different communities speaking their own languages or at least having their own definitions of terms such as cost, quality, innovative, prestige, easy, service and support.

Of course, I have oversimplified to some extent in order for you to see the pattern that can be recovered from the heatmaps. We need to examine the dendrogram more carefully since each individual buys more than one brand of makeup for different occasions (e.g., day and evening, work and social). In fact, NMF is able to get very concrete and analyze the many possible combinations of product, brand, and usage occasion. More importantly, NMF excels with sparse data matrices, so do not be concerned if 90% of your data are zeros. The key to probing episodic memory is maintaining high imagery by asking for specifics with details about the occasion, the product and the brand so that the respondent may relive the experience. It may be a long list, but relevance and realism will encourage the respondent to complete a lengthy but otherwise easy task.

Lastly, one does not need to accept the default hierarchical clustering provided in the heatmap function. Some argue that an all-or-none hard clustering based on the highest latent feature weight or mixing coefficient is sufficient, and it may be if the individuals are well separated. However, you have the weights for every respondent, so any clustering method is an alternative. K-means is often suggested as it is the workhorse of clustering for good reason. Of course, the choice of clustering method depends on your prior beliefs concerning the underlying cluster structure, which would require some time to discuss. I will only note that I have experimented with some interesting options, including affinity propagation, and have had some success.

Postscript: It is not necessary to measure brand involvement across its entire range from attention through acquaintance to familiarity and habit. I have been successful with an awareness checklist. Yes, preference can be accessed with a simple recognition task (e.g., presenting a picture from a retail store with all the toothpastes in their actual places on the shelves and asking which ones they have seen before). Preference is everywhere because affect guides everything we notice, search for, learn about, discuss with others, buy, use, make a habit of, or recommend. All we needed was a statistical model for uncovering the pattern hidden in the data matrix.