Monday, September 1, 2014

Attention Is Preference: A Foundation Derived from Brand Involvement Segmentation

"A wealth of information creates a poverty of attention."

We categorize our world so that we can ignore most of it. In order to see figure, everything else must become ground. Once learned, the process seems automatic, and we forget how hard and long it took to achieve automaticity. It is not easy learning how to ride a bicycle, but we never forget. The same can be said of becoming fluent in a foreign language or learning R or deciding what toothpaste to buy. The difficulty of the task varies, yet the process remains the same.
Our attention is selective, as is our exposure to media and marketing communications. Neither is passive, although our awareness is limited because the process is automatic. We do not notice advertising for products we do not buy or use. We walk past aisles in the grocery store and never see anything on the shelves until a recipe or changing circumstances require that we look. We are lost in conversation at the cocktail party until we hear someone call our name. Source separation requires a considerable amount of learning and active processing of which we are unaware until it is brought to our attention.

Attention is Preference and the First Step in Brand Involvement

To attend is to prefer, though that preference may be negative, as in avoidance rather than approach. Attention initiates the purchase process, so this is where we should begin our statistical modeling. We are not asking the consumer for inference, "Which of these contributes most to your purchase choice?" We are merely taking stock or checking inventory. If you wish to have more than a simple checklist, you can inquire about awareness, familiarity and usage, for all of these are stored in episodic memory. In a sense, we are measuring attentional intensity with a behaviorally anchored scale. Awareness, familiarity and usage are three hurdles that a brand must clear in order to achieve success. Attention becomes a continuum measured with milestones as brand involvement progresses from awareness to habit.

Still, the purpose of selective attention is simplification, so that much of the market and its features will never pass the first hurdle. We recognize and attend to that which is already known and familiar, and in the process, all else becomes background. Take a moment the next time you are at the supermarket making your usual purchases. As you reach for your brand, look at the surrounding area for all the substitute products that you never noticed because you were focused on one object on the shelf. In order to focus on one product or brand or feature, we must inhibit our response to all the rest. As the number of alternatives grows, attention becomes scarcer.

The long tail illustrates the type of data that needs to be modeled. If you enter "long tail" into a search engine looking for images, you will discover that the phenomenon seems to be everywhere as a descriptive model of product purchase, feature usage, search results and more. We need to be careful to keep the model descriptive rather than treating it as a claim that the future is selling less of more. For some childlike reason, I personally prefer the following image describing search results, with the long tail represented by the dinosaur, rather than the more traditional depiction of product popularity in the new marketplace.



Unfortunately, this figure conceals the heterogeneity that produces the long tail. In the aggregate we appear to have homogeneity when the tail may be produced by many niche segments seeking distinct sets of products or features. Attention is selective and enables us to ignore most of the market, yet individual consumers attend to their own products and features. Though we may wish to see ourselves as unique individuals, there are always many others with similar needs and interests, so that each of us belongs to a community whether we know it or not. Consequently, we start our study of preference by identifying consumer types who live in disparate worlds created by selective exposure and attention to different products and features.

Building a Foundation with Brand Involvement Segmentation

Even without intent, our attention is directed by prior experience and selective exposure through our social network and the means by which we learn about and buy products and services. Sparsity is not accidental but shaped by wants and needs within a particular context. For instance, knowing the brand and type of makeup that you buy tells me a great deal about your age and occupation and social status (and perhaps even your sex). Even if we restrict our sample to those who regularly buy makeup, the variety among users, products and brands is sufficient to generate segments who will never buy the same makeup through the same channel.

Why not just ask consumers about the benefits they are seeking or the features that interest them and cluster on the ratings or on some type of forced choice (e.g., best-worst scaling)? Such questions do not access episodic memory and do not demand that the respondent relive past events. Instead, the responses are relatively complex constructions controlled by conversational rules that govern how and what we say about ourselves when asked by strangers.

As I have tried to outline in two previous posts, consumers do not possess direct knowledge of their purchase processes. Instead, they observe themselves and infer why they seem to like this but not that. Moreover, unless the question asks for recall of a specific occurrence, the answer will reflect the gist of the memory and measure overall affect (e.g., a halo effect). Thus, let us not contaminate our responses by requesting inference but restrict ourselves to concrete questions that can be answered by more direct retrieval. While all remembering is a constructive process, episodic memory requires less assembly.

Nonnegative Matrix Factorization (NMF)

Do we rely so much on rating scales because our statistical models cannot deal easily with highly skewed variables where the predominant response is never or not applicable? If so, R provides an interface to nonnegative matrix factorization (NMF), an algorithm that thrives on such sparse data matrices. During the past six weeks my posts have presented the R code needed to perform an NMF and have tried to communicate an intuitive sense of how and why such matrix factorization works in practice. You need only look for the keywords "matrix factorization" in the titles to find additional details in those previous posts.

I will draw an analogy with topic modeling in an attempt to explain this approach. Topic modeling starts with a bag of words used in a collection of documents. The assumptions are that the documents cover different topics and that the words used reflect the topics discussed by each document. In our makeup example, we might present a long checklist of brands and products replacing the bag of words in topic modeling. Then, instead of word counts as our intensity measure, we might ask about familiarity using an ordinal intensity scale (e.g., 0=never heard, 1=heard but not familiar, 2=somewhat familiar but never used, 3=used but not regularly, and 4=use regularly). Just as the word "401K" implies that the document deals with a financial topic, regular purchasing of Clinique Repairwear Foundation from Nordstrom helps me locate you within a particular segment of the cosmetics market. Nordstrom is an upscale department store, Clinique is not a mass market brand, and you can probably guess who Repairwear Foundation is for by the name alone.
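As a tiny sketch of how such checklist responses could be coded as the 0-4 intensity scores (the labels follow the text above; the example response vector is made up):

# Hedged sketch: code familiarity responses as an ordered factor and recover 0-4 scores
familiarity_levels <- c("never heard", "heard but not familiar",
                        "somewhat familiar but never used",
                        "used but not regularly", "use regularly")
resp <- factor(c("use regularly", "never heard", "heard but not familiar"),
               levels = familiarity_levels, ordered = TRUE)
as.integer(resp) - 1   # returns 4 0 1, the ordinal intensity scores used in the text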

Unfortunately, it is very difficult in a blog post to present brand usage data at a level of specificity that demonstrates how NMF is capable of handling many variables with much of the data being zeros (no awareness). Therefore, I will only attempt to give a taste of what such an analysis would look like with actual data. I have aggregated the data into brand-level familiarity using the ordinal scale discussed in the previous paragraph. I will not present any R code because the data are proprietary, and instead refer you to previous posts where you can find everything you need to run the nmf function from the NMF package (e.g., continuous or discrete latent structure, learn from top rankings, and pathways in consumer decision journey).

The output can be summarized with two heatmaps: one indicating the "loadings" of the brands on the latent features so that we can name those hidden constructs and the second clustering individuals based on those latent features.
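Although the actual data are proprietary, a minimal sketch with simulated stand-in data (every brand name, the planted three-segment structure, and the sample size are hypothetical) can show what producing those two heatmaps looks like with the NMF package:

library(NMF)

set.seed(123)
# Hypothetical stand-in: 200 respondents rating 30 brands on the 0-4 familiarity scale,
# mostly zeros, with three planted "neighborhoods" of brands for NMF to recover.
n_resp <- 200
brands <- paste0("Brand", 1:30)
fam <- matrix(0, n_resp, length(brands), dimnames = list(NULL, brands))
segment <- sample(1:3, n_resp, replace = TRUE)            # three hidden consumer types
for (i in 1:n_resp) {
  own <- ((segment[i] - 1) * 10 + 1):(segment[i] * 10)    # each type attends to its own 10 brands
  fam[i, sample(own, 4)] <- sample(1:4, 4, replace = TRUE)
}

fit <- nmf(fam, 3, "lee", nrun = 10)   # rank 3 because three segments were planted
coefmap(fit)    # brands on the latent features, used to name the hidden constructs
basismap(fit)   # respondents clustered on those latent features, the segmentation heatmap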

Like factor analysis, one can vary the number of latent variables until an acceptable solution is found. The NMF package offers a number of criteria, but interpretability must take precedence. In general, we want to see a lot of yellow, indicating that we have achieved some degree of simple structure. It would be helpful if each latent feature were anchored, that is, had a few rows or columns with values near one. This is a restatement of the varimax criterion in factor rotation (see Varimax, page 3). The variance of factor loadings is maximized when the distribution is bimodal, and this type of separation is what we are seeking from our NMF.
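One way to apply those criteria is the NMF package's rank survey, sketched here as a continuation of the simulated fam matrix from the block above (the range of ranks is an arbitrary illustration):

# Hedged sketch: fit several candidate ranks and plot the package's quality measures by rank
estim <- nmf(fam, 2:5, "lee", nrun = 10, seed = 123456)
plot(estim)   # quality measures (cophenetic correlation, residuals, etc.) across ranks
# Whatever the curves suggest, interpretability of the heatmaps still decides.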

The dendrogram at the top of the following heatmap displays the results of a hierarchical clustering of the brands based on their association with the latent features. It is a good place to start. I am not going into much detail, but let me name the latent features from Rows 1-6:  1. Direct Sales, 2. Generics, 3. Style, 4. Mass Market, 5. Upscale, and 6. Beauty Tools. The segments were given names that would be accessible to those with no knowledge of the cosmetics market. That is, differentiated retail markets in general tend to have a lower end with generic brands, a mass market in the middle with the largest share, and a group of more upscale brands at the high end. The distribution channel also has its impact with direct sales adding differentiation to the usual separation between supermarkets, drug stores, department and specialty stores.


Now, let us look at the same latent features in the second heatmap below using the dendrogram on the left as our guide. You should recall that the rows are consumers, so the hierarchical clustering displayed by the dendrogram can be considered a consumer segmentation. As we work our way down from the top, we see the mass market in Column 4 (looking for both reddish blocks and gaps in the dendrogram), direct sales in Column 1 (again based on darker color but also glancing at the dendrogram), and beauty tools in Column 6. All three of these clusters are shown to be joined by the dendrogram later in the hierarchical process. The upscale brands in Column 5 form their own cluster according to the dendrogram, as do the generics in Column 2. Finally, Column 3 represents those consumers who are more familiar with artistic brands.


My claim is that segments live in disparate worlds, or at least segregated neighborhoods, defined in this case study by user imagery (e.g., age and social status) and place of purchase (e.g., direct selling, supermarkets and drug stores, and the more upscale department and specialty stores). These segments may use similar vocabulary but probably mean something different. Everyone speaks of product quality and price; however, each segment applies such terms relative to its own circumstances. The drugstore and the department store shoppers have different price ranges in mind when they tell us that price is not an important consideration in their purchase.

Without knowing the segment or the context, we learn little from asking for importance ratings or forced trade-offs such as MaxDiff, which is why the word "foundation" describes the brand involvement segmentation. We now have a basis for the interpretation of all perceptual and importance data collected with questions that have no concrete referent. The resulting segments ought to be analyzed separately, for they are different communities speaking their own languages or at least having their own definitions of terms such as cost, quality, innovative, prestige, easy, service and support.

Of course, I have oversimplified to some extent in order for you to see the pattern that can be recovered from the heatmaps. We need to examine the dendrogram more carefully since each individual buys more than one brand of makeup for different occasions (e.g., day and evening, work and social). In fact, NMF is able to get very concrete and analyze the many possible combinations of product, brand, and usage occasion. More importantly, NMF excels with sparse data matrices, so do not be concerned if 90% of your data are zeros. The key to probing episodic memory is maintaining high imagery by asking for specifics with details about the occasion, the product and the brand so that the respondent may relive the experience. It may be a long list, but relevance and realism will encourage the respondent to complete a lengthy but otherwise easy task.

Lastly, one does not need to accept the default hierarchical clustering provided in the heatmap function. Some argue that an all-or-none hard clustering based on the highest latent feature weight or mixing coefficient is sufficient, and it may be if the individuals are well separated. However, you have the weights for every respondent, so any clustering method is an alternative. K-means is often suggested because it is the workhorse of clustering for good reason. Of course, the choice of clustering method depends on your prior beliefs concerning the underlying cluster structure, which would require some time to discuss. I will only note that I have experimented with some interesting options, including affinity propagation, and have had some success.
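For example, here is a hedged sketch of a k-means alternative on the mixing weights, again continuing the simulated fit from the earlier sketch (the row normalization and the choice of three clusters are illustrative assumptions, not recommendations):

# Cluster respondents on their latent feature weights instead of accepting
# the default hierarchical clustering from the heatmap.
w <- basis(fit)              # respondents by latent features
w_norm <- w / rowSums(w)     # each respondent's weights rescaled to sum to one
set.seed(456)
km <- kmeans(w_norm, centers = 3, nstart = 25)
table(km$cluster)            # segment sizes
round(km$centers, 2)         # average latent feature profile for each segment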

Postscript: It is not necessary to measure brand involvement across its entire range from attention through acquaintance to familiarity and habit. I have been successful with an awareness checklist. Yes, preference can be accessed with a simple recognition task (e.g., presenting a picture from a retail store with all the toothpastes in their actual places on the shelves and asking which ones they have seen before). Preference is everywhere because affect guides everything we notice, search for, learn about, discuss with others, buy, use, make a habit of, or recommend. All we needed was a statistical model for uncovering the pattern hidden in the data matrix.

Monday, August 25, 2014

Continuous or Discrete Latent Structure? Correspondence Analysis vs. Nonnegative Matrix Factorization

A map gives us the big picture, which is why mapping has become so important in marketing research. What is the perceptual structure underlying the European automotive market? All we need is a contingency table with cars as the rows, attributes as the columns, and the cells as counts of the number of times each attribute is associated with each car. As shown in a previous post, correspondence analysis (CA) will produce maps like the following.


Although everything you need to know about this graphic display can be found in that prior post, I do wish to emphasize a few points. First, the representation is a two-dimensional continuous space with coordinates for each row and each column. Second, the rows (cars) are positioned so that the distance between any two rows indicates the similarity of their relative attribute perceptions (i.e., different cars may score uniformly higher or lower, but they have the same pattern of strengths and weaknesses). Correspondingly, the columns (attributes) are located closer to each other when they are used similarly to describe the automobiles. Distances between the rows and columns are not directly shown on this map, yet the cross-tabulation from the original post shows that autos are near the attributes on which they performed best. The analysis was conducted with the R package anacor; however, the ca R package might provide a gentler introduction to CA, especially when paired with Greenacre's online tutorial.
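For readers who want to reproduce a map like this one, a minimal sketch using the same publicly available car perception data from the plfm package that is factored later in this post (two dimensions, the default, are assumed):

library(plfm)     # supplies the cars-by-attributes cross-tabulation car$freq1
library(anacor)

data(car)
ca_fit <- anacor(car$freq1, ndim = 2)   # two-dimensional correspondence analysis
ca_fit                                  # prints the chi-square decomposition across dimensions
plot(ca_fit)                            # joint map of the cars (rows) and attributes (columns)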

CA yields a continuous representation. The first dimension separates economy from luxury vehicles, and the second dimension differentiates between the smaller and the larger cars. Still, one can identify regions or clusters within this continuous space. For example, one could easily group the family cars in the third quadrant. Such an interpretation is consistent with the R package from which the dataset was borrowed (e.g., Slide #6). A probabilistic latent feature model (plfm) assumes that the underlying structure is defined by binary features that are hidden or unobserved.

What is in the mind of our raters? Do they see the vehicles as possessing more or less of the two dimensions from the CA, or are their perceptions driven by a set of on-off features (e.g., small popular, sporty status, spacious family, quality luxury, green, and city car)? If the answer is a latent category structure, then the success of CA stems from its ability to reproduce the row and column profiles from a dimensional representation even when the data were generated from the perceived presence or absence of latent features. Alternatively, the seemingly latent features may well be nothing more than an uneven distribution of rows and columns across the continuous space. We have the appearance of discontinuity simply because there are empty spaces that could be filled by adding more objects and features.

Spoiler alert: An adaptive consumer improvises and adopts whatever representational system works in that context. Dimensional maps provide an overview of the terrain and seem to be employed whenever many objects and/or features need to be considered jointly. Detailed trade-offs focus on the features. No one should be surprised to discover a pragmatic consumer switching between decision strategies, with their associated spatial or category representations, over the purchase process as needed to complete their tasks.

Nonnegative Matrix Factorization of Car Perceptions

I will not repeat the comprehensive and easy-to-follow analysis of this automobile data from the plfm R package. All the details are provided in Section 4.3 of Michel Meulders' Journal of Statistical Software article (see p. 13 for a summary). Instead, I will demonstrate how nonnegative matrix factorization (NMF) produces the same results using a different approach. At the end of my last post, you can find links to all that I have written about NMF. What you will learn is that NMF extracts latent features when everything is restricted to be nonnegative. This is not a necessary result, and one can find exceptions in the literature. However, as we will see later in this post, there are good reasons to believe that NMF will deliver feature-like latent variables with marketing data.

We require very little R code to perform the NMF. As shown below, we attach the plfm package and the dataset named car, which is actually a list of three elements. The cross-tabulation is the element of the list named car$freq1. The nmf function from the NMF package takes the data matrix, the number of latent features (plfm set the rank to 6), the method (lee) and the number of times to repeat the analysis with different starting values. Like K-means, NMF can find itself lost in a local minimum, so it is a good idea to rerun the factorization with different random start values and keep the best solution. We are looking for a global minimum; thus, we should set nrun to a number large enough that we find a similar result when the entire nmf function is executed again.

library(plfm)
data(car)      # car$freq1 is the cars-by-attributes cross-tabulation

library(NMF)
# Six latent features (the rank used by plfm), Lee's algorithm, 20 random restarts
fit <- nmf(car$freq1, 6, "lee", nrun=20)

# Coefficient matrix (latent features by attributes), each row rescaled so its maximum is one
h <- coef(fit)
max_h <- apply(h, 1, max)
h_scaled <- h/max_h
library(psych)
fa.sort(t(round(h_scaled, 3)))   # transpose and sort for easier reading

# Basis matrix (cars by latent features), each row rescaled to sum to one
w <- basis(fit)
wp <- w/apply(w, 1, sum)
fa.sort(round(wp, 3))

# Heatmaps of the coefficient and basis matrices
coefmap(fit)
basismap(fit)


In order not to be confused by the output, one needs to note the rows and columns of the data matrix. The cars are the rows and the features are the columns. The basis is always rows-by-latent features; therefore, our basis will be cars by six latent features. The coefficient matrix is always latent features-by-columns, or six latent features by observed features. It is convenient to print the transpose of the coefficient matrix since the number of latent features is often much less than the number of observed features.
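Continuing from the fitted object in the code above, a quick check confirms those orientations:

dim(car$freq1)    # 14 cars by 27 attributes
dim(basis(fit))   # 14 cars by 6 latent features (the basis, W)
dim(coef(fit))    # 6 latent features by 27 attributes (the coefficient matrix, H)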

Basis Matrix

                      Green  Family  Luxury  Popular  City  Sporty
Toyota Prius           0.82    0.08    0.08     0.00  0.01    0.02
Renault Espace         0.09    0.71    0.02     0.00  0.19    0.00
Citroen C4 Picasso     0.18    0.58    0.00     0.10  0.14    0.00
Ford Focus Cmax        0.00    0.50    0.04     0.35  0.11    0.00
Volvo V50              0.24    0.39    0.29     0.08  0.00    0.00
Mercedes C-class       0.04    0.01    0.69     0.00  0.11    0.16
Audi A4                0.00    0.10    0.43     0.14  0.12    0.21
Opel Corsa             0.17    0.00    0.00     0.83  0.01    0.00
Volkswagen Golf        0.00    0.02    0.29     0.67  0.02    0.00
Mini Cooper            0.00    0.00    0.15     0.00  0.70    0.15
Fiat 500               0.33    0.00    0.00     0.18  0.49    0.00
Mazda MX5              0.01    0.00    0.03     0.00  0.26    0.70
BMW X5                 0.00    0.18    0.26     0.00  0.00    0.56
Nissan Qashgai         0.06    0.35    0.00     0.08  0.00    0.51

Coefficient Matrix

                          Green  Family  Luxury  Popular  City  Sporty
Environmentally friendly   1.00    0.05    0.08     0.34  0.19    0.00
Technically advanced       0.68    0.00    0.62     0.00  0.00    0.35
Green                      0.66    0.02    0.06     0.06  0.04    0.00
Family Oriented            0.35    1.00    0.24     0.08  0.00    0.00
Versatile                  0.15    0.53    0.27     0.25  0.00    0.16
Luxurious                  0.00    0.10    1.00     0.00  0.12    0.56
Reliable                   0.21    0.27    0.95     0.69  0.06    0.18
Safe                       0.08    0.34    0.88     0.41  0.00    0.10
High trade-in value        0.00    0.00    0.85     0.21  0.00    0.13
Comfortable                0.08    0.57    0.84     0.15  0.04    0.19
Status symbol              0.08    0.00    0.81     0.00  0.40    0.60
Sustainable                0.33    0.23    0.71     0.44  0.00    0.02
Workmanship                0.24    0.03    0.58     0.00  0.00    0.25
Practical                  0.09    0.60    0.17     1.00  0.52    0.00
City focus                 0.51    0.00    0.00     0.94  0.93    0.00
Popular                    0.00    0.23    0.25     0.94  0.52    0.00
Economical                 0.90    0.13    0.00     0.93  0.27    0.00
Good price-quality ratio   0.35    0.25    0.00     0.88  0.08    0.12
Value for the money        0.12    0.16    0.10     0.60  0.01    0.10
Agile                      0.12    0.06    0.18     0.87  1.00    0.16
Attractive                 0.04    0.08    0.58     0.33  0.79    0.50
Nice design                0.04    0.10    0.38     0.23  0.77    0.46
Original                   0.36    0.00    0.00     0.03  0.76    0.21
Exclusive                  0.10    0.00    0.13     0.00  0.38    0.26
Sporty                     0.00    0.00    0.40     0.27  0.45    1.00
Powerful                   0.00    0.12    0.70     0.02  0.00    0.74
Outdoor                    0.00    0.29    0.00     0.07  0.00    0.57

As the number of rows and columns increases, these matrices become more and more cumbersome. Although we do not require a heatmap for this cross-tabulation, we will when the rows of the data matrix represent individual respondents. Now is a good time to introduce such a heatmap since we have the basis and coefficient matrices from which they are built. The basis heatmap showing the association between the vehicles and the latent features will be shown first. Lots of yellow is good, for it indicates simple structure. As suggested in an earlier post, NMF is easiest to learn if we use the language of factor analysis, and simple structure implies that each car is associated with only one latent feature (one reddish block per row and the rest pale or yellow).


The Toyota Prius falls at the bottom where it "loads" on only the first column. Looking back at the basis matrix, we can see the actual numbers with the Prius having a weight of 0.82 on the first latent feature that we named "Green" because of its association with the observed features in the Coefficient Matrix that seem to measure an environmental or green construct. The other columns and vehicles are interpreted similarly, and we can see that the heatmap is simply a graphic display of the basis matrix. It is redundant when there are few rows and columns. It will become essential when we have 1000 respondents as the rows of our data matrix.

For completeness, I will add the coefficient heatmap displaying the coefficient matrix before it was transposed. Again, we are looking for simple structure with observed features associated with only one latent feature. We have some degree of success, but you can still see overlap between family (latent feature #2) and luxury (latent feature #3) and between popular (#4) and city (#5).


We observed the same pattern defined by the same six latent features as that reported by Meulders using a probabilistic latent feature model. That is, one can simply compare the estimated object and attribute parameters from the JSS article (p. 12) and the two matrices above to confirm the correspondence with correlations over 0.90 for all six latent variables. However, we have reached the same conclusions via very different statistical models. The plfm is a process model specifying a cognitive model of how object-attribute associations are formed. NMF is a matrix factorization algorithm from linear algebra.

The success of NMF has puzzled researchers for some time. We like to say that the nonnegative constraints direct us toward separating the whole into its component parts (Lee and Seung). Although I cannot tell you why NMF seems to succeed in general, I can say something about why it works with consumer data. Products do well when they deliver communicable benefits that differentiate them from their competitors. Everyone knows the reasons for buying a BMW even if they have no interest in owning or driving the vehicle. Products do not survive in a competitive market unless their perceptions are clear and distinct, nor will the market support many brands occupying the same positioning. Early entries create barriers so that additional "me-too" brands cannot enter. Such is the nature of competitive advantage. As a result, consumer perceptions can be decomposed into their separable brand components with their associated attributes.

Discrete or Continuous Latent Structure?

Of course, my answer has already been given in a prior spoiler alert. We do both, using dimensions for the big picture and features for more detailed comparisons. The market is separable into brands offering differentiated benefits. However, this categorization has a dissimilarity structure. The categories are contrastive, which is what creates the dimensions. For example, the luxury-economy dimension from the CA is not a quantity like length or weight or volume in which more is more of the same thing. Two liters of water is just the concatenation of two one-liter volumes of water. Yet, no number of economy cars makes a luxury automobile. These axes are not quantities but dimensions that impose a global ordering on the vehicle types while retaining a local structure defined by the features.

Hopefully, one last example will clarify this notion of dimension-as-ordering-of-distinct-types. Odors clearly fall along an approach-avoidance continuum. Lemons attract and sewers repel. Nevertheless, odors are discrete categories even when they are equally appealing or repulsive. A published NMF analysis of the human odor descriptor space used the term "categorical dimensions" because the "odor space is not occupied homogeneously, but rather in a discrete and intrinsically clustered manner." Brands are also discrete categories that can be ordered along a continuum anchored by the most extreme features at each end. Moreover, the features that we associate with various brands differ in kind and not just intensity. Both the brands and the features can be arrayed along the same dimensions; however, those dimensions contain discontinuities or gaps where there are no intermediate brands or features.

Applying the concept of categorical dimensions to our perceptual data suggests that we may wish to combine the correspondence map and the NMF using a neighborhood interpretation of the map with the neighborhoods determined by the latent features of the NMF. Such a diagram is not uncommon in multidimensional scaling (MDS) where circles are drawn around the points falling into the same hierarchical clusters. Kruskal and Wish give us an example in Figure 14 (page 44). In 1978, when their book was published, hierarchical cluster analysis was the most likely technique for clustering a distance matrix. MDS and hierarchical clustering use the same data matrix, but make different assumptions concerning the distance metric. Yet, as with CA and NMF, when the structure is well-formed, the two methods yield comparable results.
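A hedged sketch of one such combination with the car data: plot the CA row coordinates and mark each car with its dominant NMF latent feature (the hard assignment and the use of the ca package here are illustrative choices, not the only way to draw such neighborhoods):

library(plfm)
library(NMF)
library(ca)

data(car)
nmf_fit <- nmf(car$freq1, 6, "lee", nrun = 20)
neighborhood <- apply(basis(nmf_fit), 1, which.max)   # each car's dominant latent feature

ca_fit <- ca(car$freq1)
xy <- ca_fit$rowcoord[, 1:2]                          # car coordinates on the first two dimensions
plot(xy, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(xy, labels = rownames(car$freq1), col = neighborhood)   # color cars by their NMF neighborhood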

In the end, we are not forced to decide between categories or dimensions. Both CA and NMF scale rows and columns simultaneously. The dimensions of CA order those rows and columns along a continuum with gaps and clusters. This is the nature of ordinal scales that depend not on intensity or quantity but on the stuff that is being scaled. In a similar manner, the latent features or categories of NMF have a similarity structure and can be ordered. The term "categorical dimensions" captures this hybrid scaling that is not exclusively continuous or categorical.