Saturday, June 21, 2014

Separating Statistical Models of "What Is Learned" from "How It Is Learned"

Something triggers our interest. Possibly it's an ad, a review, or just word of mouth. We want to know more about the movie, the device, the software, or the service. Because we come with different preferences and needs, our searches vary in intensity. For some it is one and done, but others expend some effort and seek out many sources. My last post on the consumer decision journey laid out the argument for using nonnegative matrix factorization and the R package NMF to identify the different pathways taken in the search for product information.

Information Search ≠ Knowledge Acquired

It is easy to confuse the learning process with what is learned. The internet gives consumers control over their information search, and they are free to improvise as they wish. However, what is learned remains determined by the competition in the product category. What knowledge do we acquire as we search online or in person? Careful, there is no exam, therefore we are not required to be objective or thorough. Andy Clark reminds us that "...minds evolved to make things happen." So we learn what is available and what we want because we intend to make a purchase. We learn only what we need to know to make a choice.

The marketer and the consumer join forces to simplify the purchase process so that only a limited amount of information search and knowledge acquisition is needed to reach a satisfying decision. When the choice is hard, only a few buy. The simplification is a one-dimensional array of features and benefits running from the basic to the premium product, from the least to the most expensive. Every product category offers alternatives that are good, better, and best. Learning this is not difficult since everyone is ready and willing to help. The marketing department, the experts who review and recommend, and even other users will let you know what features differentiate the good from the better and the better from the best. One cannot search long for product information without learning what features are basic, what features are added to create the next quality level, and finally what features indicate a premium product.

In the end, we require one statistical model for analyzing how well the brand is doing and a different statistical model for investigating the pathways taken in the consumer decision journey. As we have already seen, R provides a method for uncovering the learning pathways with matrix factorization packages such as NMF. Brand performance or achievement (what is learned) can be modeled using latent-trait or item response theory (see the section "Thinking Like an Item Response Theorist"). I have provided more detail in previous posts showing how to analyze both checklists and rating scales.

Marketers have resource allocation decisions to make. They need to put their money where consumers are looking. When marketing is successful, the brand will do well. Since process and product are distinct systems governed by different rules, each demands its own statistical model and associated measurement procedures.

Friday, June 13, 2014

Identifying Pathways in the Consumer Decision Journey: Nonnegative Matrix Factorization

The Internet has freed us from the shackles of the yellow page directory, the trip to the nearby store to learn what is available, and the forced choice among a limited set of alternatives. The consumer is in control of their purchase journey and can take any path they wish. But do they? It's a lot of work for our machete-wielding consumer cutting their way through the product jungle. The consumer decision journey is not an itinerary, but neither is it aimless meandering. Perhaps we do not wish to follow the well-worn trail laid out by some marketing department. The consumer is free to improvise, not by going where no one has gone before, but by creating personal variation using a common set of journey components shared with others.

Even with all the different ways to learn about products and services, we find constraints on the purchase process with some touchpoint combinations more likely than others. For example, one could generate a long list of all the possible touchpoints that might trigger interest, provide information, make recommendations, and enable purchase. Yet, we would expect any individual consumer to encounter only a small proportion of this long list. A common journey might be no more than a seeing an ad followed by a trip to a store. For frequently purchased products, the entire discovery-learning-comparing-purchase process could collapse into a single point-of-sale (PoS) touchpoint, such as product packaging on a retail shelf.

The figure below comes from a touchpoint management article discussing the new challenges of online marketing. This example was selected because it illustrates how easy it is to generate touchpoints as we think of all the ways that a consumer interacts with or learns about a product. Moreover, we could have been much more detailed because episodic memory allows us to relive the product experience (e.g., the specific ads seen, the packaging information attended to, the pages of the website visited). The touchpoint list quickly gets lengthy, and the data matrix becomes sparser because an individual consumer is not likely to engage intensively with many products. The resulting checklist dataset is a high-dimensional consumer-by-touchpoint matrix with lots of columns and cells containing some ones but mostly zeroes.

It seems natural to subdivide the columns into separate modes of interaction as shown by the coloring in the above figure (e.g., POS, One-to-One, Indirect, and Mass). It seems natural because different consumers rely on different modes to learn and interact with product categories. Do you buy by going to the store and selecting the best available product, or do you search and order online without any physical contact with people or product? Like a Rubik's cube, we might be able to sort rows and columns simultaneously so that the reordered matrix would appear to be block diagonal with most of the ones within the blocks and most of the zeroes outside. You can find an illustration in a previous post on the reorderable data matrix. As we shall see later, nonnegative matrix factorization "reorders" indirectly by excluding negative entries in the data matrix and its factors. A more direct approach to reordering would use the R packages for biclustering or seriation. Both of these links offer different perspectives on how to cluster or order rows and columns simultaneously.
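The reordering idea can be sketched in a few lines of base R. This is only a toy illustration, not the seriation or biclustering packages themselves: it builds a block-diagonal binary matrix, shuffles it, and then partially recovers the blocks with a simple barycenter heuristic (ordering rows and columns by the mean index of their ones). All names and sizes here are made up for the demonstration.

```r
set.seed(42)

# two consumer segments, each concentrated in its own block of 10 touchpoints
block1 <- cbind(matrix(rbinom(60, 1, 0.7), 6, 10),
                matrix(rbinom(60, 1, 0.05), 6, 10))
block2 <- cbind(matrix(rbinom(60, 1, 0.05), 6, 10),
                matrix(rbinom(60, 1, 0.7), 6, 10))
v <- rbind(block1, block2)

# shuffle rows and columns to hide the block structure
shuffled <- v[sample(nrow(v)), sample(ncol(v))]

# barycenter heuristic: order rows and columns by the mean position of their ones
row_score <- apply(shuffled, 1, function(x) mean(which(x == 1)))
col_score <- apply(shuffled, 2, function(x) mean(which(x == 1)))
reordered <- shuffled[order(row_score), order(col_score)]
```

One pass of this heuristic usually pulls most of the ones back toward the diagonal blocks; the dedicated packages iterate and optimize explicit criteria.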

Nonnegative Matrix Factorization (NMF) with Simulated Data

I intend to rely on the R package NMF and a simulated data set based on the above figure. I will keep it simple and assume only two pathways: an online journey through the 10 touchpoints marked with an "@" in the above figure and an offline journey through the remaining 20 touchpoints. Clearly, consumers are more likely to encounter some touchpoints more often than others, so I have made some reasonable but arbitrary choices. The R code at the end of this post reveals the choices that were made and how I generated the data using the sim.rasch function from the psych R package. Actually, all you need to know is that the dataset contains 400 consumers, 200 interacting more frequently online and 200 with greater offline contact. I have sorted the 30 touchpoints from the above figure so that the first 10 are online (e.g., search engine, website, user forum) and the last 20 are offline (e.g., packaging information, ad in magazine, information display). Although the patterns within each set of online and offline touchpoints are similar, the result is two clearly different pathways as shown by the following plot.

It should be noted that the 400 x 30 data matrix contained mostly zeroes with only 11.2% of the 12,000 cells indicating any contact. Seven of the respondents did not indicate any interaction at all and were removed from the analysis. The mode was 3 touchpoints per consumer, and no one reported more than 11 interactions (although the verb "reported" might not be appropriate to describe simulated data).
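The counts reported in this paragraph come from simple descriptive checks of the data matrix. The sketch below shows the kind of summaries involved, using a randomly filled stand-in matrix rather than the actual simulated dataset:

```r
set.seed(7)
tp <- matrix(rbinom(400 * 30, 1, 0.11), 400, 30)  # stand-in binary checklist

mean(tp)                        # proportion of the 12,000 cells with any contact
rows <- rowSums(tp)             # touchpoints per consumer
sum(rows == 0)                  # respondents with no interactions (to be removed)
names(which.max(table(rows)))   # modal number of touchpoints per consumer
max(rows)                       # most interactions by any one consumer
```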

If all I had was the 400 respondents, how would I identify the two pathways? Actually, k-means often does quite well, but not in this case with so many infrequent binary variables. Dolnicar and her colleagues, whose work uses the biclustering approach mentioned earlier, help us understand the problems encountered when conducting market segmentation with high-dimensional data. When asked to separate the 400 into two groups, k-means clustering was able to identify correctly only 55.5% of the respondents. Before we overgeneralize, let me note that k-means performed much better when the proportions were higher (e.g., raise both lines so that they peak above 0.5 instead of below 0.4), although that is not much help with high-dimensional sparse data.
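Accuracy figures like the 55.5% above require matching arbitrary cluster labels to the known segments, since k-means is free to call either segment "cluster 1." A minimal sketch for the two-cluster case (my own helper function, not from any package):

```r
# accuracy over both possible label matchings in the two-cluster case
cluster_accuracy <- function(truth, cluster) {
  tab <- table(truth, cluster)
  max(sum(diag(tab)), sum(diag(tab[, 2:1]))) / sum(tab)
}

truth   <- rep(c(1, 2), each = 5)
cluster <- c(2, 2, 2, 2, 1, 1, 1, 1, 1, 1)  # labels flipped relative to truth
cluster_accuracy(truth, cluster)            # 0.9
```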

And, what about NMF? I will start with the results so that you will be motivated to remain for the explanation in the next section. Overall, NMF correctly placed 81.4% of the respondents: 85.9% of the offline segment and 76.9% of the online segment. In addition, NMF extracted two latent variables that separated the 30 touchpoints into the two sets of 10 online and 20 offline interactions.

So, what is nonnegative matrix factorization?

Have you run or interpreted a factor analysis? Factor analysis is matrix factorization where the correlation matrix R is factored into factor loadings: R = FF'. Structural equation modeling is another example of matrix factorization, where we add direct and indirect paths between the latent variables to the factor model connecting observed and latent variables. However, unlike the two previous models that factor the correlation or variance-covariance matrix among the observed variables, NMF attempts to decompose the actual data matrix.
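The claim that factor analysis factors the correlation matrix can be checked directly in R: the loadings plus the uniquenesses should approximately reproduce R. A quick sketch using the built-in mtcars data (my choice of variables here is arbitrary, just to have something to factor):

```r
vars <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
fit <- factanal(vars, factors = 2)

# model-implied correlation matrix: R is approximated by FF' + diag(uniquenesses)
F <- fit$loadings
R_implied <- F %*% t(F) + diag(fit$uniquenesses)

# residuals should be small if two factors fit well
round(cor(vars) - R_implied, 2)
```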

Wikipedia uses the following diagram to show this decomposition or factorization:

The matrix V is our data matrix with 400 respondents by 30 touchpoints. A factorization simplifies V by reducing the number of columns from the 30 observed touchpoints to some smaller number of latent or hidden variables (e.g., two in our case since we have two pathways). We need to rotate the H matrix by 90 degrees so that it is easier to read, that is, 2x30 to 30x2. We do this by taking the transpose, which in R code is t(H).

[Table: the transposed coefficient matrix t(H), with the 30 touchpoints as rows and the two latent variables as columns. The numeric coefficients were lost; the surviving row labels were: Search engine, Price comparison, Hint from Expert, User forum, Banner or Pop-up, E-mail request, Packaging information, PoS promotion, Recommendation friends, Show window, Information at counter, Advertising entrance, Editorial newspaper, Consumer magazine, Ad in magazine, Personal advice, Information screen, Information display, Customer magazine, Catalog loyalty program, Offer loyalty card, Service hotline]

As shown above, I have labeled the columns to reflect their largest coefficients in the same way that one would name a factor in terms of its largest loadings. To continue with the analogy to factor analysis, the touchpoints in V are observed, but the columns of W and the rows of H are latent and named using their relationship to the touchpoints. Can we call these latent variables "parts," as Lee and Seung did in their 1999 article "Learning the Parts of Objects by Non-negative Matrix Factorization"? The answer depends on how much overlap between the columns you are willing to accept. When each row of t(H) contains only one large positive value and the remaining columns for that row are zero (e.g., Website in the third row), we can speak of latent parts in the sense that adding columns does not change the impact of previous columns but simply adds something new to the V approximation.

So in what sense is online or offline behavior a component or a part? There are 30 touchpoints. Why are there not 30 components? In this context, a component is a collection of touchpoints that vary together as a unit. We simulated the data using two different likelihood profiles. The argument called d in the sim.rasch function (see the R code at the end of this post) contains 30 values controlling the likelihood that the 30 touchpoints will be assigned a one. Smaller values of d result in higher probabilities that the touchpoint interaction will occur. The coefficients in each latent variable of H reflect those d values and constitute a component because the touchpoints vary together for 200 individuals. Put another way, the whole of 400 respondents contains two parts of 200 respondents each, each with its own response generation process.
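For readers who want to see what the factorization machinery actually does, here is a from-scratch sketch of the multiplicative update rules that Lee and Seung proposed for the Frobenius (least squares) objective. The NMF package implements this algorithm (and several variants) far more robustly; this toy version, with made-up data, only shows the mechanics:

```r
# toy Lee-Seung multiplicative updates for V ~ W %*% H with nonnegative factors
nmf_frobenius <- function(V, k, iters = 200, eps = 1e-9) {
  set.seed(1)
  W <- matrix(runif(nrow(V) * k), nrow(V), k)
  H <- matrix(runif(k * ncol(V)), k, ncol(V))
  for (i in seq_len(iters)) {
    # multiplicative updates keep every entry nonnegative
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

# a tiny "two pathway" binary matrix: 5 rows use the first 3 columns,
# 5 rows use the last 3 columns
V <- rbind(matrix(c(1, 1, 1, 0, 0, 0), 5, 6, byrow = TRUE),
           matrix(c(0, 0, 0, 1, 1, 1), 5, 6, byrow = TRUE))
fit <- nmf_frobenius(V, 2)
max(abs(V - fit$W %*% fit$H))  # reconstruction error shrinks toward zero
```

Because V is exactly rank two and nonnegative, the two rows of H recover the two column blocks, which is the "parts" behavior described above.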

The one remaining matrix, W, must be of size 400x2 (# respondents times # latent variables). So, we have 800 entries in W and 60 cells in H compared to the 12,000 observed values in V. W has one row for each respondent. Here are the rows of W for the 200th and 201st respondents, which is the dividing line between the two segments:
200 0.00015 0.00546
201 0.01218 0.00038
The numbers are small because we are factoring a data matrix of zeroes and ones. But the ratios of these two numbers are sizeable. The 200th respondent has an offline latent score (0.00546) more than 36 times its online latent score (0.00015), and the ratio for the 201st respondent is more than 32 in the other direction with online dominating. Finally, in order to visualize the entire W matrix for all respondents, the NMF package will produce heatmaps like the following with the R code basismap(fit, Rowv=NA).
As before, the first column represents online and the second offline. The first 200 rows are the offline respondents, our original Segment 1 (labeled basis 2), and the last 200 rows, our original Segment 2, were generated using the online response pattern (labeled basis 1). This type of relabeling or renumbering occurs over and over again in cluster analysis, so we must learn to live with it. To avoid confusion, I will repeat myself and be explicit.

Basis 2 is our original Segment 1 (Offliners).
Basis 1 is our original Segment 2 (Onliners).

As mentioned earlier, Segment 1 offline respondents had a higher classification accuracy (85.9% vs. 76.9%). This is shown by the more solid and darker red lines for the first 200 offline respondents in the second column (basis 2).
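The segment assignments behind these accuracy figures come straight from W: each respondent is assigned to whichever latent variable has the larger coefficient in their row (with the NMF package, basis(fit) returns W). A sketch using the two rows of W quoted earlier for respondents 200 and 201:

```r
# the two rows of W reported earlier (respondents 200 and 201)
W <- rbind(c(0.00015, 0.00546),   # offline latent score dominates
           c(0.01218, 0.00038))   # online latent score dominates

assigned <- apply(W, 1, which.max)  # 1 = online basis, 2 = offline basis
assigned                            # 2 1

# the ratios quoted in the text
round(W[1, 2] / W[1, 1], 1)  # 36.4
round(W[2, 1] / W[2, 2], 1)  # 32.1
```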

Consumer Improvisation Might Be Somewhat More Complicated

Introducing only two segments with predominantly online or offline product interactions was a simplification necessary to guide the reader through an illustrative example. Obviously, the consumer has many more components that they can piece together on their journey. However, the building blocks are not individual touchpoints but sets of touchpoints that are linked together and operate as a unit. For example, visiting a brand website creates opportunities for many different micro-journeys over many possible links on each page. Recurring website micro-journeys experienced by several consumers would be identified as latent components in our NMF analysis. At least, this is what I have found using NMF with touchpoint checklists from marketing research questionnaires.

R Code to Reproduce All the Analysis in this Post
# NOTE: the code in the original post was truncated. The d difficulty vectors,
# eight of the touchpoint labels, and parts of the plotting code below are
# clearly labeled stand-ins rather than the author's original values.

library(psych)  # sim.rasch()
library(NMF)    # nmf(), basis(), basismap()

# d controls how likely each touchpoint is to be checked (smaller = more likely).
# These particular values are illustrative assumptions, not the originals.
d_offline <- c(rep(3.0, 10), rep(0.5, 20))  # offliners favor the last 20 touchpoints
d_online  <- c(rep(0.5, 10), rep(3.0, 20))  # onliners favor the first 10 touchpoints

set.seed(4321)
offline <- sim.rasch(nvar=30, n=200, mu=-0.5, sd=0, d=d_offline)
online  <- sim.rasch(nvar=30, n=200, mu=-0.5, sd=0, d=d_online)

tp <- data.frame(rbind(offline$items, online$items))
segment <- rep(1:2, each=200)  # 1 = offline (first 200 rows), 2 = online

# Only 22 of the 30 touchpoint labels survived in the original post;
# generic placeholders keep the code runnable.
names(tp) <- c("Search engine", "Price comparison", "Hint from Expert",
               "User forum", "Banner or Pop-up", "E-mail request",
               "Online touchpoint 7", "Online touchpoint 8",
               "Online touchpoint 9", "Online touchpoint 10",
               "Packaging information", "PoS promotion",
               "Recommendation friends", "Show window",
               "Information at counter", "Advertising entrance",
               "Editorial newspaper", "Consumer magazine",
               "Ad in magazine", "Personal advice",
               "Information screen", "Information display",
               "Customer magazine", "Catalog loyalty program",
               "Offer loyalty card", "Service hotline",
               "Offline touchpoint 27", "Offline touchpoint 28",
               "Offline touchpoint 29", "Offline touchpoint 30")

seg_profile <- t(aggregate(tp, by=list(segment), FUN=mean))
# row 1 of seg_profile is the grouping variable; rows 2:31 are the touchpoints
plot(seg_profile[2:31, 1], ylim=c(0, max(seg_profile[2:31, ])), type="n",
    xlab="Touchpoints (First 10 Online/Last 20 Offline)", 
    ylab="Proportion Experiencing Touchpoint")
lines(seg_profile[2:31, 1], col="blue", lwd=2.5)
lines(seg_profile[2:31, 2], col="red", lwd=2.5)
legend("topright",
       c("Offline","Online"), lty=c(1,1),
       lwd=c(2.5,2.5), col=c("blue","red"))

rows <- rowSums(tp)  # respondents with no contacts are dropped below
tp_cluster <- kmeans(tp[rows>0,], 2, nstart=25)
table(segment[rows>0], tp_cluster$cluster)  # k-means segment recovery

fit <- nmf(as.matrix(tp[rows>0,]), 2, "frobenius")
table(segment[rows>0], apply(basis(fit), 1, which.max))  # NMF segment recovery
basismap(fit, Rowv=NA)


Wednesday, June 4, 2014

The Unavoidable Instability of Brand Image

"It may be that most consumers forget the attribute-based reasons why they chose or rejected the many brands they have considered and instead retain just a summary attitude sufficient to guide choice the next time."
This is how Dolnicar and Rossiter conclude their paper on the low stability of brand-attribute associations. Evidently, we need to be very careful how we ask the brand image question in order to get test-retest agreement over 50%. "Is the Fiat 500 a practical car?" Across all consumers, those that checked "Yes" at time one will have only a 50-50 chance of checking "Yes" again at time two, even when the time interval is only a few weeks. Perhaps, brand-attribute association is not something worth remembering since consumers do not seem to remember all that well.

In the marketplace a brand attitude, such as an overall positive or negative affective response, would be all that a consumer would need in order to know whether to approach or avoid any particular brand when making a purchase decision. If, in addition, a consumer had some way of anticipating how well the brand would perform, then the brand image question could be answered without retrieving any specific factual memories of the brand-attribute association. By returning the consumer to the purchase context, the focus is placed back on the task at hand and what needs to be accomplished. The consumer retrieves from memory what is required to make a purchase. Affect determines orientation, and brand recognition provides performance expectations. Buying does not demand a memory dump. Recall is selective. More importantly, recall is constructive.

For instance, unless I have tried to sell or buy a pre-owned car, I might not know whether a particular automobile has a high resale value. In fact, if you asked me for a dollar value, that number would depend on whether I was buying or selling. The buyer is surprised (as in sticker shock) by how expensive used cars can be, and the seller is disappointed by how little they can get for their prized possession. In such circumstances, when asked if I associate "high resale value" with some car, I cannot answer the factual question because I have no personal knowledge. So I answer a different, but easier, question instead. "Do I believe that the car has high resale value?" Respondents look inward and ask themselves, introspectively, "When I say 'The car has high resale value,' do I believe it to be true?" The box is checked if the answer is "Yes" or a rating is given indicating the strength of my conviction (feelings-as-information theory). Thus, perception is reality because factual knowledge is limited and unavailable.

How might this look in R?

A concrete example might be helpful. The R package plfm includes a data set with 78 respondents who were asked whether or not they associated each of 27 attributes with each of 14 European car models. That is, each respondent filled in the cells of a 14 x 27 table with the rows as cars and the columns as attributes. All the entries are zero or one identifying whether the respondent did (1) or did not (0) believe that the car model could be described with the attribute. By simply summing across the 78 different tables, we produce the aggregate cross-tabulation showing the number of respondents from 0 to 78 associating each attribute with each car model. A correspondence analysis provides a graphic display of such a matrix (see the appendix for all the R code).
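The aggregation step is just elementwise addition of the 78 individual tables. A sketch with randomly filled stand-in tables (the plfm data object is organized differently, so treat this only as the idea):

```r
set.seed(123)
# 78 hypothetical respondent tables, each 14 cars x 27 attributes of 0/1 entries
tables <- replicate(78, matrix(rbinom(14 * 27, 1, 0.3), 14, 27),
                    simplify = FALSE)

# summing across respondents gives the aggregate crosstab with counts from 0 to 78
freq <- Reduce(`+`, tables)
range(freq)
```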

Well, this ought to look familiar to anyone working in the automotive industry. Let's work our way around the four quadrants: Quadrant I Sporty, Quadrant II Economical, Quadrant III Family, and Quadrant IV Luxury. Another perspective is to see an economy-luxury dimension running from the upper left to the lower right and a family-sporty dimension moving from the lower left to the upper right (i.e., drawing a large X through the graph).

I have named these quadrants based only on the relative positions of the attributes by interpreting only the distances between the attributes. Now, I will examine the locations of the car models and rely only on the distances between the cars. It appears that the economy cars, including the partially hidden Fiat 500, fall into Quadrant II where the Economical attributes also appear. The family cars are in Quadrant III, which is where the Family attributes are located. Where would you be if you were the BMW X5? Respondents would be likely to associate with you the same attributes as the Audi A4 and the Mercedes C-class, so you would find yourself in the cluster formed by these three car models.

Why am I talking in this way? Why don't I just say that the BMW X5 is seen as Powerful and therefore placed near its descriptor? I have presented the joint plot from correspondence analysis, which means that we interpret the inter-attribute distances and the inter-car distances but not the car-attribute distances. It is a long story with many details concerning how distances are scaled (chi-square distances), how the data matrix is decomposed (singular value decomposition), and how the coordinates are calculated. None of this is the focus of this post, but it is so easy to misinterpret a perceptual map that some warning must be issued. A reference providing more detail might be helpful (see Figure 5c).

Using the R code at the end of this post, you will be able to print out the crosstab. Given the space limitation, the attribute profiles for only a few representative car models have been listed below. To make it easier, I have ordered the columns so that the ordering follows the quadrants: the Mazda MX5 is sporty, the Fiat 500 is city focused, the Renault Espace is family oriented, and the BMW X5 is luxurious. When interpreting these frequencies, one needs to remember that it is the relative profile that is being plotted on the correspondence map. That is, two cars with the same pattern of high and low attribute associations would appear near each other even if one received consistently higher mentions. You should check for yourself, but the map seems to capture the relationships between the attributes and the cars in the data table (with the exception of Prius to be discussed next).

Mazda MX5 Fiat 500 Renault Espace BMW X5 VW Golf Toyota Prius
Sporty 65 8 1 47 29 8
Nice design 40 35 17 31 20 9
Attractive 39 40 12 36 33 10
City focus 9 58 5 1 30 26
Agile 22 53 9 15 40 10
Economical 3 49 17 1 29 42
Original 22 37 7 8 5 19
Family Oriented 1 3 74 41 12 39
Practical 6 39 52 23 44 16
Comfortable 12 6 47 46 27 23
Versatile 5 5 39 30 25 21
Luxurious 28 6 10 58 12 11
Powerful 37 1 9 57 20 9
Status symbol 39 12 6 51 23 16
Outdoor 13 1 20 46 6 4
Safe 4 5 23 40 40 19
Workmanship 13 3 4 28 14 19
Exclusive 17 14 3 19 0 8
Reliable 17 11 17 38 58 27
Popular 5 24 27 13 55 10
Sustainable 8 7 18 19 43 29
High trade-in value 4 3 0 36 41 4
Good price-quality ratio 11 20 15 7 30 21
Value for the money 9 7 12 8 24 10
Environmentally friendly 6 32 7 2 20 51
Technically advanced 17 2 6 32 10 46
Green 0 10 2 2 6 36
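The point about relative profiles can be made concrete: correspondence analysis compares each car's row profile, its attribute counts divided by its row total, so two cars with the same pattern but very different total mentions have identical profiles and would land at the same spot on the map. Toy numbers for illustration:

```r
freq <- rbind(car_a = c(60, 30, 10),
              car_b = c( 6,  3,  1))  # same pattern, one tenth the mentions

# divide each row by its row total to get the row profiles
profiles <- sweep(freq, 1, rowSums(freq), "/")
profiles                                             # both rows: 0.6 0.3 0.1
all.equal(profiles["car_a", ], profiles["car_b", ])  # TRUE
```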

Now, what about Prius? I have included in the appendix the R code to extract a third dimension and generate a plot showing how this third dimension separates the attributes and the cars. If you run this code, you will discover that the third dimension separates Prius from the other cars. In addition, Green and Environmentally Friendly can be found nearby, along with "Technically Advanced." You can visualize this third dimension by seeing Prius as coming out of the two-dimensional map along with the two attributes. This allows us to maintain the two-dimensional map with Prius "tagged" as not as close to VW Golf as shown (e.g., shadowing the Prius label might add the desired 3D effect).

The Perceptual Map Varies with Objects and Features

What would have happened had Prius not been included in the association task? Would the Fiat 500 have been seen as more environmentally friendly? The logical response is to be careful about which cars to include in the competitive set. However, the competitive set is seldom the same for all car buyers. For example, two consumers are considering the same minivan, but one is undecided between the minivan and a family sedan and the other is debating between the minivan and an SUV. Does anyone believe that the comparison vehicle, the family sedan or the SUV, will not impact the minivan perceptions? The brand image that I create in order to complete a survey is not the brand image that I construct in order to make a purchase. The correspondence map is a spatial representation of this one particular data matrix obtained by recruiting and surveying consumers. It is not the brand image.

As I have outlined in previous work, brand image is not simply a network of associations evoked by a name, a package, or a logo. Branding is a way of seeing, or as Douglas Holt describes it, "a perceptual frame structuring product experience." I used the term "affordance" in my earlier post to communicate that brand benefits are perceived directly and immediately as an experience. Thus, brand image is not a completed project, always stored in memory and waiting to be retrieved to fill in our brand-attribute association matrix. Like preference, brand image is constructed anew to complete the task at hand. The perceptual frame provides the scaffolding, but the specific requirements of each task will have unique impacts, and instability is unavoidable.

Even if we attempt to keep everything the same at two points in time, the brand image construction process will amplify minor fluctuations and make it difficult for an individual to reproduce the same set of responses each time. However, none of this may impact the correspondence map for we are mapping aggregate data, which can be relatively stable even with considerable random individual variation. Yet, such instability at the individual level must be disturbing for the marketer who believes that brand image is established and lasting rather than a construction adapting to the needs of the purchase context.

The initial impulse is to save brand image by adding constraints to the measurement task in order to increase stability. But this misses the point. There is no true brand image to be measured. We would be better served by trying to design measurement tasks that mimic how brand image is constructed under the conditions of the specific purchase task we wish to study. The brand image that is erected when forming a consideration set is not the brand image that is assembled when making the final purchase decision. Neither of these will help us understand the role of image in brand extensions. Adaptive behavior is unstable by design.

Appendix with R code:

library(plfm)    # contains the car data: 78 respondents, 14 car models, 27 attributes
library(anacor)  # correspondence analysis

data(car)
# car$freq1 is the aggregated 14 x 27 car-by-attribute frequency table;
# printing it reproduces the crosstab discussed in the post
car$freq1

ca <- anacor(car$freq1)
plot(ca, conf=NULL)

# extract a third dimension and plot dimensions 1 and 3 to separate the Prius
ca3 <- anacor(car$freq1, ndim=3)
plot(ca3, plot.dim=c(1,3), conf=NULL)