Wednesday, April 15, 2015

Recommending Recommender Systems When Preferences Are Not Driven By Simple Features

Why does lifting out a slice make the pizza appear more appealing?
We can begin our discussion with the ultimate feature bundle - pizza toppings. Technically, a menu would only need to list all the toppings and allow the customers to build their own pizza. According to utility maximization, choice is a simple computation with the appeal of any pizza calculated as a function of the utilities associated with each ingredient. This is the conjoint assumption that the attraction to the final product can be reduced to the values of its features. Preference resides in the features, and the appeal of the product is derived from its particular configuration of feature levels. 

Yet, pizza menu design is an active area of discussion with many websites offering advice. In practice, listing ingredients does not generate the sales that are produced by the above photo or even a series of suggestive names with associated toppings. Perhaps this is why we see the same classic combinations offered around the world, as shown below in this menu from Bali.


I am not denying that ingredients matter; a vegetarian does not want ham on their pizza. But what is the learning process? Does preference formation begin with taste tests of the separate toppings heated in a very hot oven? "Oh, I like the taste of roasted black olives. Let me try that on my pizza." No, we learn what we like by eating pizzas with topping combinations that appear on menus because others before us have purchased such configurations in sufficient quantity and at a profitable price for the restaurant. Moreover, we should not forget that we learn what we like in the company of others who are more than happy to tell us what they like and that we should like it too.

The expression "what grows together goes together" suggests that tastes are acquired initially by pairing what is locally available in abundance. It is the entire package that attracts us, which explains why pizza seems more appealing in the above photo. If we were of an experimental mindset, we might systematically vary the toppings one pizza at a time and reach some definitive conclusions concerning our tastes. However, it is more likely that consumers invent or repeat a rationale for their behavior based on minimal samplings. That is, consumer inference may function more like superstition than science, less like Galileo and more like an establishment with other interests to be served. At least we should consider the potential impact of the preference formation process and keep our statistical models open to the possibility that consumers go beyond simple features when they represent products in thought and memory (see Connecting Cognition and Consumer Choice).

Simply stated, you get what you ask for. Features become the focus when all you have are choice sets whose alternatives are the cells of an optimal experimental design. Clearly, the features change over repeated choice sets, and the consumer responds to such manipulations. If we are careful and mimic the marketplace, our choice modeling will be productive. I have tried to illustrate in previous posts how R can generate choice designs and provide individual-level hierarchical Bayes estimates, along with some warnings about overgeneralizing. For example, choice modeling works just fine when the purchase context is label comparisons among a few different products sitting on the retail shelf.

Instead, what if we showed actual pizza menus from a large number of different restaurants? Does this sound familiar? What if we replace pizzas with movies or songs? This is the realm of recommender systems where the purchase context is filled with lots of alternatives arrayed across a landscape of highly correlated observed features generated by a much smaller set of latent features. We have entered the world of representation learning. Choice modeling delivers when the purchase context requires that we compare products and trade off features. Recommendation systems, on the other hand, thrive when there is more available than any one person can experience (e.g., the fragmented music market).

To be clear, I am using the term "recommendation systems" because they are familiar and we all use them when we shop online or search the web. Actually, any representational system that estimates missing values by adding latent variables will do nicely (see David Blei for a complete review). However, since reading a description of the computation behind the winning of the Netflix Prize, I have relied on matrix factorization as a less demanding approach that still yields substantial insight with marketing research data. In particular, the NMF package in R offers such an easy to use interface to nonnegative matrix factorization that I have used it over and over again with repeated success. You will find a list of such applications at the end of my post on Brand and Product Category Representation.
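Although the post works with the R package NMF, the same factorization can be sketched in Python with scikit-learn's NMF class. The toy user-by-item ratings matrix and the choice of two latent features below are invented purely for illustration:

```python
# Sketch of matrix factorization for recommender-style data, analogous in
# spirit to the R NMF package discussed in the post. The ratings matrix V
# and the rank of 2 are made-up illustrations.
import numpy as np
from sklearn.decomposition import NMF

# Toy user-by-item matrix: two latent "tastes" generate the observed columns.
V = np.array([
    [5, 4, 5, 0, 0, 1],
    [4, 5, 4, 1, 0, 0],
    [0, 1, 0, 5, 4, 5],
    [1, 0, 0, 4, 5, 4],
], dtype=float)

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # user-by-latent weights (soft clustering of rows)
H = model.components_        # latent-by-item weights (loadings of columns)

approx = W @ H               # low-rank reconstruction of the full matrix
```

The reconstruction `W @ H` fills in every cell, including those a user never rated, which is the sense in which such a representational system "estimates missing values by adding latent variables."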






Friday, April 10, 2015

Modeling Categories with Breadth and Depth

Religion is a categorical variable with followers differentiated by their degree of devotion. Liberals and conservatives check their respective boxes when surveyed, although moderates from each group sometimes seem more alike than their more extreme compatriots. All Smartphone users might be classified as belonging to the same segment, yet the infrequent user is distinct from the intense user who cannot take their eyes off the screen. Both of us have the flu, but you are really sick. My neighbor and I belong to a cluster called dog-owners. However, my dog is a pet on a strict allowance and theirs is a member of the family with no apparent limit on expenditures. There seems to be more structure underlying such categorizations than is expressed by a zero-one indicator of membership.

Plutchik's wheel of emotion offers a concrete example illustrating the breadth and depth of affective categories spanning the two dimensions defined by positive vs. negative and active vs. passive. The concentric circles in the diagram below show the depth of the emotion with the most intense toward the center. Loathing, disgust and boredom suggest an increasing activation of a similar type of negative response between contempt and remorse. Breadth is varied as one moves around the circle traveling through the entire range of emotions. When we speak of opposites such as flight (indicated in green as apprehension, fear and terror) or fight (the red emotions of annoyance, anger and rage), we are relating our personal experiences with two contrasting categories of feelings and behaviors. Yet, there is more to this categorization than is expressed by a set of exhaustive and mutually exclusive boxes, even if the boxes are called latent classes.



The R statistical language approaches such structured categories from a number of perspectives. I have written two posts on archetypal analysis. The first focused on the R package archetypes and demonstrated how to use those functions to model customer heterogeneity by identifying the extremes and representing everyone else as convex combinations of those end points. The second post argued that we tend to think in terms of contrasting ideals (e.g., liberal versus conservative) so that we perceive entities to be more different in juxtaposition than they would appear on their own.

Latent variable mixture models provide another approach to the modeling of unobserved constructs with both type and intensity, at least for those interested in psychometrics in R. The intensity measure for achievement tests is item difficulty, and differential item functioning (DIF) is the term used by psychometricians to describe items with different difficulty parameters in different groups of test takers. The same reasoning applies to customers who belong to different segments (types) seeking different features from the same products (preference intensity). These are all mixture models in the sense that we cannot estimate one set of parameters for the entire sample because the data are a mixture of hidden groups with different parameters. R packages with the prefix or suffix "mix" in their titles (e.g., mixRasch or flexmix) signal such mixture modeling.
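The core idea of a mixture model can be sketched in a few lines. Here is a minimal Python illustration using scikit-learn's GaussianMixture, standing in for the R "mix" packages named above; the two hidden segments and their preference-intensity means are simulated, not taken from the post:

```python
# Minimal mixture-model sketch: one set of parameters will not fit the whole
# sample because the data are a mixture of two hidden segments with different
# mean preference intensities. All numbers here are simulated illustrations.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
segment_a = rng.normal(loc=-2.0, scale=0.5, size=(100, 1))  # hidden group 1
segment_b = rng.normal(loc=2.0, scale=0.5, size=(100, 1))   # hidden group 2
X = np.vstack([segment_a, segment_b])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)                # recovered segment membership
means = sorted(gm.means_.ravel())     # estimated group means, near -2 and +2
```

Fitting a single Gaussian to X would put the mean near zero, where almost no respondent actually sits; the mixture recovers both the types (labels) and their separate parameters (means).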

R can also capture the underlying organization shaping categories through matrix factorization. This blog is filled with posts demonstrating how the R package NMF (nonnegative matrix factorization) easily decomposes a data matrix into the product of two components: (1) something that looks like factor loadings of the measures in the columns and (2) something similar to a soft or fuzzy clustering of respondents in the rows. Both these components will be block diagonal matrices when we have the type of separation we have been discussing. You can find examples of such matrix decompositions in a listing of previous posts at the end of my discussion of brand and product representation.
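The decomposition and its block-diagonal structure can be written out by hand. Below is a NumPy sketch of the standard Lee-Seung multiplicative updates for V ≈ WH; the block-diagonal data matrix and rank are invented for the example and are not from any post:

```python
# Hand-rolled nonnegative matrix factorization (Lee-Seung multiplicative
# updates) showing the co-clustering described above. The data matrix is a
# made-up block-diagonal example: rows 0-2 use only features 0-3, and rows
# 3-5 use only features 4-7.
import numpy as np

rng = np.random.default_rng(0)
V = np.zeros((6, 8))
V[:3, :4] = np.outer(rng.uniform(1, 2, 3), rng.uniform(1, 2, 4))
V[3:, 4:] = np.outer(rng.uniform(1, 2, 3), rng.uniform(1, 2, 4))

k = 2
W = rng.uniform(0.1, 1.0, (6, k))
H = rng.uniform(0.1, 1.0, (k, 8))
for _ in range(500):
    # Multiplicative updates keep W and H nonnegative while reducing ||V - WH||
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

row_cluster = W.argmax(axis=1)   # soft clustering of respondents (rows)
col_cluster = H.argmax(axis=0)   # grouping of measures (columns)
```

With clean separation, each latent feature picks up one block: the rows of W act like a fuzzy clustering of respondents, and the columns of H act like loadings that group the measures, exactly the two components described above.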

Consumer segments are structured by their use of common products and features in order to derive similar benefits. This is not an all-or-none process but a matter of degree specified by depth (e.g., usage frequency) and breadth (e.g., the variety and extent of feature usage). You can select almost any product category, and at one end you will find heavy users doing everything that can possibly be done with the product. As usage decreases, it falls off in clumps with clusters of features no longer wanted or needed. These are the latent features of NMF that simultaneously bind together consumers and the features they use. For product categories such as movies or music, the same process applies but now the columns are films seen or recordings heard. All of this may sound familiar to those of you who have studied recommendation systems or topic modeling, both of which can be run with NMF.

Friday, March 20, 2015

What Consumers Learn Before Deciding to Buy: Representation Learning

Features form the basis for much of our preference modeling. When asked to explain one's preferences, features are typically accepted as appropriate reasons: this job paid more, that candidate supports tax reform, or it was closer to home. We believe that features must be the drivers since they so easily serve as rationales for past behavior. Choice modeling formalizes this belief by assuming that products and services are feature bundles with the value of the bundle calculated directly from the utilities of its separate features. All that we need to know about a product or service can be represented as the intersection of its features, which is why it is called conjoint analysis.

At first, this approach seems to work, but it does not scale well. We create hypothetical products and services defined by the cells in a factorial experimental design (see the book Stated Preference Methods Using R). The number of cells increases multiplicatively with each additional feature, so that we need to turn to optimal designs in R in order to limit the number of possible combinations. We have reduced the number of hypothetical descriptions, while the number of estimated parameters remains unchanged. Overall preference continues to be an additive function of the values attributed to each of the separate components.
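The scaling problem is easy to make concrete. A short Python calculation with an invented attribute list (the attribute names and level counts below are hypothetical, not from the post) shows how quickly a full factorial grows:

```python
# Counting cells in a full factorial choice design. The attributes and their
# numbers of levels are made-up illustrations.
from math import prod

levels = {"brand": 4, "price": 5, "size": 3, "warranty": 2, "color": 4}
cells = prod(levels.values())   # 4 * 5 * 3 * 2 * 4 = 480 hypothetical products

# Adding one more attribute with 3 levels multiplies the design again:
cells_with_extra = cells * 3    # 1440 cells
```

Five modest attributes already demand 480 distinct product descriptions, and each new attribute multiplies the count, which is why fractional factorial and optimal designs become necessary.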

Representation learning, on the other hand, is associated with deep neural networks, such as the h2o package discussed by John Chambers at the useR! 2014 conference. According to Yoshua Bengio (see his new chapter on Distributed Representations), "a good representation is one that makes further learning tasks easy." The process is described in his first chapter on Deep Learning. As shown in this figure from Wikipedia, the observed features are visible units and the product representation is a transformation contained in hidden units.


What do consumers learn before deciding to buy? They learn a representational structure that reduces the complexity of the purchase process. This learning comes relatively easy with so many sources telling us what to look for and what to buy (e.g., marketing communications, professional reviews, social media and of course, friends and family). Bengio speaks of evolving culture vs. local minima as the process for "brain to brain transfer of information." Others refer to it as a meeting of minds or shared conceptualizations.

Are you thinking about a Smart Watch? Representation learning would suggest that the first step is "getting the lay of the land" or untangling the sources of variation accounting for differences among the offerings. I outlined such an approach in my last post on precursors to preference construction. It is possible to go online and request side-by-side feature comparisons that look similar to what one might find in choice modeling. However, that step is often late in the process after you have decided to purchase and have narrowed your consideration set. Before that, one looks at pictures, scans specifications, reads reviews and learns from others through user comments. One discovers what is available and what benefits are delivered. As you learn what is offered, you come to understand what you might want and be willing to spend.

The purchase task is somewhat easier than language translation or facial recognition because product categories are marketing creations with a deliberately simplified structure. Products and services are simple by design with benefits and features linked together and set to music with a logo and a tagline. Product and service features are observed (red in the above figure); benefits are latent or hidden features (the blue) and can be extracted with deep neural networks or nonnegative matrix factorization. That is, we can think of representation learning as the relatively slow unsupervised learning that occurs early in the decision process and makes later learning and decision making easier and faster. Utility theory lacks the expressive power to transform the input into new ways of seeing. Both deep neural networks and nonnegative matrix factorization free us to go beyond the information given.

Finally, what happens when the consumer is pulled out of the purchase context and presented feature lists constructed according to a fractional factorial or optimal design? The norms of the marketplace are violated, yet respondents get through the task the best they can using the only information that you have provided them. Unfortunately, you do not learn much about bears in the wild when they are confined in cages.




Thursday, March 5, 2015

Brand and Product Category Representation: Precursors to Preference Construction

Evidently, preference is contextual, or so The Hershey's Company claims in their advertising. I agree and will not repeat the argument made in a previous post on incorporating preference construction into the choice modeling process.

Within the framework of utility theory and conjoint analysis, R provides both an introduction (Stated Preference Methods Using R) and access to advanced algorithms (hierarchical Bayes choice modeling). However, generalization remains a problem. The experimental procedures that elicit stated preference are not the same as those in the marketplace where purchases are made for differing occasions, purposes and participants. Preferences are not well-formed and stable, but constructed on the fly within the choice context. Even price sensitivity depends on framing, which is why we see such robust and resistant order effects when costs are increasing versus decreasing (e.g., a 10% price increase seems less objectionable when it comes after a proposed 15% increment than when it comes after a 5% raise).

Such context dependence is the reason why so many of us who use choice modeling in marketing research seek to limit the number of attributes and their associated levels and demand that the experimental arrangements mimic as closely as possible the actual purchase process. But even with such restrictions, the repetition from presenting several choice sets leads consumers to focus on what is varied and induces sensitivities that would not be found in the marketplace where these attribute levels would be constant or difficult to find. Moreover, our designs attempt to keep attributes independent so that we can estimate separate effects for every attribute. Yet, customers enter with conceptual structures that link these attributes (e.g., larger quantities are discounted and premium brands cost more). Do we disrupt the purchase process when we ignore such shared conceptual spaces?

Consumers learn quite a lot about a product category before they decide to purchase anything. The SmartWatch will serve as a good example because it is relatively new and still evolving. The name "SmartWatch" invites us to transfer what we know about SmartPhones and their relationship to cell phones. There have to be brands (where's Apple?) and alternative versions running from basic to premium (good-better-best). Considerers will be talking to others and reading reviews telling them which device is best for their individual usage and needs. This is the product representation that one learns in order to decide to enter a product category. You answer the question "Do I really need or want a SmartWatch?" by learning what is available and deciding what you are willing to spend to obtain it. When we enter the marketplace, we enter with this shared representation and we trade off specific features or pricing offers within this understanding. Those of you with machine learning backgrounds might wish to think of this as a form of unsupervised feature learning.

R provides the interface for representation learning about brands and product categories. Although one has a number of alternatives, I will keep it simple and discuss only one approach, nonnegative matrix factorization (NMF). I am thinking of feature or representation learning as a form of data reduction or manifold learning as outlined in Section 8 of the Yoshua Bengio et al. review paper. Consumers populate the rows of the data matrix, and the columns might span brand and feature familiarity, benefits and features sought, or expected usage. It is easy to generate a long list of columns just for features alone. Moreover, features are linked to benefits, and both features and benefits sought flow from usage. Obviously, the consumer requires a simpler representation and NMF supplies the building blocks.

Diving into the details, a potential customer wanting to use the SmartWatch in their fitness program would attend to and know about features related to their intended usage. Would they be likely to remember a bunch of specific features, or would they learn what features were standard on the basic version and what features were extras on the more premium models? Brand affordance organizes perceptions along a continuum with different features at the lower and higher ends of the scale. Simultaneously, consumers are differentiated along with the features, for example, some SmartWatch prospects will be interested only in convenience and discretion. The co-clustering produced by matrix factorization provides the underlying representation of both consumers grouped by the benefits and features they seek and those same benefits and features clustered because they are sought by the same consumers.

The R package NMF supplies the interface and several ways to display the results, as I have shown in previous posts:



Monday, January 26, 2015

Wine for Breakfast: Consumption Occasion as the Unit of Analysis

If the thought of a nice Chianti with that breakfast croissant is not that appealing, then I have made my point: occasion shapes consumption. Our tastes have been fashioned by culture and shared practice. Yet, we often ignore the context and run our analyses as if consumers were not nested within situations. Contextual effects are attributed to the person, who is treated as both the unit of observation and the unit of analysis.

Obviously, it would be difficult to interview the occasion. We need informants to learn about wine occasions. Thus, we seek out consumers to tell us when and where they drink what kinds of wines by themselves and with others. Even if one knows little about wine etiquette, the situation imposes such strong constraints that it makes sense to treat the consumption occasion as the unit of analysis. The person serves as the measuring instrument, but the focus is on the determining properties of the occasion.

Continuing with our example, there is a broad range of red and white wine varietals that can be purchased in varying containers from a number of different retailers and served in various locations with a diversity of others. The list is long, and it is unlikely that we can ask for the details for more than a couple of consumption occasions before we fatigue our respondents. Yet, it is the specifics that we seek, including the benefits sought and the features considered.

Clearly, there is a self-selection process so that we would expect to find certain types of individuals in each situation. However, the consumption occasion imposes its own rules over and above any selection effect. Therefore, we would anticipate that whatever the reasons for your presence, the occasion will dictate its own norms. In the end, it is reasonable to aggregate the responses of everyone reporting on each consumption occasion and run the analysis with those aggregate responses as the rows. The columns are formed using all the data gathered about the occasion.

And It Isn't Just About Wine and Breakfast (Benefit Structure Analysis)

There are occasions when you use your smartphone to take pictures. If you were thinking about purchasing a new smartphone, you would consider camera ease of use and picture quality, remembering those low-light photos that were out of focus and those sunsets where the sun is a blur. Usage occasion seems to impact almost every purchase. You pick your parents up at the airport, so you need four doors, preferably with easy access to the rear seats. Usage is so important that the website or the salesperson always asks how you intend to use your new acquisition. Context matters whatever you buy (e.g., a washing machine, a garden hose, clothes, cosmetics, sporting equipment, and suitcases).

The goal is to uncover the major sources of variation differentiating among all the consumption occasions. Product differentiation and customer segmentation originate in the usage context. Since opportunities for increased profitability are found in the details, let's pretend we are journalists and ask who, what, where, when, why, and how. These six questions alone can generate a lot of rows. For instance, we obtain some 15,625 possible combinations when we suppose that the answers to each of the six questions can be classified into one of five categories (15,625 = 5x5x5x5x5x5). Of course, most of these rows will be empty because the responses to the six questions are not independent. Yet, 10% is still over 1500 rows, even if many of those rows will be sparse with zero or very small frequencies. Finally, the columns can contain any information collected about the consumption occasions in the rows, though one would expect inquiries concerning benefits sought and features preferred.
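The combinatorics above are easy to verify. A short Python sketch enumerates the candidate occasion rows; the five answer categories per question are unlabeled placeholders, since the post does not specify them:

```python
# Enumerating the candidate occasion rows from the six journalist questions,
# each with five (placeholder) answer categories, as described above.
from itertools import product

questions = ["who", "what", "where", "when", "why", "how"]
categories = range(5)   # five unnamed answer categories per question

occasions = list(product(categories, repeat=len(questions)))
n_rows = len(occasions)          # 5**6 = 15,625 candidate rows
ten_percent = n_rows // 10       # still over 1,500 rows if only 10% occur
```

In practice one would tabulate only the combinations actually reported by respondents, which is why most of the 15,625 cells stay empty.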

Now, we have a large matrix revealing the linkages between many specific occasions and a wide range of benefits and features. It might be helpful to revisit the work on Benefit Structure Analysis from the 1970s in order to see how others have analyzed such a matrix. In Exhibit 5 from that Journal of Marketing article, we are presented with a matrix of 51 benefits wanted across 21 cleaning tasks. The solution was a simultaneous row and column linkage analysis, which seems similar to the biclustering that one would achieve today with nonnegative matrix factorization (NMF). As noted in the article, when cleaning furniture, the respondents desired products that removed dust, dirt and film without leaving residues or scratches. On the one hand, there appears to be a structure underlying the cleaning tasks revealed by their shared benefits. On the other hand, the benefits are clustered together by their common association with similar cleaning tasks.

Following that line of reasoning, we can simulate a data matrix by specifying a set of common latent features linking the occasions and the benefits. As outlined in a prior post, the data generating process is an additive superpositioning of building blocks formed by the occasion-benefit linkages. We can begin with some product, for example, coffee. When do we drink coffee, and why do we drink it? Even the shortest list would include starting the day (occasion) in order to jump-start the brain (benefit). Is this a building block? If there were a sizable cohort of first-of-the-day kickstarters who did not drink coffee for the same reasons at other occasions, then we would have a building block.

The data matrix tells us what benefits are sought in each occasion. Neither the occasions nor the benefits are independent. There are times and places when specialty coffee replaces our regular cup. What occasions come to mind when you think about iced or frozen blended coffees? To help us understand this process, I have reproduced a figure from an earlier post.

The associations between the ten occasions labeled A to J and the seven benefits numbered 1 to 7 are indicated by filled squares in Section a. The rows and columns are interchanged as we move from Sections b to c until we see the building blocks in Section d. The solid black and white squares do not show the shades of gray indicating the degree to which coffee drinkers demand the benefit in each occasion. Specifically, Benefit 6 is wanted both in Occasions A, C and H and in Occasions D, G, I and E. However, drinkers are likely not equally demanding in the two sets of occasions. For example, coffee that starts the day must energize, but the coffee in the afternoon might be primarily a break or a low calorie refreshment. In both cases we are seeking stimulation, just not as much in the afternoon as with the first cup of the day.

Benefit structure analysis remains a critical component in any marketing plan. Opportunity is found in the white spaces where benefits are not delivered by the current offerings. Case studies and qualitative research findings fill the business shelves of online and retail book sellers. Now, advances in statistical modeling enable us to inquire at the deep level of detail that drives consumer product purchases. The R code needed to simultaneously cluster the rows and columns of such data matrices has been provided in a series of previous posts on music, cosmetics, personality inventories, scotch whiskey, feature usage, and the consumer purchase journey.

Sunday, January 11, 2015

Some Applications of Item Response Theory in R

The typical introduction to item response theory (IRT) positions the technique as a form of curve fitting. We believe that a latent continuous variable is responsible for the observed dichotomous or polytomous responses to a set of items (e.g., multiple choice questions on an exam or rating scales from a survey). Literally, once I know your latent score, I can predict your observed responses to all the items. Our task is to estimate that function with one, two or three parameters after determining that the latent trait is unidimensional. In the process of measuring individuals, we gather information about the items. Those one, two or three parameters are assessments of each item's difficulty, discriminability and sensitivity to noise or guessing.
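The curve being fit has a simple closed form. Here is a Python sketch of the two-parameter logistic (2PL) item characteristic curve; the theta, a, and b values are illustrative numbers, not estimates from any data set:

```python
# The two-parameter logistic (2PL) IRT curve: probability of a favorable
# response given latent trait theta, item discrimination a, and item
# difficulty b. All parameter values below are illustrative.
import math

def p_2pl(theta, a, b):
    """P(response = 1 | theta) under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When the latent trait equals the item difficulty, the probability is 0.5.
p_at_difficulty = p_2pl(theta=0.0, a=1.5, b=0.0)

# Easier items (lower b) are endorsed more often by the same person.
easier = p_2pl(theta=0.0, a=1.5, b=-1.0)
harder = p_2pl(theta=0.0, a=1.5, b=1.0)
```

The one-parameter (Rasch) model fixes a at a common value, and the three-parameter model adds a lower asymptote for guessing, which is the "sensitivity to noise or guessing" parameter mentioned above.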

All this has been translated into R by William Revelle, and as a measurement task, our work is done. We have an estimate of each individual's latent position on an underlying continuum defined as whatever determines the item responses. Along the way, we discover which items require more of the latent trait in order to achieve a favorable response (e.g., the difficulty of answering correctly or the extremity of the item and/or the response). We can measure ability with achievement items, political ideology with an opinion survey, and brand perceptions with a list of satisfaction ratings.

To be clear, these scales are meant to differentiate among individuals. For example, the R statistical programming language has an underlying structure that orders the learning process so that the more complex concepts are mastered after the simpler material. In this case, learning is shaped by the difficulty of the subject matter with the more demanding content reusing or building onto what has already been learned. When the constraints are sufficient, individuals and their mastery can be arrayed on a common scale. At one end of the continuum are complex concepts that only the more advanced students master. The easier stuff falls toward the bottom of the scale with topics that almost everyone knows. When you take an R programming achievement test, your score tells me how well you performed relative to others who answered similar questions (see norm-referenced testing).

The same reasoning applies to IRT analysis of political ideology (e.g., the R package basicspace). Opinions tend to follow a predictable path from liberal to conservative so that only a limited number of all possible configurations are actually observed. As shown below, legislative voting follows such a pattern with Senators (dark line) and Representatives (light line) separated along the liberal-to-conservative dimension based on their votes in the 113th Congress. Although not shown, all the specific votes can also be placed on this same scale so that Pryor, Landrieu, Baucus and Hagan (in blue) are located toward the right because their votes on various bills and resolutions agreed more often with Republicans (in red). As with achievement testing, an order is imposed on the likely responses so that the response space in p dimensions (where p equals the number of behaviors, items or votes) is reduced to a one-dimensional seriation of both votes and voters on the same scale.

My last example comes from marketing research where brand perceptions tend to be organized as a pattern of strengths and weaknesses defined by the product category. In a previous post, I showed how preference for Subway fast food restaurants is associated with a specific ordering of product and service attribute ratings. Many believe that Subway offers fresh and healthy food. Fewer like the taste or feel it is filling. Fewer still are happy with the ordering or preparation, and even more dislike the menu and the seating arrangements. These perceptions have an order so that if you are satisfied with the menu then you are likely to be satisfied with the taste and the freshness/healthiness of the food. Just as issues can be ordered from liberal to conservative, brand perceptions reflect the strengths and weaknesses promised by the brand's positioning. Subway promises fresh and healthy food but not prepackaged and waiting under the heat lamp for easy bagging. The mean levels of our satisfaction ratings will be consistent with those brand priorities.

We can look at the same data from another perspective. Heatmaps summarize the triangular pattern observed in data matrices that can be modeled by IRT. In a second post analyzing the Subway data, I described the following heatmap showing the results from the 8-item checklist of features associated with the brand. Each row is a different respondent with the blue indicating that the item was checked and red telling us that the item was not checked. As one moves down the heatmap, the overall perceptions become more positive as additional attributes are endorsed. Positive brand perceptions are incremental, but the increments are not more of the same. Tasty and filling gets added to healthy and fresh. That is, greater satisfaction with Subway is reflected in the willingness to endorse additional components of the brand promise. The heatmap is triangular so that those who are happy with the menu are likely to be at least as satisfied with all the attributes to the right.
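The triangular pattern described above is a Guttman-like structure: endorsing a harder item implies endorsing every easier item. A small Python sketch with an invented checklist matrix (the real Subway data are not reproduced here) makes the check explicit:

```python
# Sketch of the triangular (Guttman-like) endorsement pattern described in
# the heatmap discussion. The 4-respondent, 5-item checklist is a made-up
# illustration with items ordered hardest (left) to easiest (right).
import numpy as np

checklist = np.array([
    [0, 0, 0, 1, 1],   # endorses only the two easiest items
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],   # endorses everything, including the hardest item
])

def is_guttman(row):
    """Once an item is checked, every easier item to its right is checked."""
    seen = False
    for v in row:
        if seen and not v:
            return False
        seen = seen or bool(v)
    return True

all_triangular = all(is_guttman(r) for r in checklist)
```

Rows sorted by their totals then produce exactly the triangular heatmap: perceptions become more positive down the matrix, and each increment adds a new attribute rather than more of the same one.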

Monday, December 22, 2014

Contextual Measurement Is a Game Changer




Adding a context can change one's frame of reference:

Are you courteous? 
Are you courteous at work? 





Decontextualized questions tend to activate a self-presentation strategy and retrieve memories of past positioning of oneself (impression management). Such personality inventories can be completed without ever thinking about how we actually behave in real situations. The phrase "at work" may disrupt that process if we do not have a prepared statement concerning our workplace demeanor. Yet, a simple "at work" may not be sufficient, and we may be forced to become more concrete and operationally define what we mean by courteous workplace behavior (performance appraisal). Our measures are still self-reports, but the added specificity requires that we relive the events described by the question (episodic memory) rather than providing inferences concerning the possible causes of our behavior.

We have such a data set in R (verbal in the difR package). The data come from a study of verbal aggression triggered by some event: (S1) a bus fails to stop for me, (S2) I miss a train because a clerk gave faulty information, (S3) the grocery store closes just as I am about to enter, or (S4) the operator disconnects me when I have used up my last 10 cents on a call. Obviously, the data were collected during the last millennium, when there were still phone booths, but the final item can be updated as "The automated phone support system disconnects me after working my way through the entire menu of options" (which seems even more upsetting than the original wording).

Alright, we are angry. Now we can respond by shouting, scolding, or cursing, and these verbally aggressive behaviors can be real (do) or fantasy (want to). The factorial combination of 4 situations (S1, S2, S3, and S4) by 2 behavioral modes (Want and Do) by 3 actions (Shout, Scold, and Curse) yields the 24 items of the contextualized personality questionnaire. Respondents are shown each description and answer "yes" or "no," with "perhaps" as an intermediate point on what might be considered an ordinal scale. Our dataset collapses "yes" and "perhaps" to form a dichotomous scale and thus avoids the issue of whether "perhaps" is a true midpoint or another branch of a decision tree.
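The factorial structure can be spelled out directly. A quick sketch (Python here, with labels following the style of names such as S2DoScold used in this post; the dataset's actual column names may differ in capitalization):

```python
from itertools import product

# 4 triggering situations x 2 behavioral modes x 3 actions = 24 items
situations = ["S1", "S2", "S3", "S4"]   # bus, train, store, phone
modes = ["Want", "Do"]
actions = ["Curse", "Scold", "Shout"]

# One label per cell of the factorial design, e.g. "S1WantCurse"
items = [s + m + a for s, m, a in product(situations, modes, actions)]
```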

David Magis et al. provide a rather detailed analysis of this scale as a problem in differential item functioning (DIF) solved using the R package difR. However, I would like to suggest an alternative approach using nonnegative matrix factorization (NMF). My primary concern is scalability. I would like to see a more complete inventory of events that trigger verbal aggression and a more comprehensive set of possible actions. For example, we might begin with a much longer list of upsetting situations that are commonly encountered, follow up by asking respondents which situations they have experienced, and have them recall what they did in each. The result would be a much larger and sparser data matrix that might overburden a DIF analysis but that NMF could easily handle.

Hopefully, you can see the contrast between the two approaches. Here we have four contextual triggering events (bus, train, store, and phone) crossed with six different behaviors (want and do by curse, scold, and shout). An item response model assumes that responses to each item reflect each individual's position on a continuous latent variable, in this case, verbal aggression as a personality trait. The more aggressive you are, the more likely you are to engage in more aggressive behaviors. Situations may be more or less aggression-evoking, but individuals maintain their relative standing on the aggression trait.
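Under that trait assumption, the probability of endorsing an item is a monotone function of the person's trait position. A minimal Rasch-style sketch (Python, with invented person and item parameters; the "difficulties" here stand in for how much each situation evokes aggression):

```python
import math

def p_endorse(theta, b):
    """Rasch model: probability that a person at trait level theta
    endorses an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical difficulties: lower b = more aggression-evoking,
# so easier to endorse; persons range from low to high aggression.
items = {"bus": -1.0, "train": -0.5, "phone": 0.0, "store": 1.0}
table = {
    theta: {k: round(p_endorse(theta, b), 2) for k, b in items.items()}
    for theta in (-2.0, 0.0, 2.0)
}
```

Whatever the situation, the more aggressive person always has the higher endorsement probability, which is exactly what it means for individuals to maintain their relative standing on the trait.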

Nonnegative matrix factorization, on the other hand, searches for a decomposition of the observed data matrix under the constraint that all the matrices contain only nonnegative values. These nonnegativity restrictions tend to reproduce the original data matrix from additive parts, as if one were layering components on top of one another. As an illustration, suppose our sample could be separated into the shouters, the scolders, and those who curse, based on their preferred response regardless of the situation. These three components would be the building blocks, and those who shout their curses would have their data rows formed by overlaying the shout and curse components. The analysis below will illustrate this point.
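The layering idea can be demonstrated on synthetic data. The sketch below (Python/NumPy rather than the NMF R package used later; method="lee" in that package corresponds to Lee and Seung's multiplicative updates, implemented here directly) builds a checklist matrix from pure shout, scold, and curse components plus a "shout their curses" overlay group, and recovers a rank-3 factorization:

```python
import numpy as np

rng = np.random.default_rng(1219)

# Synthetic checklist: 12 situation-action items, three "pure"
# response styles (shout, scold, curse), each endorsing its own 4 items.
H_true = np.zeros((3, 12))
H_true[0, 0:4] = 1.0    # shout items
H_true[1, 4:8] = 1.0    # scold items
H_true[2, 8:12] = 1.0   # curse items

# 105 respondents: 30 pure shouters, scolders, and cursers, plus 15
# who shout their curses -- their rows overlay two components.
W_true = np.zeros((105, 3))
W_true[0:30, 0] = 1.0
W_true[30:60, 1] = 1.0
W_true[60:90, 2] = 1.0
W_true[90:105, 0] = 1.0
W_true[90:105, 2] = 1.0
V = W_true @ H_true

def nmf(V, rank, n_iter=1000, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - WH||_F
    subject to W, H >= 0; each update can only shrink the error."""
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V, 3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because negative entries are never allowed, the factorization cannot cancel one component against another; the overlay rows must be rebuilt by adding the shout and curse parts, which is the sense in which NMF recovers building blocks.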

The NMF R code is presented at the end of this post. You are encouraged to copy and run the analysis after installing difR and NMF. I will limit my discussion to the following coefficient matrix, showing the contribution of each of the 24 items after rescaling to fall on a scale from 0 to 1.


Item            Want to and   Store     Want to and   Want to   Do
                Do Scold      Closing   Do Shout      Curse     Curse
S2DoScold         1.00          0.19      0.00          0.00      0.00
S4WantScold       0.96          0.00      0.00          0.08      0.00
S4DoScold         0.95          0.00      0.00          0.00      0.11
S1DoScold         0.79          0.37      0.02          0.05      0.15

S3WantScold       0.00          1.00      0.00          0.08      0.00
S3DoScold         0.00          0.79      0.00          0.00      0.00
S3DoShout         0.00          0.15      0.14          0.00      0.00

S2WantShout       0.00          0.00      1.00          0.13      0.02
S1WantShout       0.00          0.05      0.91          0.17      0.04
S4WantShout       0.00          0.00      0.76          0.00      0.00
S1DoShout         0.00          0.12      0.74          0.00      0.00
S2DoShout         0.08          0.00      0.59          0.00      0.00
S4DoShout         0.10          0.00      0.39          0.00      0.00
S3WantShout       0.00          0.34      0.36          0.00      0.00

S1wantCurse       0.13          0.18      0.03          1.00      0.09
S2WantCurse       0.34          0.00      0.08          0.92      0.20
S3WantCurse       0.00          0.41      0.00          0.85      0.02
S2WantScold       0.59          0.00      0.00          0.73      0.00
S1WantScold       0.40          0.22      0.01          0.69      0.00
S4WantCurse       0.31          0.00      0.00          0.62      0.48

S1DoCurse         0.24          0.16      0.01          0.17      1.00
S2DoCurse         0.47          0.00      0.00          0.00      0.99
S4DoCurse         0.46          0.00      0.02          0.00      0.95
S3DoCurse         0.00          0.54      0.00          0.00      0.69

As you can see, I extracted five latent features (the columns of the above coefficient matrix). Although there are some indices in the NMF package to assist in determining the number of latent features, I followed the common practice of fitting a number of different solutions and picking the "best" of the lot. It is often informative to learn how the solution changes with the rank of the decomposition. In this case, similar structures were uncovered regardless of the number of latent features. References to a more complete discussion of this question can be found in an August 29th comment from a previous post on NMF.

Cursing was the preferred option across all the situations, and the last two columns reveal a decomposition of the data matrix with a concentration of respondents who do curse or want to curse regardless of the trigger. It should be noted that Store Closing (S3) tended to generate less cursing, as well as less scolding and shouting. Evidently there was a smaller group that was upset enough by the store closing at least to scold. This is why the second latent feature is part of the decomposition; we need to layer store closing onto those additional individuals who reacted more than the rest. Finally, we have two latent features for those who shout and those who scold across situations. As in principal component analysis, which is also a matrix factorization, one needs to note the size of the coefficients. For example, the middle latent feature reveals a higher contribution for wanting to shout than for actually shouting.

Contextualized Measurement Alters the Response Generation Process

When we describe ourselves or others, we make use of the shared understandings that enable communication (a meeting of minds or brain-to-brain transfer). These inferences concerning the causes of our own and others' behavior are always smoothed or fitted, with context ignored, forgotten, or never noticed. Statistical models of decontextualized self-reports reflect this organization imposed by the communication process. We believe that our behavior is driven by traits, and as a result, our responses can be fit with an item response model assuming latent traits.

Matrix factorization suggests a different model for contextualized self-reports. The possibilities explode with the introduction of context. Relatively small changes in the details create a flurry of new contexts and an accompanying surge in the alternative actions available. For instance, it makes a difference if the person closing the store as you are about to enter has the option of letting one more person in when you plead that it is for a quick purchase. The determining factor may be an emotional affordance, that is, an immediate perception that one is not valued. Moreover, the response to such a trigger will likely be specific to the situation and appropriately selected from a large repertoire of possible behaviors. Leaving the details out of the description only invites the respondents to fill in the blanks themselves.

You should be able to build on my somewhat limited example and extrapolate to a data matrix with many more situations and behaviors. As we saw here, individuals may have preferred responses that generalize over context (e.g., cursing tends to be overused), or perhaps there will be situation-specific sensitivity (e.g., store closings). NMF builds the data matrix from additive components that simultaneously cluster both the columns (situation-action pairings) and the rows (individuals). These components are latent, but they are not traits in the sense of dimensions over which individuals are rank-ordered. Instead of differentiating dimensions, we have uncovered the building blocks that are layered to reproduce the data matrix.

Although we are not assuming an underlying dimension, we are open to the possibility. The row heatmap from the NMF may follow a characteristic Guttman scale pattern, but this is only one of many possible outcomes. The process might unfold as follows. One could expect a relationship between the context and the response, with some situations evoking more aggressive behaviors. We could then array the situations by increasing ability to evoke aggressive actions, in the same way that items on an achievement test can be ordered by difficulty. Aggressiveness becomes a dimension when situations accumulate like correct answers on an exam, with those displaying less aggressive behaviors encountering only the less aggression-evoking situations. Individuals become more aggressive by finding themselves in, or by actively seeking, increasingly aggression-evoking situations.


R Code for the NMF Analysis of the Verbal Aggression Data Set

# access the verbal aggression data from difR
library(difR)
data(verbal)
 
# extract the 24 dichotomous items
test <- verbal[, 1:24]
apply(test, 2, table)
 
# remove respondents who endorsed no items (rows of all 0s)
none <- apply(test, 1, sum)
table(none)
test <- test[none > 0, ]
 
library(NMF)
# set seed so the nmf run can be replicated
set.seed(1219)
 
# 5 latent features chosen after
# examining several different solutions
fit <- nmf(test, 5, method = "lee", nrun = 20)
summary(fit)
basismap(fit)
coefmap(fit)
 
# rescale each row of the coefficient matrix to a 0-1 range
# and sort the items by their dominant latent feature
library(psych)
h <- coef(fit)
max_h <- apply(h, 1, max)
h_scaled <- h / max_h
fa.sort(t(round(h_scaled, 3)))
 
# hard clusters: assign each respondent to the
# latent feature with the largest basis weight
W <- basis(fit)
W2 <- max.col(W)
 
# profile the clusters
table(W2)
t(aggregate(test, by = list(W2), mean))
