Sunday, June 5, 2016

Building the Data Matrix for the Task at Hand and Analyzing Jointly the Resulting Rows and Columns

Someone decided what data ought to go into the matrix. They placed the objects of interest in the rows and the features that differentiate among those objects into the columns. Decisions were made either to collect information or to store what was gathered for other purposes (e.g., data mining).


A set of mutually constraining choices determines what counts as an object and a feature. For example, in the above figure, the beer display case in the convenience store contains a broad assortment of brands in both bottles and cans grouped together by volume and case sizes. The features varying among these beers are the factors that consumers consider when making purchases for specific consumption situations: beers to bring on an outing, beers to have available at home for self and guests, and beers to serve at a casual or formal party.

These beers and their associated attributes would probably not have made it into a taste test at a local brewery. Craft beers come with their own set of distinguishing features: spicy, herbal, sweet and burnt (see the data set called beer.tasting.notes in the ExPosition R package). Yet, preference still depends on consumption occasion for a craft beer needs to be paired with food and desires vary by time of day and ongoing activities and who is drinking with you. What gets measured and how it is measured depends on the situation and the accompanying reasons for data collection (the task at hand).

It should be noted that data matrix construction may evolve over time and may change with context. Rows and columns are added and deleted in order to maintain a type of synchronization with one beer suggesting the inclusion of other beers in the consideration set and at the same time inserting new columns into the data matrix in order to differentiate among the additional beers.

Given such mutual dependencies, why would we want to perform separate analyses of the rows (e.g., cluster analysis) and the columns (e.g., factor analysis), as if row (column) relationships were invariant regardless of the columns (rows)? That is, the similarity of the rows in a cluster analysis depends on the variables populating the columns, and the correlations among the variables are calculated as shared covariation over the rows. The correlation between two variables changes with the individuals sampled (e.g., restriction of range), and the similarity between two observations fluctuates with the basis of comparison (e.g., features included in the object representation). Given the shared determination of what gets included as the rows and the columns of our data matrix, why wouldn't we turn to a joint scaling procedure such as correspondence analysis (when the cells are counts) or biplots (when the cells are ratings or other continuous measures)?

We can look at such a joint scaling using that beer.tasting.notes data set mentioned earlier in this post. Some 38 craft beers were rated on 16 different flavor profiles. Although the ExPosition R package contains functions that will produce a biplot, I will run the analysis with FactoMineR because it may be more familiar and I have written about this R package previously. The biplot below could have been shown in a single graph, but it might have been somewhat crowded with 38 beers and 16 ratings on the same page. Instead, by default, FactoMineR plots the rows on the Individuals factor map and the columns on the Variables factor map. Both plots come from a single principal component analysis with rows as points and columns as arrows. The result is a vector map with directional interpretation, so that points (beers) have perpendicular projections onto arrows (flavors) that reproduce as closely as possible the beer's rating on the flavor variable.



Beer tasting provides a valuable illustration of the process by which we build the data matrix for the task at hand. Beer drinkers are not as likely as those drinking wine to discuss the nuances of taste so that you might not know the referents for some of these flavor attributes. As a child, you learned which citrus fruits are bitter and sweet: lemons and oranges. How would you become a beer expert? You would need to participate in a supervised learning experience with beer tastings and flavor labels. For instance, you must taste a sweet and a bitter beer and be told which is which in order to learn the first dimension of sweet vs. bitter as shown in the above biplot.

Obviously, I cannot serve you beer over the Internet, but here is a description of Wild Devil, a beer positioned on the Individuals Factor Map in the direction pointed to by the labels Bitter, Spicy and Astringent on the Variables Factor Map.
It’s arguable that our menacingly delicious HopDevil has always been wild. With bold German malts and whole flower American hops, this India Pale Ale is anything but prim. But add a touch of brettanomyces, the unruly beast responsible for the sharp tang and deep funk found in many Belgian ales, and our WildDevil emerges completely untamed. Floral, aromatic hops still leap from this amber ale, as a host of new fermentation flavor kicks up notes of citrus and pine.
Hopefully, you can see how the descriptors in our columns acquire their meaning through an understanding of how beers are brewed. It is an acquired taste with the acquisition made easier by knowledge of the manufacturing process. What beers should be included as rows? We will need to sample all those variations in the production process that create flavor differences. The result is a relatively long feature list. Constraints are needed and supplied by the task at hand. Thankfully, the local brewery has a limited selection, as shown in the very first figure, so that all we need are ratings that will separate the five beers in tasting tray.

In this case the data matrix contains ratings, which seem to be the averages of two researchers. Had there been many respondents using a checklist, we could have kept counts and created a similar joint map with correspondence analysis. I showed how this might be done in an earlier post mapping the European car market. That analysis demonstrated how the Prius has altered market perceptions. Specifically, Prius has come to be so closely associated with the attributes Green and Environmentally Friendly that our perceptual map required a third dimension anchored by Prius pulling together Economical with Green and Environmental. In retrospect, had the Prius not been included in the set of objects being rated, would we have included Green and Environmentally Friendly in the attribute list?

Now, we can return to our convenience store with a new appreciation of the need to analyze jointly the rows and the columns of our data matrix. After all, this is what consumers do when they form consideration sets and attend only to the differentiating features for this smaller number of purchase alternatives. Human attention enables us to find the cooler in the convenience store and then ignore most of the stuff in that cooler. Preference learning involves knowing how to build the data matrix for the task at hand and thus simplifying the process by focusing only on the most relevant combination of objects and features.

Finally, I have one last caution. There is no reason to insist on homogeneity over all the cells of our data matrix. In the post referenced in the last paragraph, Attention is Preference, I used nonnegative matrix factorization (NMF) to jointly partition the rows and columns of the cosmetics market into subspaces of differing familiarity with brands (e.g., brands that might be sold in upscale boutiques or department stores versus more downscale drug stores). You should not be confused because the rows are consumers and not objects such as car models and beers. The same principles apply whenever goes in the rows. The question is whether all the rows and columns can be represented in a single common space or whether the data matrix is better described as a number of subspaces where blocks of rows and columns have higher density with sparsity across the rest of the cells. Here are these attributes that separate these beers, and there are those attributes that separate those beers (e.g., the twist-off versus pry-off cap tradeoff applies only to beer in bottles).



R Code to Run the Analysis:
# ExPosition must be installed
library(ExPosition)
 
# load data set and examine its structure
data("beer.tasting.notes")
str(beer.tasting.notes)
head(beer.tasting.notes$data)
 
# how ExPosition runs the biplot
pca.beer <- epPCA(beer.tasting.notes$data)
 
# FactoMiner must be installed
library(FactoMineR)
 
# how FactoMineR runs the biplot
fm.beer<-PCA(beer.tasting.notes$data)
Created by Pretty R at inside-R.org

1 comment:

  1. Hi Joel,

    Interesting read. In many cases as a market researcher I analyze datasets where the row represent individual respondents which are of less interest to the client in isolation. As a result we often aggregate these respondents into groups and use the group scores on the attributes(columns) to compose biplot maps. These groups can often be derived market segments. With the spirit of your question: "why would we want to perform separate analyses of the rows (e.g., cluster analysis) and the columns (e.g., factor analysis)..?" in mind, what are your thoughts on performing factor analyses on attributes, using resulting factors (among other vars) to assign segments, and then constructing a CA/Biplot on the segment means for visualization purposes? Are you saying this is overkill and that we achieve the same goal with the joint scaling procedure, or can it add value, depending upon how the measures used to derive the segmentation (and the segments themselves) mirror reality?

    ReplyDelete