As Wilkinson and Friendly note in their American Statistician paper on The History of the Cluster Heat Map, "a heat map is a visual reflection of a statistical model." Consequently, the unordered data matrix and its accompanying heat map provide little assistance when searching for structure. Nor will any reordering do, only one that reflects the data generation process. The underlying pattern in the data is discoverable only when we have rearranged the rows and columns in a manner consistent with the processes that produced our data in first place. Jacques Bertin is a bit more poetic, "A graphic is not 'drawn' once and for all; it is 'constructed' and reconstructed until it reveals all the relationships constituted by the interplay of the data."
To get more specific, let us look at a heat map from a simulated data matrix with 150 respondents rating on a nine-point scale how much they agreed with each of eight attitude statements measuring their willingness to adopt cloud computing technology.
This is cluster heat map with yellow indicating higher agreement and red showing the lower end of the same scale. Both the rows and the columns have been ordered by separate hierarchical cluster analyses. It appears that the columns are divided between the four positive attitudes toward cloud computing on the left and the four negative statements on the right. The rows are also ordered. The first 50 respondents give cloud computing a "thumbs-down" by endorsing the negative statements and rejecting the positive ones. The reverse is true for the last 50 respondents. Those in middle are redder than the other respondents, suggesting that they might be more ambivalent or simply less informed. It seems that the heat map can reveal the presence of well-separated clusters when we reorder the rows and columns to reflect that clustering. Here is the same heat map with random sorting of the rows and columns.
The difference between these two heat maps is not that one is ordered and the other is not. There are forces at work in the data matrix creating differences across both the respondents (rows) and the variables (columns). The row dendrogram works because we have three clusters or segments of respondents, each with a unique pattern of attitude ratings. Similarly, the factor structure underlying the eight ratings is mirrored in the column dendrogram. Little would have been revealed had the sorting not been based on a statistical model that captured these forces.
Perhaps it would help if we looked at another example with a different underlying structure. Below, Wilkinson and Friendly show a binary data matrix display with several different settlements as the rows. They have noted when each settlement offered each of the functions listed in the columns by darkening the cell.
Because the rows and columns have been sorted or reordered, we can clearly see the underlying pattern. The settlements with the most functions were reordered to be located near the top of the table. Similarly, the functions were sorted so that the least frequent ones were positioned toward the right hand side of the table. Black and white cells were used rather than the numbers zero and one, but the same pattern would be revealed regardless of how we represented presence or absence. The numbers next to each settlement name indicate the population in thousands, but this information was not used to reorder the matrix.
The pattern that we observe is known as the Guttman Scalogram. That is, the columns have been arranged in ascending order so the presence of a function at a higher level implies the presence of all the functions at lower levels. Item response theory (IRT) can be derived as a probabilistic generalization of this unidimensional model describing the relationship between the rows and columns of a data matrix. It is a cumulative model, as can be clearly seen from the above visualization. All the settlements provide the same few basic functions, and functions tended to added on in a relatively fixed order as one moves up the data table.
Of course, we could have fit an item response model using the ltm R package without generating our reordered matrix visualization. But what if the underlying pattern had not been a Guttman Scalogram? For example, what if we discovered a block-diagonal structure, such as the following from Jacques Bertin?
The features are no longer cumulative, so IRT is no longer useful as a model. Larger communities do not retain the features found in smaller villages. The defining features of rural villages are replaced with substitutes that fulfill the same function; one-room school houses (ecole classe unique) are displaced by grade-level classrooms (collège or high schools). We still have a single dimension, but now the levels are stages or ordered types. Displacement describes a disruptive process in which the "turning on" of one cluster of cells at a higher level results in the "turning off" of another cluster of cells at a lower level.
This data set, called Townships, comes with the R package seriation (see Section 5.6). Although one could reorder the rows and column manually for small data matrices, we will need some type of loss function when the data matrix is large. Hahsler, Hornik, and Buchta used the bond energy algorithm (BEA) as their method in their seriate( ) function to produce a Bertin Plot almost identical to the reorder matrix above. BEA seeks a "clumpy" arrangement of the rows and columns so as to maximize contiguous chunks. Importantly, this algorithm seems to mimic the displacement described earlier as the response generation process.
In order to avoid confusion, it should be noted that the data matrices that we have considered so far have been two-mode, meaning that the rows and columns have been two different types of objects. Seriation can also be applied to one-mode matrices, for example, a matrix of dissimilarities where the rows and the columns refer to the same objects. I have decided to keep the two types of analyses separate by not discussing one-mode seriation in this post.
Seriation is not cluster analysis, although hierarchical cluster analysis can be used to seriate as we saw in the beginning of this post when we examined a cluster heat map. Like cluster analysis, it is unsupervised learning with many different algorithms. However, unlike cluster analysis, seriation seeks a global continuum along which to order all the objects. Seriation has no objection to local structure with objects clustered together along the path, as long as there is a sequencing of all the objects (e.g., an optimal leaf ordering that interconnect objects on the edges of adjacent clusters). In this sense, seriation is an exploratory technique seeking to reveal the patterns contained in the data matrix. And because there is such a diversity of structures, seriation requires many different algorithms for regrouping row and columns (see an historical overview of Seriation and Matrix Reordering Methods by Innar Liiv).
The Law of the Instrument ("Got Hammer - See Nails Everywhere") leads us to overuse our models without careful thought about the data generation process. Visualization can be the antidote, but we must remember that visualization is theory-laden. It looks easy only when we have a clear idea of how the data were generated and know how to reorder the data matrix. Each permutation yields a different view of the data matrix, and a diversity of viewpoints is required to tease out the hidden patterns.
Nice article, do you think you could show us the code used to generate the top two heat-maps?
ReplyDeleteLet ratings be a 150x8 data frame with 150 respondents and 8 variables representing the 1 to 9 ratings. The heatmap() function from stats wants a matrix and lets you do some formatting of the plot. The first heatmap uses the default ordering from hierarchical clustering.
Deleteheatmap(as.matrix(ratings), cexCol=1.0, cexRow=0.5, margins = c(10,1))
For the second heatmap, I randomized the position of the rows and columns. Then I run the heatmap without any sorting.
x<-as.matrix(ratings)
x<-x[sample(seq_len(nrow(x))),]
x<-x[,sample(seq_len(ncol(x)))]
heatmap(x, cexCol=1.0, cexRow=0.3, margins = c(10,1), Rowv=NA, Colv=NA)
This is very interesting stuff. Thank you for sharing!
ReplyDelete