Wednesday, July 2, 2014

Using Biplots to Map Cluster Solutions

FactoMineR is a quick and easy R package for generating biplots. As you might recall from a previous post, a biplot maps a data matrix by plotting both the rows and the columns in the same figure: the columns (variables) as arrows and the rows (individuals) as points. By default, FactoMineR avoids cluttered maps by splitting the display into two plots, a variables factor map and an individuals factor map. The variables factor map appears below, and the individuals factor map will be shown later in this post.
The dataset comes from David Wishart's book Whisky Classified: Choosing Single Malts by Flavour. Some 86 whiskies from different regions of Scotland were rated on 12 aromas and flavors from "not present" (a rating of 0) to "pronounced" (a rating of 4). Luba Gloukhov ran a cluster analysis with this data and plotted the location where each whisky was distilled on a map of Scotland. The dataset can be retrieved as a csv file using the R function read.csv("clipboard"). All you need to do is go to the web site, select and copy the header and the data, and run read.csv pointing to the clipboard. All the R code is presented at the end of this post.

Each arrow in the above plot represents one of the 12 ratings. FactoMineR takes the 86 x 12 matrix and performs a principal component analysis. The first principal component is labeled Dim 1 and accounts for almost 27% of the total variation. Dim 2 is the second principal component and accounts for an additional 16% of the variation. One can read the component loadings for any rating by noting the perpendicular projection of the arrowhead onto each dimension. Thus, Medicinal and Smoky have high loadings on the first principal component, with Sweetness, Floral and Fruity anchoring the negative end. One could continue in the same manner with the second principal component. At some point, however, we might notice the semi-circle that runs from Floral, Sweetness and Fruity through Nutty, Winey and Spicy to Smoky, Tobacco and Medicinal. That is, the features sweep out a one-dimensional arc, not unlike a multidimensional scaling of color perceptions (see Figure 1).
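The percentages on the axes and the arrow coordinates can be reproduced by hand. The following is only a sketch: it uses base R's prcomp on simulated ratings (a random stand-in with the same 86 x 12 shape, not the whisky data), and it assumes a standardized PCA, which is also FactoMineR's default.

```r
# Simulated stand-in for the 86 x 12 ratings matrix (NOT the whisky data)
set.seed(1)
ratings <- as.data.frame(matrix(sample(0:4, 86 * 12, replace = TRUE), nrow = 86))
names(ratings) <- paste0("rating", 1:12)   # hypothetical column names

# Standardized PCA: each rating is centered and scaled to unit variance
pc <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Percentage of the total variance carried by each component
pct_var <- 100 * pc$sdev^2 / sum(pc$sdev^2)
round(pct_var[1:2], 1)

# Arrow coordinates on the first two dimensions: eigenvector entries
# rescaled by the component standard deviations
arrows_xy <- sweep(pc$rotation[, 1:2], 2, pc$sdev[1:2], "*")

# The same numbers are the correlations between ratings and component scores
check <- cor(ratings, pc$x[, 1:2])
all.equal(unname(arrows_xy), unname(check))
```

Because the arrow coordinates are correlations, reading an arrowhead's projection onto Dim 1 or Dim 2 is the same as reading how strongly that rating correlates with the component.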
Now, we will add the 86 points representing the different whiskies. But first we will run a cluster analysis so that when we plot the whiskies, different colors will indicate cluster membership. I have included the R code to run both a finite mixture model, using the R package mclust, and k-means. Both procedures yield four-cluster solutions that classify over 90% of the whiskies into the same clusters. Luba Gloukhov also extracted four clusters by looking for an "elbow" in the plot of the within-cluster sum-of-squares from two through nine clusters. By default, Mclust will test one through nine clusters and select the best model using the BIC as the selection criterion. The cluster profiles from mclust are presented below.
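The elbow check and the comparison of two cluster solutions described above might look like the following sketch. The data are simulated (four artificial groups, not the whisky ratings), and a Ward hierarchical clustering stands in for Mclust so that the example needs only base R. The agreement figure matches each k-means cluster to its most common counterpart, so it can overstate agreement when two clusters map to the same partner.

```r
# Simulated stand-in for the ratings (NOT the whisky data):
# 86 rows, 12 columns, four groups with different mean profiles
set.seed(42)
group <- rep(1:4, times = c(25, 25, 18, 18))
centers <- matrix(runif(4 * 12, min = 0, max = 4), nrow = 4)
ratings <- centers[group, ] + matrix(rnorm(86 * 12, sd = 0.5), nrow = 86)

# Total within-cluster sum of squares for two through nine clusters
ks <- 2:9
wss <- sapply(ks, function(k) kmeans(ratings, centers = k, nstart = 25)$tot.withinss)
names(wss) <- ks
if (interactive()) {
  plot(ks, wss, type = "b",
       xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
}

# Two four-cluster solutions on the same rows
km <- kmeans(ratings, centers = 4, nstart = 25)$cluster
hc <- cutree(hclust(dist(ratings), method = "ward.D2"), k = 4)  # stand-in for Mclust

# Cluster labels are arbitrary, so agreement is read off the cross-tabulation
tab <- table(km, hc)
agreement <- sum(apply(tab, 1, max)) / sum(tab)
tab
agreement
```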

             Black    Red  Green   Blue  Total
Whiskies        27     36      6     17     86
Share          31%    42%     7%    20%   100%
Body           2.7    1.4    3.7    1.9    2.1
Sweetness      2.4    2.5    1.5    2.1    2.3
Smoky          1.5    1.0    3.7    1.9    1.5
Medicinal      0.0    0.2    3.3    1.0    0.5
Tobacco        0.0    0.0    0.7    0.3    0.1
Honey          1.9    1.1    0.2    1.0    1.3
Spicy          1.6    1.1    1.7    1.6    1.4
Winey          1.9    0.5    0.5    0.8    1.0
Nutty          1.9    1.3    1.2    1.4    1.5
Malty          2.1    1.7    1.3    1.7    1.8
Fruity         2.1    1.9    1.2    1.3    1.8
Floral         1.6    2.1    0.2    1.4    1.7
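A profile table of this kind can be assembled from any vector of cluster labels. The sketch below uses simulated ratings and a k-means classification as a stand-in for fmm$classification; the column names and the data are hypothetical, so the numbers will not match the table above.

```r
# Simulated stand-in for the ratings (NOT the whisky data)
set.seed(42)
group <- rep(1:4, times = c(25, 25, 18, 18))
centers <- matrix(runif(4 * 12, min = 0, max = 4), nrow = 4)
ratings <- as.data.frame(centers[group, ] + matrix(rnorm(86 * 12, sd = 0.5), nrow = 86))
names(ratings) <- paste0("rating", 1:12)   # hypothetical column names

# Any classification vector works here; in the post it would be fmm$classification
cl <- kmeans(ratings, centers = 4, nstart = 25)$cluster

# Cluster sizes, with a Total column
sizes <- c(table(cl), Total = length(cl))

# Mean rating per cluster (12 x 4), plus the overall mean
means <- sapply(split(ratings, cl), colMeans)
profile <- cbind(means, Total = colMeans(ratings))

sizes
round(100 * sizes / length(cl))   # percentage of rows in each cluster
round(profile, 1)
```

The Total column is the size-weighted average of the cluster columns, which is a quick consistency check on any such table.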

Finally, we are ready to look at the biplot with the rows represented as points and the color of each point indicating cluster membership, as shown below in what FactoMineR calls the individuals factor map. To begin, we can see clear separation by color, suggesting that the differences among the clusters reside in the first two dimensions of this biplot. It is important to remember that the cluster analysis does not use the principal component scores. There is no data reduction prior to the clustering.
The Green cluster contains only 6 whiskies and falls toward the right of the biplot. This is the same direction as the arrows for Medicinal, Tobacco and Smoky. Moreover, the Green cluster received the highest scores on these features. Although the arrow for Body does not point in that direction, you should be able to see that the perpendicular projection of the Green points will be higher than that for any other cluster. The arrow for Body is pointed upward because a second and larger cluster, the Black, also receives a relatively high rating. This is not the case for the other three ratings. Green is the only cluster with high ratings on Smoky or Medicinal. Similarly, though none of the whiskies score high on Tobacco, the six Green whiskies do get the highest ratings.
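Reading a cluster's standing on a rating from the map amounts to projecting the cluster's points onto that rating's arrow. Here is a rough sketch of that arithmetic with base R's prcomp on simulated data (not the whisky ratings); the known simulated groups stand in for the cluster labels, and the arrow direction is taken from the eigenvector entries, which is an assumption of this example rather than something FactoMineR reports in this form.

```r
# Simulated stand-in for the ratings (NOT the whisky data)
set.seed(42)
group <- rep(1:4, times = c(25, 25, 18, 18))   # stand-in for cluster labels
centers <- matrix(runif(4 * 12, min = 0, max = 4), nrow = 4)
ratings <- as.data.frame(centers[group, ] + matrix(rnorm(86 * 12, sd = 0.5), nrow = 86))
names(ratings) <- paste0("rating", 1:12)       # hypothetical column names

pc <- prcomp(ratings, scale. = TRUE)
scores <- pc$x[, 1:2]                          # points on the individuals map
direction <- pc$rotation["rating1", 1:2]       # direction of one rating's arrow

# Centroid of each cluster in the two-dimensional map (4 x 2)
centroids <- apply(scores, 2, function(s) tapply(s, group, mean))

# Signed length of each centroid's perpendicular projection onto the arrow
proj <- drop(centroids %*% direction) / sqrt(sum(direction^2))

# Actual cluster means on the standardized rating, for comparison
actual <- tapply(as.vector(scale(ratings$rating1)), group, mean)
cbind(projection = proj, actual_z = actual)
```

The two columns will tend to order the clusters similarly when the first two dimensions capture most of that rating's variation, and can disagree when they do not.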

You can test your ability to interpret biplots by asking on what features the Red cluster should score the highest. Look back up to the vector map, and identify the arrows pointing in the same direction as the Red cluster or pointing in a direction so that the Red points will project toward the high end of the arrow. Do you see at least Floral and Sweetness? The process continues in the same manner for the Black cluster, but the Blue cluster, like its points, falls in the middle without any distinguishing features.

Hopefully, you have not been troubled by my relaxed and anthropomorphic writing style. Vectors do not reposition themselves so that all the whiskies earning high scores will project toward their high ends, and points do not move around looking for the one location that best reproduces all their ratings. However, principal component analysis does use a singular value decomposition to factor the data matrix into row and column components that reproduce the original data as closely as possible in a small number of dimensions. Thus, there is some justification for such talk, and it helps with the interpretation to let these vectors and points come alive and have their own intentions.
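The factoring mentioned above can be written out in a few lines. This is a sketch with base R's svd on a simulated, standardized matrix (not the whisky data): the row components become the points, the column components become the vectors, and their product is the best two-dimensional reproduction of the data in the least-squares sense.

```r
# Simulated, standardized stand-in for the 86 x 12 ratings (NOT the whisky data)
set.seed(1)
X <- scale(matrix(sample(0:4, 86 * 12, replace = TRUE), nrow = 86))

s <- svd(X)                                 # X = U D V'

# Keep two dimensions: rows as points, columns as vectors
row_pts  <- s$u[, 1:2] %*% diag(s$d[1:2])   # 86 x 2
col_vecs <- s$v[, 1:2]                      # 12 x 2
X2 <- row_pts %*% t(col_vecs)               # rank-two reproduction of X

# Share of the total sum of squares reproduced by two dimensions
fit <- sum(s$d[1:2]^2) / sum(s$d^2)
fit

# What is lost equals the discarded singular values (squared and summed)
lost <- sum((X - X2)^2)
all.equal(lost, sum(s$d[-(1:2)]^2))

# The row points are the principal component scores, up to sign
pc <- prcomp(X)
all.equal(abs(row_pts), abs(unname(pc$x[, 1:2])))
```

Each entry of X2 is the inner product of one point with one vector, which is why projecting a point onto an arrow gives an approximate reading of that rating.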

What Did We Do and Why Did We Do It?

We began by trying to understand a cluster analysis derived from a data matrix containing the ratings for 86 whiskies across 12 aroma and taste features. Although this is not a large data matrix, one still has some difficulty uncovering any underlying structure by looking at one variable/column at a time. The biplot helps by creating a low-dimensional graphic display with ratings as vectors and whiskies as points. The ratings appeared to be arrayed along an arc from floral to medicinal, and the 86 whiskies were located as points in this same space.

Now, we are ready to project the cluster solution onto this biplot. By using the separate ratings, the finite mixture model worked in the 12-dimensional rating space and not in the two-dimensional world of the biplot. Yet, we see relatively coherent clusters occupying different regions of the map. In fact, except for the Blue cluster falling in the middle, the clusters move along the arc from a Red floral to a Black malty/honey/nutty/winey to a Green medicinal. The relationships among the four clusters are revealed by their color coding on the biplot. They are no longer four qualitatively distinct entities, but a continuum of locally adjacent groupings arrayed along a nonlinear dimension from floral to medicinal.

R code needed to run all the analysis in this post.

# read the data from the external site
# after copying the header and rows into the clipboard
# ("clipboard" works on Windows; on macOS, pipe("pbpaste") is the usual substitute)
data <- read.csv("clipboard")
ratings <- data[, 3:14]   # the 12 rating columns

# runs finite mixture model
library(mclust)
fmm <- Mclust(ratings)
fmm
table(fmm$classification)
fmm$parameters$mean

# compares with k-means solution
set.seed(123)   # any fixed seed makes the random starts reproducible
kcl <- kmeans(ratings, 4, nstart = 25)
table(fmm$classification, kcl$cluster)

# creates biplots
library(FactoMineR)
pca <- PCA(ratings)
plot(pca, choix = "ind", label = "none", col.ind = fmm$classification)


4 comments:

  1. Fascinating - both analytically and from a whisky point of view! Is there any reason it would be a bad idea to overlay the two plots?

    1. It can get a little cluttered with 86 points and 12 vectors on the same plot, especially if one labels the points. Separating the two plots, as FactoMineR does by default, is one solution. The BiplotGUI R package takes a different approach. It places both rows and columns on the same plot as points and vectors. Then, it allows you to interact with the plot by highlighting specific columns or points and seeing only those relationships. It all depends on your preference and intent.

    2. I can well believe it! Thanks for the tip about BiplotGUI, I'll have a look.

  2. No, but I visited using your link. We are not able to speak of the complexities of the taste experience until we are provided the vocabulary. Thanks.
