Monday, May 4, 2015

Clusters May Be Categorical but Cluster Membership Is Not All-or-None

Very early in the study of statistics and R, we learn that random variables can be either categorical or continuous. Regrettably, we are forced to relearn this distinction over and over again as we debug error messages produced by our code (e.g., "x must be numeric"). R will remind us that if the function expects an argument to be a factor, our input ought to be a factor (although sometimes the function will do the conversion for us). Dichotomous variables do give us some flexibility, for sex can be entered as a factor with values 'male' and 'female' or coded as numeric with values of 0 and 1 indicating degree of 'maleness' or 'femaleness' depending on whether male or female is assigned the value of 1. Similarly, when the categorical variable has many levels, there is no reason not to select one of the levels as the basis for comparison. Then, the dummy coding remains 0 and 1, with the base level coded as 0 in every comparison (e.g., Catholic vs Protestant, Jewish vs Protestant, and so on).
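
To make the dummy coding concrete, here is a minimal sketch, written in plain Python rather than R so that it runs without any packages. The helper name dummy_code and the five made-up respondents are my own assumptions for illustration. Each non-base level gets its own 0/1 column, and the base level (here Protestant) is the row of all 0s.

```python
def dummy_code(values, base):
    """Return one 0/1 indicator column per non-base level (hypothetical helper)."""
    levels = sorted(set(values) - {base})
    return {
        f"{level}_vs_{base}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Made-up data: five respondents and their religion
religion = ["Protestant", "Catholic", "Jewish", "Catholic", "Protestant"]
coded = dummy_code(religion, base="Protestant")

for name, column in coded.items():
    print(name, column)
```

A Protestant respondent is 0 in both columns, so every coefficient estimated from these columns is a comparison against the base level.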

Categories vs. Dimensions or Continuous vs. Discrete

The debate over psychiatric classification has brought the battle into the news, as have changes in the admission policies of women's colleges to accept transgender applicants. I have discussed the issue under both clustering and latent variable modeling. It seems just too easy to dissolve the boundaries and blur the distinctions for almost any categorization scheme. For instance, race is categorical, and one is either European or Asian, unless, of course, one is some mixture of the two. I have borrowed that example and the following figure from a video lecture by Katherine Heller (also shown as Figure 1 in her paper).


We are familiar with finite mixture models from the mclust R package. Although I have shown only the contour ellipses, you should be able to imagine the two multivariate normal distributions in some high-dimensional space that would separate Europeans (perhaps the blue ellipse) and Asians (which then must be in green). As geographical barriers fall, racial membership becomes partial with many shades or groupings between the ideal types of the finite mixture model.

Both the finite mixture model (FMM) and the mixed membership model (MMM) permit data points to fall between the centroids of the two most extreme densities. For example, a finite mixture model will yield a probability of cluster membership that can range from 0 to 1. However, the probability from the finite mixture model represents classification error, which increases as the clusters overlap more. This is not unlike the misclassification from a discriminant analysis: the groups are distinct, but we are unable to separate them completely with the available data. The probabilities from the partial or mixed membership model, on the other hand, do not represent uncertainty but a span of possible clusters arrayed between the two extreme ideals.
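
The finite mixture side of this contrast can be sketched in a few lines of plain Python. This is an illustration under my own assumptions (one dimension, two normal components, equal weights, centroids at -2 and 2), not the mclust algorithm. The posterior probability comes from Bayes' rule, and widening the components pulls it toward 0.5 even for a point sitting exactly on a centroid.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_first(x, weights, means, sds):
    """P(first component | x) for a two-component mixture, by Bayes' rule."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return dens[0] / sum(dens)

# Hypothetical parameters chosen only for illustration
weights, means = [0.5, 0.5], [-2.0, 2.0]

# Same point (the first centroid), little overlap versus heavy overlap
tight = posterior_first(-2.0, weights, means, [1.0, 1.0])
wide = posterior_first(-2.0, weights, means, [3.0, 3.0])
midpoint = posterior_first(0.0, weights, means, [1.0, 1.0])

print("little overlap:", tight)
print("heavy overlap: ", wide)
print("midpoint:      ", midpoint)
```

In the finite mixture reading, a value near 0.5 says only that we cannot tell which single cluster generated the point. A mixed membership model would instead treat such a value as the point's actual share of each ideal type.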

The analysis of the voting record from United States senators presented toward the end of Heller's paper might help us understand this previous point. One might infer a latent continuum that separates Democrats and Republicans, but the distribution is not uniform. Democrats tend to bunch at one end and Republicans clump at the other. In between are Democrats from more conservative southern states and Republicans from more liberal northern states. One might argue that the voting dynamics or data generation processes are different for these clusters so that it makes sense to think of the continuum as separated into four regions with boundaries imposed by the political forces constraining their votes.

Interestingly, we learn something about the bills and resolutions in the process of accounting for differences among senators. Some votes are not strictly party-line. For example, senators from states with large military bases often vote the same on appropriations impacting their constituency regardless of their party affiliation. More importantly, the party accepts such votes as necessary and does not demand loyalty unless it is necessary to pass or defeat important bills. Legislation comes in different types and each elicits its own voting strategies (the switching interpretation).

Implementations of Mixed Membership Models in R

One of the less demanding approaches is the grade of membership (GoM) model. The slides from April Galyardt's NIPS 2012 workshop illustrate the major points (see her dissertation for a more complete discussion). R implements the GoM model with the gom.em function in the package sirt. For a somewhat more general treatment, R offers a battery of IRT mixture model packages (e.g., psychomix and mixRasch). However, nonnegative matrix factorization (NMF) accomplishes a similar mixed membership modeling simultaneously for both the rows and columns of a data matrix using only a decomposition procedure from linear algebra.

Simply put, NMF seeks a common set of latent variables to serve as the basis for both the rows and columns of a data matrix. In our roll call voting example, we might list the senators as the rows and the bills as the columns. Legislation that yielded pure party-line votes would be placed at the extremes of a latent representation that might be called party affiliation. Senators who always vote with their party would also be placed at the ends of a line separating these pure types. We might call this latent variable a dimension representing the Democrat-Republican continuum, although the distribution appears bimodal. Any two points define a line, so we can always infer a dimension even if there is little or no density except at the ends.

Some votes demand party loyalty, but other measures evoke a "protect-my-seat" response (e.g., any bill that helps a large industry or constituency in the senator's state). Such measures would move some senators away from the edges of the liberal-conservative divide as they switched voting strategies from party to state. Alternatively, the bill may elicit social versus economic concerns, or provoke nervousness concerning a primary challenge. Each voting strategy will group bills and cluster senators by generating latent basis vectors. You can think of an NMF as a joint factor analysis of the votes (columns) and cluster analysis of the senators (rows). Each latent variable is a voting strategy, so that senators who switch strategies depending on the bill will have mixed memberships, as will bills that can be voted for or against for different reasons.
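
To see how one decomposition can describe both senators and bills, here is a minimal sketch in plain Python. It uses Lee and Seung's multiplicative updates for squared error, which is one common NMF algorithm and not necessarily what any particular R package runs by default. The vote matrix is invented (1 means the senator supported the bill), and the function names are my own.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Squared reconstruction error of V by the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iters=500, seed=0, eps=1e-9):
    """Factor V (rows x cols) into nonnegative W (rows x rank) and H (rank x cols)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def membership_shares(w, h):
    """Rescale so each row of H sums to 1, then turn each row of W into shares."""
    scale = [sum(row) for row in h]
    scaled = [[x * s for x, s in zip(row, scale)] for row in w]
    return [[x / sum(row) for x in row] for row in scaled]

# Invented roll call: rows are senators, columns are bills
votes = [
    [1, 1, 1, 0, 0, 0],  # votes only for the first bloc of bills
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # supports bills from both blocs
    [0, 0, 0, 1, 1, 1],  # votes only for the second bloc of bills
    [0, 0, 0, 1, 1, 1],
]

w, h = nmf(votes, rank=2)
shares = membership_shares(w, h)
for row in shares:
    print([round(x, 2) for x in row])
```

The rows of H group the bills, and the rescaled rows of W give each senator's share of each voting strategy. A senator who votes with both blocs should end up with weight on both latent variables rather than being forced into one cluster.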
