The R package mokken contains a dataset called acl with 433 students who looked at 218 adjectives and told us whether the adjectives described themselves using a 5point rating scale from 0=completely disagree to 4=completely agree. I borrowed this dataset because it has a sizable number of items so that it might pose a challenge for traditional factor analysis. There will be no discussion of the nonparametric item response modeling performed by the mokken package.
Factor Analysis of the Adjective Checklist Ratings
Although R offers many alternatives for running a factor analysis, you might wish to become familiar with the psych package including its extensive documentation and broad coverage of psychometrics. We will start the analysis by attempting to determine the number of factors to extract. The scree plot is presented below.
There appears to be a steep drop for the first six components and then a leveling off. I tried 10 factors, but found that 9 factors yielded an interpretable factor structure. Technically, one might argue that we are not performing a factor analysis since we are not replacing the main diagonal of the correlation matrix with communality estimates. We call this a principal axis or just principal factor analysis for it consists of extracting and rotating principal components.
I have shown only a portion of the factor loadings for the 218 items below. You can reproduce the analysis yourself by installing the mokken package and running the R code at the end of the post. The nine factors were named based on the factor loadings indicating the correlations between the observed adjective ratings and the latent variables. The names have been kept short, and you are free to change them as you deem appropriate. Naming is an art, yet be careful not to add surplus meaning by being overly creative. Several of the factors loadings are negative, for example, an open person is not selfcentered or egotistical. The star next to the name indicates that the scaling has been reversed so that unfriendly* actually means friendly and hostile* is not hostile. This is how the dataset handles negatively worded items.
Open

Calm

Creative

Orderly

Outgoing

Friendly

Smart

Headstrong

Pretty


egotistical

0.63

0.22


illiberal*

0.62

0.13

0.20

0.10

0.12


hostile*

0.61

0.31

0.24

0.11


selfcentered

0.60

0.14

0.15


unfriendly*

0.60

0.29

0.13


tense*

0.11

0.74

0.14


relaxed

0.68

0.17

0.16


nervous

0.12

0.68

0.15

0.10


leisurely

0.62

0.21

0.22

0.11


stable

0.61

0.10

0.22

0.10

0.19

0.29


anxious

0.19

0.61

0.18

0.11


resourceful

0.68

0.16

0.14


inventive

0.64

0.11

0.15


enterprising

0.15

0.64

0.25

0.23

0.18


initiative

0.17

0.60

0.19

0.23

0.21

0.13


versatile

0.60

0.12

0.11

0.25

0.14


orderly

0.77


organized

0.76


planful

0.16

0.71

0.12

0.12


slipshod*

0.70

0.11

0.11


disorderly*

0.13

0.67


withdrawn*

0.24

0.67

0.10


silent*

0.18

0.66

0.18

0.11


shy*

0.28

0.13

0.62


inhibited*

0.11

0.27

0.18

0.62


timid*

0.17

0.16

0.61

0.17


friendly

0.31

0.11

0.16

0.58

0.16

0.15


sociable

0.29

0.18

0.25

0.57


appreciative

0.31

0.20

0.10

0.53

0.13


cheerful

0.38

0.32

0.11

0.30

0.51

0.11

0.10


jolly

0.14

0.27

0.27

0.49

0.11

0.14


intelligent

0.23

0.12

0.57

0.18


clever

0.29

0.57

0.27


rational

0.11

0.11

0.24

0.54


clearthinking

0.16

0.14

0.22

0.17

0.48

0.15


realistic

0.25

0.28

0.46


stubborn*

0.28

0.14

0.60


persistent*

0.23

0.23

0.14

0.59


headstrong

0.31

0.12

0.16

0.53


opinionated

0.24

0.23

0.13

0.12

0.14

0.50


handsome

0.14

0.12

0.12

0.15

0.72


attractive

0.11

0.13

0.15

0.70


goodlooking

0.14

0.19

0.15

0.17

0.68


sexy

0.18

0.20

0.15

0.21

0.19

0.59


charming

0.27

0.31

0.17

0.48

Although the factor loadings are not particularly large, the factor structure is clear. The blank spaces indicate factor loadings with absolute values less than 0.10. I am presenting only the largest loadings in order to avoid 218 rows of decimals. Again, the R code is so easy to copy and paste into R studio that you ought to replicate these findings. In addition, you might wish to examine the 10 factor solution. The authors of the adjective checklist believed that there were 22 subscales. I could not find them with a factor analysis.
Nonnegative Matrix Factorization of the Adjective Checklist Ratings
Nonnegative matrix factorization (NMF) can be interpreted as if it were a factor analysis. Since our factor analysis is a varimaxrotated principal component analysis, I will use the terms principal component analysis (PCA) and factor analysis interchangeably. Both PCA and NMF are matrix factorizations. For PCA, a singular value decomposition (SVD) factors the correlation matrix (R) into the product of factor loadings (F): R = FF'. NMF, on the other hand, decomposes the data matrix (V) into the product of W and H. The term "nonnegative" indicates that all the cells in all three matrices must be zero or greater. We will not see any negative factor loadings.
PCA has its own set of rules. The SVD creates a series of linear combinations of the variables that extract at each step the maximum variation possible with the restriction that each successive principal component is orthogonal to all previous linear combinations. Some of those coefficients will need to be negative in order for SVD to fulfill its mission. The varimax rotation seeks a more simple structure in which each row of factor loadings will contain as many small values as possible. However, as we saw in the above example, those loadings can be negative.
In general, NMF does not demand orthogonal rows of H. Instead, it keeps W and H nonnegative. Consequently, NMF will not extract bipolar factors such as the first factor Open from our factor analysis. That is, the Open factor is actually Open vs. SelfCentered so that one is more open if they reject egotistical as a descriptor. One scores higher on Openness by both embracing the openness adjectives and dismissing the selfcentered items. But this is not the case with NMF. As we shall see below, we will need two latent features, one for selfcentered and the other for open.
[Note: Do not be concerned if NMF takes a minute or two to find a solution when n or p or the rank are large.]
For our adjective checklist data, V is the n x p data matrix with n=433 students and p=218 items. Do we need all 218 adjectives to describe each student? For example, do we need to know your rating for handsome once we know that you consider yourself attractive? The above factor loading suggest that both items load on the same last latent variable, which I named "pretty" to keep the label short.
NMF uses the matrix H to store something like the factor loadings. However, we must remember that NMF is reproducing the data matrix and not the correlation matrix. As a result, the scaling depends on the units in the data matrix. Unlike a PCA of a correlation matrix, which returns factor loadings that are correlations, NMF generates a H matrix that needs to be rescaled for easier interpretation. There are many alternatives, but since zero is the smallest value, it makes sense to rescale H so that the highest value is one (by dividing every row of H by the maximum value in that row). It is all in the R code below, and I encourage you to copy and paste it into R studio.
Only because the checklist was designed to measure 22 subscales scores, I set the rank to 22. The R code also includes a 9 latent variable solution in order for you to make direct comparisons with the factor analysis. However, the 22 latent feature solution illustrates why one might want to replace PCA with NMF.
When printing, H is best transposed so that there are more rows than columns. Instead of the transposed H matrix with 218 rows and 22 columns, I have simply grouped the adjectives with the highest loading on each of the 22 latent features. Of course, you should use the R code listed below and print out the entire H matrix using the fa.sort() function. You are looking for simple structure, that is, a sparse matrix of variable weights. H has been transposed to make it easier to read, and the columns have been rescaled to vary between zero and one. Each column of the transposed H matrix represents a latent feature. If we are successful, those columns will contain only a few adjectives with values near one and most of the remaining values at or near zero.
Adjectives describing happy and reasonable fall into the two groupings in the first column. We find our selfcentered terms in the first grouping of the second column, but unfriendly* will not appear until the second grouping in the last column. I make no claim that 22 latent features are needed for this data matrix. It was only a test of the 22 subscales that I could not recover in the factor analysis. It seems that NMF performs better, although we are not close to identifying the hypothesized subscale pattern. In particular, instead of subscales of equal size, we find some large clusters and several twoitem pairings (e.g., unambitious* and ambitious).
Still, does not the NMF seem to capture our personality folklore? We have a lot of adjectives available to describe ourselves and others. At times, we search for just the right one. Egotistical captures the high opinion of oneself that selfcentered misses. However, the two terms appear to be synonyms in both the PCA and the NMF of this data matrix.
goodnatured

selfcentered

sexy

distrustful*

dominant

suspicious*

healthy

individualistic

flirtatious

defensive*

shy*

interests narrow*

stubborn*

egotistical

goodlooking

fickle*

silent*

intolerant*

sarcastic*

persistent*

charming

deceitful*

bossy

forgiving

contented

indifferent

handsome

calm*

outgoing

poised

appreciative

inventive

attractive

quiet*

outspoken

tolerant

wise

ingenious

practical

worrying*

talkative

steady

spineless

efficient

timid*

modest*

opinionated


snobbish*

hurried*

resentful*

interests wide


unrealistic*

creative

peculiar*

zany*

prejudiced*


realistic

irritable

original

apathetic*

rattlebrained*


selfseeking*

excitable

patient*

faultfinding*

emotional*


rational

impatient

imaginative

moderate*

hostile*

tense*

honest

argumentative

jolly

dull*

selfish*

stable

reliable

quarrelsome

dissatisfied*

shiftless*

unintelligent*

relaxed

gloomy

sociable

unfriendly*

leisurely


fairminded

selfpunishing

spontaneous

forceful

independent*


irresponsible*

anxious

initiative

arrogant

affectionate

orderly

reasonable

selfpitying

resourceful

vindictive

illiberal*

slipshod*

unscrupulous*

confident*

enthusiastic

strong

obnoxious*

precise

sincere

selfdenying

moody*

tactless

kind

disorderly*

reflective

nervous

courageous

aggressive

friendly

painstaking

logical

pessimistic

bitter*

tactful*

foolish*

organized

cynical

weak

cheerful

persevering

thankless*

conscientious

clearthinking

fearful

versatile

quitting*

cold*

planful

hasty*

selfconfident*

pleasureseeking

unkind

pleasant

methodical

intelligent

complaining

energetic

sympathetic


dependable

despondent

optimistic

reserved*


touchy*

curious

active

inhibited*

impulsive


helpful

submissive

spunky

withdrawn*

reckless


mature

cowardly

considerate

aloof*

demanding


selfcontrolled

highstrung*

adventurous


foresighted

absentminded*

unambitious*

restless*

enterprising


affected*

confused*

ambitious

prudish*


serious

forgetful*

noisy

cautious*


independent

preoccupied*

cruel*

loud

rebellious


deliberate

dreamy*

whiny

daring


responsible

distractible*

lazy*

infantile

temperamental


understanding

alert

industrious

dependent

fussy*


insightful

witty

nagging*


clever

humorous

mischievous


conceited

determined


headstrong

thorough


sharpwitted

uninhibited


boastful

Does this not look like topic modeling from Figure 4 of Lee and Seung? As you might recall, topic modeling gathers word counts from documents and not respondents. The goal is to learn what topics are covered in the documents, but all we have are words. Students are the documents, and adjectives are the words. We do not have counts of the number of times each adjective was used by each student. Instead, we have a set of ratings of the intensity with which each adjective describes the respondent. One could have conducted a similar analysis using openended text and counting words. Selfdescription can take many forms. As long as the measures are nonnegative, NMF will work, especially when the data matrix is sparse.
Try It!
This is my last appeal. R makes it so easy to fit many models to the same data. Better yet, why not try NMF with your own data? Every time you perform a factor analysis copy the half dozen lines of R code and see what a NMF yields.
PCA and factor analysis seek global dimensions explaining as much variation as possible with each successive latent variable. Subsequent rotations to simple structure help achieve more interpretable factors. Yet, as we have observed in this example, we still uncover bipolar dimensions that simultaneously push and pull by assigning positive and negative weights to many different variables at the same time. Sometimes, we want bipolar dimensions that separate the sweet from the sour but not when the objects of our study can be both sweet and sour at the same time.
NMF, on the other hand, forces us to keep the variable weighting nonnegative. The result is a type of simple structure without rotation. Consequently, local structures residing within only a small subset of the variables can be revealed. We have separated the sweet from the sour, and our individual respondents can be both friendly and selfcentered. PCA and NMF impose different constraints, and those restrictions deliver different representational systems. One might say that PCA returns latent dimensions and NMF generates latent features or categories. I will have more to say about latent dimensions and categories in later posts. Specifically, I will be reviewing the theoretical basis underlying the R package plfm and arguing that NMF produces results that look very much like probabilistic latent features.
Warning: NMF May Transpose the Data Matrix Reversing Role of W and H
While all my data matrices are structured with individuals as rows and variables as columns, this is not the way it is done in topic modeling or with gene expression data. Do not be fooled. Ask yourself, "Where are the latent variables?" When the data matrix V is individualsbyvariables, the latent variables are the rows of H defined by their relationship with the observed variables in the columns of H. When the data matrix is wordsbydocuments, the latent variables are the columns of W defined by the words in the rows of W. This is why the NMF package refers to W as the basis and H as the mixture coefficient matrix.
We can represent our data matrix in a geometric space with the respondents as points and the variables as axes or dimensions (biplots). Since p is often smaller than n (fewer variables than respondents), this space is pdimensional and the p vectors form the basis for this variable space. When the variables are features, we speak of the pdimensional feature space. Now, NMF seeks a lower dimensional representation of the data, that is, lower than p (218 items in our example). The 22 groups of items presented above is such a lower dimensional representation with a basis of rank equal to 22. We interpret this new 22dimensional basis using H because H tells us the contribution of each adjective. The adjective groupings bring together the items with the largest weights on each of dimension of the new 22dimensional basis. Thus, H is the basis matrix for our data matrix.
Do not be confused by the names used in the NMF package. With our data matrices (individualsbyvariables) the basis or "factor loadings" can be found in what NMF called the mixture coefficient matrix H. Thus, use the coef() and coefmap() functions to learn which variables "load" on which latent variables. For greater detail or to learn more about the other matrix W, you can read my earlier posts on NMF.
How Much Can We Learn from Top Rankings Using NMF?
Uncovering the Preferences Shaping Consumer Data
Taking Inventory
Customer Segmentation Using Purchase History
Exploiting Heterogeneity to Reveal Consumer Preference
R code to run all the analyses in this post:
# Attach acl data file after # mokken package installed library(mokken) data(acl) # run scree plot # print eigenvalues # principalaxis factor analysis # may need to install GPArotation library(psych) eigen<scree(acl, factors=FALSE) eigen$pcv pca<principal(acl, nfactors=9) fa.sort(pca$loadings) library(NMF) fit<nmf(acl, 9, "lee", nrun=20) h<coef(fit) max_h<apply(h,1,function(x) max(x)) h_scaled<h/max_h fa.sort(t(round(h_scaled,3))) fit22<nmf(acl, 22, "lee", nrun=20) h<coef(fit22) max_h<apply(h,1,function(x) max(x)) h_scaled<h/max_h fa.sort(t(round(h_scaled,3)))
Created by Pretty R at insideR.org
Your posts are excellent! I've been following them since I discovered them. A question on nmf: how to determine the number of hidden variables/features? In the R code that you have provided in this post as well as several posts before, this number is given. Is there a way to "automatically" select this number, just like we can select the number of clusters based on some measurements such as residual SS in clustering analysis? Thank you.
ReplyDeleteYou can find an answer in my August 29th reply to a question on my post titled "Continuous or Discrete Latent Structure?" published on August 25, 2014. It took me several minutes to find that comment again, so I can understand how it might have been missed. But I believe that it is helpful to see similar questions from others. Just one additional point, the residual SS is noise only if additional latent features are not helpful. One more cluster is welcomed if it assists in the explanation or increases sales. And thanks for your feedback.
Delete