Saturday, May 30, 2015

Top of the Heap: How Much Can We Learn from Partial Rankings?

The recommendation system gives you a long list of alternatives, but the consumer clicks on only a handful: most appealing first, then the second best, and so on until they stop with all the remaining receiving the same rating as not interesting enough to learn more. As a result, we know the preference order for only the most preferred. Survey research may duplicate this process by providing many choices and asking only for your top three selections - your first, second and third picks. This post will demonstrate that by identifying clusters with different ranking schemes (mixtures of rankings), we can learn a good deal about consumer preferences across all the alternatives from observing only a limited ordering of the most desired (partial top-K rankings).

However, we need to remain open to the possibility that our sample is not homogeneous but contains mixtures of varying ranking schemes. To be clear, the reason for focusing on the top-K rankings is because we have included so many alternatives that no one person will be able to respond to all of them. For example, the individual is shown a screen filled with movies, songs, products or attributes and asked to pick the best of the list in order of preference. Awareness and familiarity will focus attention on some subset, but not the same subset for everyone. We should recall that N-K of the possible options will not be selected and thus be given zeros. Consequently, with individuals in rows and alternatives as columns, no one should be surprised to discover that the data matrix has a blockcluster appearance (as in the R package with the same name).
To see how all this works in practice, we begin by generating complete ranking data using the simulISR( ) function from the R package Rankcluster. The above graphic, borrowed from Wikipedia, illustrates the Insertion Sort Ranking (ISR) process that Rankcluster employs to simulate rankings. We start with eight objects in random order and sort them one at a time in a series of paired comparisons. However, the simulation function from Rankcluster allows us to introduce heterogeneity by setting a dispersion parameter called pi. That is, we can generate a sample of individuals sharing a common ranking scheme, yet with somewhat different observed rankings from the addition of an error component.

As an example, everyone intends to move #7 to be between #6 and #8, but some proportion of the sample may make "mistakes" with that proportion controlled by pi. Of course, the error could represent an overlap in the values associated with #6 and #7 or # 7 and #8 so that sometimes one looks better and other times it seems the reverse (sensory discrimination). Regardless, we do not generate a set of duplicate rankings. Instead, we have a group of ranks distributed about a true rank. The details can be found in their technical paper.

You will need to install the Rankcluster and NMF packages in order to run the following R code.

# Rankcluster needed to simulate rankings
library(Rankcluster)
 
# 100 respondents with pi=0.90
# who rank 20 objects from 1 to 20
rank1<-simulISR(100, 0.90, 1:20)
 
# 100 respondents with pi=0.90
# who rank 20 object in reverse order
rank2<-simulISR(100, 0.90, 20:1)
 
# check the mean rankings
apply(rank1, 2, mean)
apply(rank2, 2, mean)
 
# row bind the two ranking schemes
rank<-rbind(rank1,rank2)
 
# set ranks 6 to 20 to be 0s
top_rank<-rank
top_rank[rank>5]<-0
 
# reverse score so that the
# scores now represent intensity
focus<-6-top_rank
focus[focus==6]<-0
 
# use R package NMF to uncover
# mixtures with different rankings
library(NMF)
fit<-nmf(focus, 2, method="lee", nrun=20)
 
# the columns of h transposed
# represent the ranking schemes
h<-coef(fit)
round(t(h))
 
# w contains the membership weights
w<-basis(fit)
 
# hard clustering
type<-max.col(w)
 
# validates the mixture model
table(type,c(rep(1,100),rep(2,100)))

Created by Pretty R at inside-R.org

We begin with the simulIST( ) function simulating two sets of 100 rankings each. The function takes three arguments: the number of rankings to be generated, the value of pi, and the rankings listed for each object. The sequence 1:20 in the first ranking scheme indicates that there will be 20 objects ordered from first to last. Similarly, the sequence 20:1 in the second ranking scheme inputs 20 objects ranked in reverse from last to first. We concatenate data produced by the two ranking schemes and set three-quarters of the rankings to 0 as if only the top-5 rankings were provided. Finally, the scale is reversed so that the non-negative values suggest greater intensity with five as the highest score.

The R package NMF performs the nonnegative matrix factorization with the number of latent features set to two, the number of ranking schemes generating the data. I ask that you read an earlier post for the specifics of how to use the R package NMF to factor partial top-K rankings. More generally though, we are inputting a sparse data matrix with zeros filling 75% of the space. We are trying to reproduce that data (labeled V in the diagram below) by multiplying two matrices. One has a row for every respondent (w in the R code), and the other has a column for every object that was ranked (h in the R code). What links these two matrices is the number of latent features, which in this case happens also to be two because we simulated and concatenated two ranking schemes.



Let us say that we placed 20 bottles of wine along a shelf so that the cheapest was in the first position on the left and the most expensive was last on the shelf on the far right. These are actual wines so that most would agree that the higher priced bottles tended to be of higher quality. Then, our two ranking schemes could be called "price sensitivity" and "demanding palette" (feel free to substitute less positive labels if you prefer). If one could only be Price Sensitive or Demanding Palette and nothing in between, then you would expect precisely 1 to 20 and 20 to 1 rankings for everyone in each segment, respectively, assuming perfect knowledge and execution. That is, some of our drinkers may be unaware that #16 received a higher rating than #17 or simply give it the wrong rank. This is encoded in our pi parameter (pi=0.90 in this simulation). Still, if I knew your group membership and the bottle's position, I could predict your ranking with some degree of accuracy varying with pi.

Nonnegative matrix factorization (NMF) seeks to recover the latent features separating the wines and the latent feature membership for each drinker from the data matrix, which you recall does not contain complete rankings but only the partial top-K. Since I did not set the seed, your results will be similar, but not identical, to the following decomposition.

Columns
h
Demanding Palette
Price Sensitivity
Rows
w
Demanding Palette
Price Sensitivity
C1
0
368
R1
0.00000
0.01317
C2
0
258
R2
0.00100
0.00881
C3
0
145
R3
0.00040
0.00980
C4
4
111
R4
0.00105
0.00541
C5
18
68
R5
0.00000
0.01322
C6
49
80
R6
0.00000
0.01207
C7
33
59
R7
0.00291
0.00541
C8
49
61
R8
0.00361
0.00416
C9
45
50
R9
0.00242
0.01001
C10
112
31
.
.
.
C11
81
30
.
.
.
C12
63
9
.
.
.
C13
79
25
R193
0.01256
0.00000
C14
67
18
R194
0.00366
0.00205
C15
65
28
R195
0.01001
0.00030
C16
79
28
R196
0.00980
0.00000
C17
85
14
R197
0.00711
0.00028
C18
93
5
R198
0.00928
0.00065
C19
215
0
R199
0.01087
0.00000
C20
376
0
R200
0.01043
0.00000

The 20 columns from transposed h are presented first, and then the first few rows followed by the last rows from w. These coefficients will reproduce the data matrix, which contains numbers from 0 to 5. For instance, the reproduced score for the first respondent for the first object is 0*0.00000 + 386*0.01317 = 4.84656 or almost 5, suggesting that they most prefer the cheapest wine. In a similar fashion, the last row, R200, gives greater weight to the first column, and the first column seems to prefer the higher end of the wine continuum.

Clearly, there are some discrepancies toward the middle of the wine rankings, yet the ends are anchored. This makes sense given that we have data only on the top-5 rankings. Our knowledge of the ten objects in the middle comes solely from the misclassification when making pairwise comparisons set by pi=0.90. In the aggregate we seem to be able to see some differentiation even when we did not gather any individual data after the Kth position. Hopefully, C1 represents wine in a box and C20 is a famous vintage from an old village with a long wine history, making our interpretation of the latent features easier for us.

When I run this type of analysis with actual marketing data, I typically uncover many more latent features and find respondents with sizable membership weightings spread across several of those latent features. Preference for wine is based on more than a price-quality tradeoff, so we would expect to see other latent features accounting for the top-5 rankings (e.g., the reds versus the whites). The likelihood that an object makes it into the top-5 selection is a decreasing function of its rank order across the entire range of options so that we might anticipate some differentiate even when the measurement is as coarse as a partial ranking. NMF will discover that decomposition and reproduce the original rankings as I have shown with the above example. It seems that there is much we can learn for partial rankings.

No comments:

Post a Comment