Wednesday, May 25, 2016

Using Support Vector Machines as Flower Finders: Name that Iris!

Nature field guides are filled with pictures of plants and animals that teach us what to look for and how to name what we see. For example, a flower finder might display pictures of different iris species, such as the illustrations in the plot below. You have in hand your own specimen from your garden, and you carefully compare it to each of the pictures until you find a good-enough match. The pictures come from Wikipedia, but the data used to create the plot are from the R dataset iris: sepal and petal length and width measured on 150 flowers equally divided across three species.

I have lifted the code directly from the svm function in the R package e1071.
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y) 
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])
Created by Pretty R at

We will focus on the last block of R code that generates the metric multidimensional scaling (MDS) of the distances separating the 150 flowers calculated from sepal and petal length and width (i.e., the dist function applied to the first four columns of the iris data). Species plays no role in the MDS with the flowers positioned in a two-dimensional space in order to reproduce the pairwise Euclidean distances. However, species is projected onto the plot using color, and the observations acting as support vectors are indicated with plus signs (+).

The setosa flowers are represented by black circles and black plus signs. These points are separated along the first dimension from the versicolor species in red and virginica in green. The second dimension, on the other hand, seems to reflect some within-species sources of differences that do not differentiate among the three iris types.

We should recall that the dist function calculates pairwise distances in the original space without any kernel transformation. The support vectors, on the other hand, were identified from the svm function using a radial kernel and then projected back onto the original observation space. Of course, we can change the kernel, which defaults to "radial" as in this example from the R package. A linear kernel may do just as well with the iris data, as you can see by adding kernel="linear" to the svm function in the above code.

It appears that we do not need all 150 flowers in order to identify the iris species. We know this because the svm function correctly classifies over 97% of the flowers with 51 support vectors (also called "landmarks" as noted in my last post Seeing Similarity in More Intricate Dimensions). The majority of the +'s are located between the two species with the greatest overlap. I have included the pictures so that the similarity of the red and green categories is obvious. This is where there will be confusion, and this is where the only misclassifications occur. If your iris is a setosa, your identification task is relatively easy and over quickly. But suppose that your iris resembles those in the cluster of red and green pluses between versicolor and virginica. This is where the finer distinctions are being made.

By design, this analysis was kept brief to draw an analogy between support vector machines and finder guides that we have all used to identify unknown plants and animals in the wild. Hopefully, it was a useful comparison that will help you understand how we classify new observations by measuring their distances in a kernel metric from a more limited set of support vectors (a type of global positioning with a minimal number of landmarks or exemplars as satellites).

When you are ready with your own data, you can view the videos from Chapter 9 of An Introduction to Statistical Learning with Applications in R to get a more complete outline of all the steps involved. My intent was simply to disrupt the feature mindset that relies on the cumulative contributions of separate attributes (e.g., the relative impact of each independent variable in a prediction equation). As objects become more complex, we stop seeing individual aspects and begin to bundle features into types or categories. We immediately recognize the object by its feature configuration, and these exemplars or landmarks become the new basis for our support vector representation.

Monday, May 23, 2016

The Kernel Trick in Support Vector Machines: Seeing Similarity in More Intricate Dimensions

The "kernel" is the seed or the essence at the heart or the core, and the kernel function measures distance from that center. In the following example from Wikipedia, the kernel is at the origin and the different curves illustrate alternative depictions of what happens as we move away from zero.

At what temperature do you prefer your first cup of coffee? If we center the scale at that temperature, how do we measure the effects of deviations from the ideal level. The uniform kernel function tells us that closest to the optimal makes little difference as long as it is within a certain range. You might feel differently, perhaps it is a constant rate of disappointment as you move away from the best temperature in either direction (a triangular kernel function). However, for most us, satisfaction takes the form of exponential decay with a Gaussian kernel describing our preferences as we deviate from the very best.

Everitt and Hothorn show how it is done in R for density estimation. Of course, the technique works with any variable, not just preference or distance from the ideal. Moreover, the logic is the same: give greater weight to closer data. And how does one measure closeness? You have many alternatives, as shown above, varying from tolerant to strict. What counts as the same depends on your definition of sameness. With human vision the person retains their identity and our attention as they walk from the shade into the sunlight; my old camera has a different kernel function and fails to keep track or focus correctly. In addition, when the density being estimated is multivariate, you have the option of differential weighting of each variable so that some aspects will count a great deal and others can be ignored.

Now, with the preliminaries over, we can generalize the kernel concept to support vector machines (SVMs). First, we will expand our feature space because the optimal cup of coffee depends on more than its temperature (e.g., preparation method, coffee bean storage, variation on finest and method of grind, ratio of coffee to water, and don't forget the the type of bean and its processing). You tell me the profile of two coffees using all those features that we just enumerated, and I will calculate their pairwise similarity. If their profiles are identical, the two coffees are the same and centered at zero. But if they are not identical, how important are the differences? Finally, we ought to remember that differences are measured with respect to satisfaction, that is, two equally pleasing coffees may have different profiles but the differences are not relevant.

As the Mad Hatter explained in the last post, SVMs live in the observation space, in this case, among all the different cups of coffees. We will need a data matrix with a bunch of coffees for taste testing in the rows and all those features as columns, plus an additional column with a satisfaction rating or at least a thumbs-up or thumbs-down. Keeping it simple, we will stay with a classification problem distinguishing good from bad coffees. Can I predict your coffee preference from those features? Unfortunately, individual tastes are complex and that strong coffee may be great for some but only when hot. What of those who don't like strong coffee? It is, as if, we had multiple configurations of interacting nonlinear features with many more dimensions than can be represented in the original feature space.

Our training data from the taste tests might contain actual coffees near each of these configurations differentiating the good and the bad. These are the support vectors of SVMs, what Andrew Ng calls "landmarks" in his Coursera course and his more advanced class at Stanford. In this case, the support vectors are actual cups of coffee that you can taste and judge as good or bad. Chapter 9 of An Introduction to Statistical Learning will walk you through the steps, including how to run the R code, but you might leave without a good intuitive grasp of the process.

It would help to remember that a logistic regression equation and the coefficients from a discriminant analysis yield a single classification dimension when you have two groupings. What happens when there are multiple ways to succeed or fail? I can name several ways to prepare different types of coffee, and I am fond of them all. Similarly, I can recall many ways to ruin a cup of coffee. Think of each as a support vector from the training set and the classification function as a weighted similarity to instances from this set. If a new test coffee is similar to one called "good" from the training data, we might want to predict "good" for this one too. The same applies to coffees associated with the "bad" label.

The key is the understanding that the features from our data matrix are no longer the dimensions underlying this classification space. We have redefined the basis in terms of landmarks or support vectors. New coffees are placed along dimensions defined by previous training instances. As Pedro Domingos notes (at 33 minutes into the talk), the algorithm relies on analogy, not unlike case-based reasoning. Our new dimensions are more intricate compressed representations of the original features. If this reminds you of archetypal analysis, then you may be on the right track or at least not entirely lost.

Monday, May 9, 2016

The Mad Hatter Explains Support Vector Machines

"Hatter?" asked Alice, "Why are support vector machines so hard to understand?" Suddenly, before you can ask yourself why Alice is studying machine learning in the middle of the 19th century, the Hatter disappeared. "Where did he go?" thought Alice as she looked down to see a compass painted on the floor below her. Arrows pointed in every direction with each one associated with a word or phrase. One arrow pointed toward the label "Tea Party." Naturally, Alice associated Tea Party with the Hatter, so she walked in that direction and ultimately found him.

"And now," the Hatter said while taking Alice's hand and walking through the looking glass. Once again, the Hatter was gone. This time there was no compass on the floor. However, the room was filled with characters, some that looked more like Alice and some that seemed a little closer in appearance to the Hatter. With so many blocking her view, Alice could see clearly only those nearest to her. She identified the closest resemblance to the Hatter and moved in that direction. Soon she saw another that might have been his relative. Repeating this process over and over again, she finally found the Mad Hatter.

Alice did not fully comprehend what the Hatter told her next. "The compass works only when the input data separates Hatters from everyone else. When it fails, you go through the looking glass into the observation space where all we have is resemblance or similarity. Those who know me will recognize me and all that resemble me. Try relying on a feature like red hair and you might lose your head to the Red Queen. We should have some tea with Wittgenstein and discuss family resemblance. It's derived from features constructed out of input that gets stretched, accentuated, masked and recombined in the most unusual ways."

The Hatter could tell that Alice was confused. Reassuringly, he added, "It's a acquired taste that takes some time. We know we have two classes that are not the same. We just can't separate them from the data as given. You have to look it in just the right way. I'll teach you the Kernel Trick." The Mad Hatter could not help but laugh at his last remark - looking at in in just the right way could be the best definition of support vector machines.

Note: Joseph Rickert's post in R Bloggers shows you the R code to run support vector machines (SVMs) along with a number of good references for learning more. My little fantasy was meant to draw some parallels between the linear algebra and human thinking (see Similarity, Kernels, and the Fundamental Constraints on Cognition for more). Besides, Tim Burton will be releasing soon his movie Alice Through the Looking Glass, and the British Library is celebrating 150 years since the publication of Alice in Wonderland. Both Alice and SVMs invite you to go beyond the data as inputted and derive "impossible" features that enable differentiation and action in a world at first unseen.