Monday, September 22, 2014

What is Cluster Analysis? A Projective Test

Supposedly, projective tests (e.g., the inkblots of psychoanalysis) contain sufficient ambiguity that "what you see" reveals some aspect of your thinking that has escaped your awareness. Although the following will provide no insight into your neurotic thoughts or feelings, it might help separate two different ways of performing and interpreting cluster analysis.

A light pollution map of the United States, a picture at night from a satellite orbiting the earth, is shown below.

Which of the following two representations more closely matches the way you think of this map?

Do you consider population density to be the mixture of distributions represented by the red spikes in the first option?

Or perhaps this mixture model is too passive for you, so that you prefer the air traffic representation in the second option showing separate airplane locations at some point in time.

The mclust package in R provides the more homeostatic first representation using density functions. Because mclust adjusts the shape of each normal distribution in the mixture, one can model the Northeast corridor from Boston to Philadelphia with a single cluster. Moreover, the documentation enables you to perform the analysis without excessive pain and to understand how finite mixture models work. If you need a video lecture on Gaussian mixtures, MathematicalMonk (aka Jeff Miller) on YouTube is the place to start.
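For readers who want to see the moving parts of a finite mixture, here is a minimal sketch in plain Python rather than R. It is not mclust's implementation: it handles only one dimension, where the "shape" of each normal distribution reduces to its variance, and it fixes the number of components instead of choosing among models. The names `normal_pdf` and `fit_mixture`, the quantile-based starting values, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(xs, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances). Illustrative only.
    """
    n = len(xs)
    ordered = sorted(xs)
    # Assumed starting values: means at evenly spaced quantiles,
    # every component starts with the overall variance and equal weight.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-6)
    variances = [grand_var] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: how strongly does each component claim each point?
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])

        # M-step: re-estimate weight, mean, and variance from the soft claims.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing

    return weights, means, variances


# Simulated data: a tight cluster near 0 and a wider, smaller one near 10.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(10, 2) for _ in range(100)]
weights, means, variances = fit_mixture(xs, k=2)
print(weights, means, variances)
```

Because each component keeps its own variance, a wide cluster and a narrow one can sit side by side, which is the one-dimensional analogue of letting a single elongated cluster cover the Northeast corridor.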

On the other hand, if airplanes can be considered as messages passed between nodes with greater concentrations (i.e., cities with airports), then the R package performing affinity propagation, apcluster, offers the more "self-organizing" model shown in the second option with many possible ways of defining similarity or affinity. Ease of use should not be a problem with a webinar, a comprehensive manual, and a link to the original Science article. However, the message propagation algorithm requires some work to comprehend the details. Fortunately, one can run the analysis, interpret the output, and know enough not to make any serious mistakes without all the computational intricacies.
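The message passing can be sketched in a few lines of plain Python. This is not the apcluster code, only a bare-bones rendering of the responsibility and availability updates described in the Science article, with a fixed number of iterations and no convergence check. The function name `affinity_propagation`, the toy one-dimensional points, the negative squared distance as similarity, and the preference value of -10 are assumptions chosen for illustration.

```python
def affinity_propagation(S, damping=0.5, iters=200):
    """Run damped affinity propagation on a similarity matrix S (list of lists).

    S[k][k] holds the "preference" of point k to serve as an exemplar.
    Needs at least two points. Returns (exemplars, labels).
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)

    for _ in range(iters):
        # Responsibility: how well suited k is as exemplar for i,
        # relative to i's best competing candidate.
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            first = vals[best]
            second = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                new = S[i][k] - (second if k == best else first)
                R[i][k] = damping * R[i][k] + (1 - damping) * new

        # Availability: accumulated evidence from other points that k
        # would make a good exemplar.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            pos[k] = R[k][k]  # self-responsibility enters unclipped
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - R[k][k]
                else:
                    new = min(0.0, total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], []
    # Each exemplar labels itself; every other point joins its most similar exemplar.
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy data: two groups of three points on a line.
points = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0]
preference = -10.0  # assumed value; lower preferences yield fewer exemplars
S = [[preference if i == j else -(a - b) ** 2 for j, b in enumerate(points)]
     for i, a in enumerate(points)]
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

The similarity matrix is the only input, which is why so many definitions of affinity are possible: swap in a different `S` and the same messages do the rest.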

And the true representation is? As a marketer, I see it as a dynamic process with concentrations supported by the seaports, rivers, railroad tracks, roads, and airports that served commerce over time. Population clusters continually evolve (e.g., imagine Las Vegas without air travel). They are not natural kinds revealed by carving nature at its joints. Diversity comes in many shapes and forms, each requiring its own model with its unique assumptions concerning the underlying structures. More importantly, cluster analysis serves many different purposes with each setting its own criteria. Haven't we learned that one size does not fit all?

1 comment:

  1. -- Haven't we learned that one size does not fit all?

    No. Nor will we. The goal of "marketing", aka Jobs' Reality Distortion Field, is to convince diverse individuals/groups that some fixed widget is just what they need. There's always more moolah to be made shifting 1,000,000 units of a standard widget than 100,000 of 10 (sufficiently) different widgets. I suppose this notion of "diversity" started, most recently, with the myth of the Long Tail. Amazon has discovered that it just doesn't work.