Wednesday, July 15, 2015

Seeing Data as the Product of Underlying Structural Forms

Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects arrayed along the top in columns. Neither the objects nor the measurements need to be preserved in their original form.

It is helpful to remember that the entries in our data matrix are the result of choices made earlier for everything that can be recorded is not tallied. We must decide on the unit of analysis (What objects go in the rows?) and the level of detail in our measurements (What variables go in the columns?). For example, multilevel data can be aggregated as we deem appropriate, so that our objects can be classroom means rather than individual students nested within classrooms and our measures can be total number correct rather than separate right and wrong for each item on the test.

Even without a prior structure, one could separately cluster the rows and the columns using two different distance matrices, that is, a clustering of the rows/columns with distances calculated using the columns/rows. As an example, we can retrieve a hierarchical cluster heat map from an earlier post created using the R heatmap function.

Here, yellow represents a higher rating, and lower ratings are in red. Objects with similar column patterns would be grouped together into a cluster and treated as if they constitute a single aggregate object. Thus, when asked about technology adoption, there appears to be a segment toward the bottom of the heatmap who foresee a positive return to investment and are not concerned with potential problems. The flip side appears toward the top with the pattern of higher yellows and lower reds suggesting more worries than anticipated joys. The middle rows seems to contain individuals falling somewhere between these two extremes.

A similar clustering could be performed for the columns. Any two columns with similar patterns down the rows can be combined and an average score calculated. We started with 8 columns and could end with four separate 2-column clusters or two separate 4-column clusters, depending on where we want to draw the line. A cutting point for the number of row clusters seems less obvious, but it is clear that some aggregation is possible. As a result, we have reordered the rows and columns, one at a time, to reveal an underlying structure: technology adoption is viewed in terms of potential gains and losses with individuals arrayed along a dimension anchored by gain and loss endpoints. 

Before we reach any conclusion concerning the usefulness of this type of "dual" clustering, we might wish to recall that the data come from a study of attitudes toward technology acceptance with the wording sufficiently general that every participant could provide ratings for all the items. If we had, instead, asked about concrete steps and concerns toward actual implementation, we might have found small and large companies living in different worlds and focusing on different details. I referred to this as local subspaces in a prior post, and it applies here because larger companies have in-house IT departments and IT changes a company's point-of-view.

To be clear, conversation is filled with likes and dislikes supported by accompanying feelings and beliefs. This level of generality permits us to communicate quickly and easily. The advertising tells you that the product reduces costs, and you fill in the details only to learn later that you set your cost-savings expectations too high.

We need to invent jargon in order to get specific, but jargon requires considerable time and effort to acquire, a task achieved by only a few with specialized expertise (e.g., the chief financial officer in the case of cost cutting). The head of the company and the head of information technology may well be talking about two different domains when each speaks of reliability concerns. If we want the rows of our data matrix to include the entire range of diverse players in technological decision making while keeping the variables concrete, then we will need a substitute for the above "dual" scaling.

We may wish to restate our dilemma. Causal models, such as the following technology acceptance model (TAM), often guide our data collection and analysis with all the inputs measured as attitude ratings.

Using a data set called technology from the R package plspm, one can estimate and test the entire partial least squares path model. Only the latent variables have been shown in the above diagram (here is a link to a more complete graphic), but it should be clear from the description of the variables in the technology data set that Perceived Usefulness (U) is inferred from ratings of useful, accomplish quickly, increase productivity and enhance effectiveness. However, this is not a model of real-world adoption that depends on so many specific factors concerning quality, cost, availability, service, selection and technical support. The devil is in the details, but causal modeling struggles with high dimensional and sparse data. Consequently, we end up with a model of how people talk about technology acceptance and not a model of the adoption process pushed forward by its internal advocates and slowed by its sticking points, detours and dead ends.

Yet, the causal model is so appealing. First comes perception and then follows intention. The model does all the heavy lifting with causation asserted by the path diagram and not discovered in the data. All I need is a rating scale and some general attitude statements that everyone can answer because they avoid any specifics that might result in a DK (don't know) or NA (not applicable). Although ultimately the details are where technology is accepted or rejected, they just do not fit into the path model. We would need to look elsewhere in R for a solution.

As I have been arguing for some time in this blog, the data that we wish to collect ought to have concrete referents. Although we begin with detailed measures, it is our hope that the data matrix can be simplified and expressed as the product of underlying structural forms. At least with technology adoption and other evolving product categories, there seems to be a linkage between product differentiation and consumer fragmentation. Perhaps surprisingly, matrix factorization with its roots as a computational routine from linear algebra seems to be able to untangle the factors responsible for such coevolving networks.


  1. Idea: you might be interested in the recent d3heatmap R package.

    1. Thanks for the recommendation. I saw Joe Cheng's R-bloggers post for d3heatmap last month, but I did not install the package until seeing your comment. Heatmaps are designed to visualize the data matrix and reveal the underlying structure. The interactive viewer in d3heatmap is a major addition.