Wednesday, May 13, 2015

What is Data Science? Can Topic Modeling Help?

Predictive analytics often serves as an introduction to data science, but it may not be the best exemplar given its long history and origins in statistics. David Blei, on the other hand, grapples with defining data science through his work on topic modeling and latent Dirichlet allocation. In Episode 10 of Talking Machines, Blei discusses his attempt to design a curriculum for the Data Science Institute at Columbia University. The interview starts at 9:20. If you do not wish to learn about David's career, you can enter the conversation at 13:10. However, you might want to listen all the way to the end because we learn a great deal about data science by hearing how topic modeling is applied across disciplines. Over time, data science will be defined by how individuals calling themselves "data scientists" change our current practices.

The R Project for Statistical Computing helps with this definition by providing access to a diverse collection of applications across fields with differing goals and perspectives. Programming forces us into the details so that we cannot simply talk in generalities. Topic modeling certainly allows us to analyze text documents, such as newspaper articles or open-ended survey comments. But what about ingredients in food recipes? Or, how does topic modeling help us understand matrix factorization? The ability to "compare and contrast" marks a higher level of learning in Bloom's taxonomy.
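To make the link between recipes, topics, and matrix factorization concrete, here is a minimal sketch under stated assumptions. It is not latent Dirichlet allocation; it uses nonnegative matrix factorization (NMF) with multiplicative updates, a related technique that also decomposes a count matrix into "topics." The recipes, ingredients, and counts are made up for illustration, and the code is plain Python with no packages so that it runs anywhere.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Sum of squared differences between v and the product w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate v (n x m) by w (n x k) times h (k x m), all nonnegative.

    Uses multiplicative updates for the squared-error objective.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update h: elementwise h * (w'v) / (w'w h)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update w: elementwise w * (v h') / (w h h')
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical recipe-by-ingredient counts (illustrative data only).
ingredients = ["flour", "sugar", "butter", "tomato", "onion", "chili"]
recipes = {
    "cookies": [2, 1, 1, 0, 0, 0],
    "cake":    [2, 2, 1, 0, 0, 0],
    "salsa":   [0, 0, 0, 3, 1, 1],
    "stew":    [0, 0, 0, 2, 2, 2],
}
v = list(recipes.values())

w, h = nmf(v, k=2)

# Rows of h play the role of topics: weights over ingredients.
for t, row in enumerate(h):
    ranked = sorted(zip(ingredients, row), key=lambda p: -p[1])
    print("topic", t, [name for name, _ in ranked[:3]])

# Rows of w show how much each recipe draws on each topic.
for name, row in zip(recipes, w):
    print(name, [round(x, 2) for x in row])
```

The rows of h correspond to what a topic model would call topics (weights over ingredients), and the rows of w to topic weights per recipe. Comparing such output with that of an LDA fit to the same counts is one way to practice the "compare and contrast" mentioned above.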

While visiting Talking Machines, you might also want to download the MP3 files for some of the other episodes. The only way to keep up with the increasing number of R packages is to understand how they fit together into some type of organizational structure, which is what a curriculum provides.

You can hear Geoffrey Hinton, Yoshua Bengio and Yann LeCun discuss the history of deep learning in Episodes 5 and 6. If nothing else, the conversation will help you keep up as R adds packages for deep neural networks and representation learning. In addition, we might reconsider our old favorites, like predictive analytics, with a new understanding. For example, what may be predictive in choice modeling might not be the individual features as given in the product description but the holistic representation as perceived by a consumer with a history of similar purchases in similar situations. We would not discover that by estimating separate coefficients for each feature as we do with our current hierarchical Bayesian models. Happily, we can look elsewhere in R for models that can learn such a product representation.

