In the last five months, I’ve done a few things that have made updating this blog a lot harder than anticipated: moved to New York, started a new job, got a dog, and discovered Kaggle. What’s Kaggle? Perhaps the best description comes from Kaggle’s own site:
Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems – the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.
Competing on Kaggle is as easy as registering and uploading a .csv. Feedback on submissions is instant, and a public leaderboard is constantly updated as participants submit. The available competitions cover a nice range of problems, including predicting insurance claims, mapping dark matter, and modeling edits to Wikipedia.
Before I had ever heard of Kaggle, back in early 2010, I participated in my first modeling/prediction contest through the now-defunct Analytics X competition. In that contest, the goal was to predict the spatial distribution of homicides in Philadelphia. I spent many hours working on a single submission for that competition, and I remember being horrified by how poorly it ranked. Participant submissions were ranked according to prediction error (RMSE, I think), and mine was abysmally high. My precious multilevel model was a total flop, and judging by the relatively low error rates from everyone else on the public leaderboard, all of my fellow competitors were using some kind of voodoo to make their predictions. Confused and dismayed, I stopped working on the Analytics X competition and focused on finishing graduate school.
My Analytics X experience continued to haunt me. What happened? I had done all the right things according to my statistics textbooks! So I started poking around for clues to my (and my model’s) failure. I began following discussions on r/MachineLearning and Cross Validated, picked up Berk’s gentle Statistical Learning from a Regression Perspective, and struggled through a lot of The Elements of Statistical Learning (free pdf here!). At some point, I stumbled upon a wonderful presentation (video link here) titled “Getting in Shape for the Sport of Data Science” by Jeremy Howard, Kaggle’s President and Chief Scientist, and things started to take shape. Armed with a few new tricks, I was determined to give the prediction game another go.
Since then, I’ve entered two Kaggle competitions (Photo Quality Prediction, which ended recently, and Don’t Get Kicked!, which is still ongoing) and have had a lot of fun. I’m still far from placing in a competition, but I’m faring much better thanks to tools and techniques from outside the standard social science toolbox.
I can’t overstate the value of approaching statistical modeling with the goal of accurate prediction rather than hypothesis testing, even if hypothesis testing is your job. The weaknesses of the traditional social science methods (linear regression, generalized linear models) become obvious very quickly. You are forced to try new techniques, scour your data for patterns, and get creative with transforming variables. You can’t get too attached to any given model; instead, you need to be critical of every aspect of it, do lots of tinkering, and be prepared to trash it if you hit a wall.
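To make that shift concrete, here’s a minimal sketch of what judging models by held-out prediction error looks like in practice. The data are simulated and the model choices are purely illustrative (not what I used in any competition), but the pattern is the same: fit on one chunk of data, score on a chunk the model has never seen.

```python
# Judge models by held-out prediction error rather than in-sample fit.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data standing in for a real competition dataset.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Hold out a test set that the models never see during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{type(model).__name__}: holdout RMSE = {rmse:.2f}")
```

A model can have beautiful in-sample fit statistics and still lose badly on the held-out set; the leaderboard only cares about the latter.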
Many Kaggle competitors submit 50+ entries over the course of a given contest. I’ve already submitted about 25 attempts in a current competition, making small changes each time. This iterative approach encourages a deeper understanding of the data than, say, testing a model’s goodness-of-fit or whether a certain coefficient is different from zero. I think this will surprise anyone who assumes that the world of machine learning and statistical prediction is all black boxes, with little concern for the underlying processes that generated the data.
My experience with Kaggle has left me wondering why predictive accuracy isn’t more important in social science. Sure, good prediction does not equal a good theory, but shouldn’t a good theory produce good predictions? Yet I can’t remember the last time I saw any kind of cross-validation in a paper. If an academic psychology journal hosted Kaggle-style competitions between different theoretical camps, I’d read it every month. I’d even pay for it! Until then, I’ll continue reading No Free Hunch.
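Postscript: for anyone who hasn’t seen it done, cross-validation really is only a few lines of code. A minimal sketch with scikit-learn, again using simulated data in place of a real dataset:

```python
# k-fold cross-validation: every observation is predicted by a model
# that never saw it during fitting.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print(f"5-fold CV RMSE: {np.mean(fold_rmse):.2f} (sd {np.std(fold_rmse):.2f})")
```

If a theory’s predictions hold up across folds like these, that seems like evidence worth reporting.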