Don’t use the formula interface with randomForest()

Recently I was using randomForest() in R with a dataset containing about 8000 predictors. With smaller datasets, I could get away with using the formula interface (e.g., y ~ x1 + x2 … + x5) when fitting a model, but in this case, I couldn’t even get randomForest() to finish after handing it a giant formula.

Some googling led me to a great post about randomForest() by Chris Raimondi on the Heritage Health Prize forums that advised against the formula interface. Some posts at StackOverflow gave similar admonishments. And, of course, the manual for randomForest() suggests against it in cases with many predictors.

The alternative to the formula interface is to supply your response and predictors separately using the’ y=’ and ‘x=’ arguments and column indices. The toy example below demonstrates the gains from using this alternative. With 1000 predictors, randomForest() finishes in about 28 seconds with the formula interface and just under 6 seconds using the indexing alternative!

Try the code below to see for yourself:

library(randomForest)
 
# Create a dataframe with a categorical response and 1000 random continuous predictors
data <- data.frame(y=gl(2, 50, labels = 0:1),
                   matrix(rnorm(1e05), nrow=length(y), 
                   ncol=1000)) 
 
# Add some names
names(data) <- c("y", paste("X", c(1:1000), sep=""))
 
# Create a formula to describe the model. my.rf.formula is 'y ~ X1 + X2 + .... X999 + X1000'
my.rf.formula <- as.formula(paste("y ~ ", paste(names(data)[-1], collapse=" + "), sep="")) 
 
# First, run randomForest with the formula interface
system.time(rf.1 <- randomForest(my.rf.formula, mtry=10, ntree=2000, data=data, type="classification"))
# elapsed time: 28.24 s
 
# Now try it using column indicies 
system.time(rf.2 <- randomForest(y=data[,1], x=data[,-1], mtry=10, ntree=2000, type="classification"))
# elapsed time: 5.97 s

My magical Baader-Meinhof moment at the Fulton St. station

As I left the office this evening and headed towards the subway station, I turned on the most recent episode of one of my favorite new podcasts: Philosophy Bites. Very early in the discussion between the hosts, the guest philosopher Alain de Botton introduces the Greek word akrasia:

The ancient greeks have this wonderful word, akrasia – weakness of will. Arguments and positions can be very interesting and convincing, but we all suffer from weakness of will. And to combat weakness of will, you need training – mental training.

I hear this as I’m walking down the staircase into the subway station, half of my mind attending to the discussion and the other half trying to visualize the word akrasia in Greek letters. I pop through the turnstiles, peek down the tracks, then turn towards the wall behind me to see this:

Akrasia

Bizarre. This is easily my best personal example of the Baader-Meinhof Phenomenon.

How I Didn’t Do It: Finishing 36th in the Don’t Get Kicked! Competition.

My first run in a Kaggle competition finally ended recently with the closing of the “Don’t Get Kicked!” contest. It was an excellent learning experience and I had a lot of fun in the process (at the expense of sleep). I was pleasantly surprised to have ended up in 36th place out of 588 teams.

The task in this competition was to predict “bad buys” among thousands of cars purchased through online auctions. Contestants were given information on attributes such as purchase price, make and model of the car, trim level, odometer reading, date of purchase, state of origin, and so on. The original/training data included about 40 variables (including the “bad buy” outcome) on roughly 70,000 cars, and the test data had the same information on about 40,000 cars (minus the outcome). Contestants submit predictions, in the form of probabilities, about which of these 40,000 cars will turn out to be bad buys. Predictions are instantly scored against reality, and each set of predictions is scored using the Gini coefficient.

I submitted 30 sets of predictions over the course of the competition, using a variety of methods. Below I’ve plotted my results over time, along with the type of method behind each submission. The dashed grey line along the top is the score of the winning submission.

full size

Notice the big jump around October 10th? That was due to a novel insight: reading the competition directions. I had been submitting predicted classes (0s and 1s), not predicted probabilities between 0 and 1. Here’s the same plot, excluding the erroneous submissions:

full size

My best score resulted from a combination of a lot of boosted trees, random forests, and additive models. I used a weighted average of the predictions from each method, choosing the weights using cross-validation. Everything was done in R: boosting with the gbm package, random forests with the randomForests package, and additive models with the gam package.

My basic strategy was to create lots of plots and descriptive statistics, then create useful transformations and combinations of variables that could be fed into models as predictors. My predictor set grew from about 10 to 51 over the course of the competition:

full size

I had created about 150 predictors in the end but only about 50 proved to be useful. As you can see below, boosting was superior to most other methods given the same predictors.

full size

While a lot of fun, I’m glad to take a break from Kaggling this weekend and look forward to seeing what the winners did!

Struck by Kaggle

In the last five months, I’ve done few things that have made updating this blog a lot harder than anticipated: moved to New York, started a new job, got a dog, and discovered Kaggle. What’s Kaggle? Perhaps the best description comes from the Kaggle’s own site:

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems – the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

Competing on Kaggle is as easy as registering and uploading a .csv. Feedback on submissions is instant, and a public leaderboard constantly updated as participants submit. The available competitions provide nice range of problems to tackle, including predicting insurance claims, mapping dark matter, and modeling edits to Wikipedia.

Before I had ever heard of Kaggle, back in early 2010, I participated in my first modeling/prediction contest through the now-defunct Analytics X competition. In that contest, the goal was to predict the spatial distribution of homicides in Philadelphia. I spent many hours working on a single submission for that competition, and I remember being horrified by how poorly it ranked. Participant submissions were ranked according to the amount of prediction error (RMSE, I think), and mine was abysmally low.  My precious mutlilevel model was a total flop, and based on the relatively low error rates from everyone else on the public leaderboard, all of my fellow competitors were using some kind of voodoo to make their predictions. Confused and dismayed, I stopped working on the Analytics X competition and focused on finishing graduate school.

My Analytics X experience continued to haunt me. What happened? I did all of the right things according to my statistics textbooks! So, I started poking around for clues to my (and my model’s) failure. I began following discussions at r/MachingLearning and Cross-Validated, picked up Berk’s gentle Statistical Learning from a Regression Perspective, struggled through a lot of The Elements of Statistical Learning (free pdf here!).  At some point, I stumbled upon a wonderful presentation (video link here) titled “Getting in Shape for the Sport of Data Science” by Jeremy Howard, Kaggle’s President and Chief Scientist, and things started to take shape. Armed with a few new tricks, I was determined to give the prediction game another go.

Since then, I’ve entered two Kaggle competitions (the Photo Quality Prediction, which ended recently, and Don’t Get Kicked!, which is still ongoing) and have had a lot of fun. I’m very far from even placing a competition, but I’m fairing much better thanks to the tools and techniques outside of the standard social science toolbox.

I can’t overstate the value of approaching statistical modeling with the goal of accurate prediction rather than hypothesis testing, even if hypothesis testing is your job. The weaknesses of the traditional social science methods (linear regression, generalized linear models) become obvious very quickly. You are forced to try new techniques, scour your data for patterns, and get creative with transforming variables. You can’t get too attached to any given model. Instead, you need to be critical with every aspect of your model, do lots of tinkering, and be prepared to trash it if you hit a wall.

Many Kaggle competitors submit 50+ entries throughout a given contest. I’ve already submitted about 25 attempts in a current competition, making small changes each time. This iterative approach encourages a deeper level of understanding of the data than, say, testing a model’s goodness-of-fit or  whether a certain coefficient was different from zero. I think this is counterintuitive to those who assume that the world of machine learning and statistical prediction is all black boxes without much concern for the underlying processes that generated the data.

My experience with Kaggle has left me wondering why predictive accuracy isn’t more important in social science. Sure, good prediction does not equal a good theory, but shouldn’t a good theory result in good predictions? Yet I can’t remember the last time I ever saw any kind of cross-validation in a paper. If there was an academic psychology journal that hosted Kaggle-style competitions between different theoretical camps, I’d read it every month. I’d even pay for it! Until then, I’ll continue reading No Free Hunch.

 

 

 

The Garmin Forerunner 305 – it’s not just for running.

I was lucky enough to receive the awesome gift of a Garmin Forerunner 305 GPS watch recently, and it quickly became an essential piece of running gear. It collects all sorts of fun data from each workout, but my favorite feature is the heartrate monitor, which wirelessly transmits heartrate information to the watch from an elastic band worn on the torso. Having such objective feedback about how hard the heart is working can be very useful info mid-run, but why limit the Garmin 305 to my runs?

I’ve been wanting to play around with the Garmin data in R anyway, so for a fun project, I tracked my heartrate during my oral defense of my dissertation last week. I started tracking about 50 minutes before the defense started and ended shortly after the defense, which lasted a little over two hours. The Garmin 305 saves all workout data as .tcx files, and I used the free software TCX Converter to convert my “workout” file to a .csv. From there, it was pretty straightforward to plot everything out (I used functions from Hadley Wickham’s stringr package and a smoother from Frank Harrell’s Hmisc package; R code below, formatted with Pretty R). Everything went pretty smoothly during the defense (the proposal is generally much tougher, I’m told), and it looks like I calmed down and settled in after the 25 minutes or so. (Edit: My resting heartrate is just under 60 bpm.) Perhaps the most stressful part of the whole process was getting the projector to cooperate with my poor old work laptop for the 20 minutes prior to the presentation. Subjectively, I felt relatively calm during while talking, but I was surprised to see such a drop in my heartrate directly after it was finished.

I really wish I had this kind of data for all my past talks and presentations, because I’m sure my heartrate was skyrocketing during these kind of things five years ago. Training works!

library(stringr)
library(Hmisc)

defense <- read.table("defense.csv", header=T, sep=",")

## Time is in a big string, so str_sub from the stringr library
## pulls out hours, minutes, and seconds
defense$hour <- as.numeric(str_sub(defense$TIME, start=12, end=13))
defense$minute <- as.numeric(str_sub(defense$TIME, start=15, end=16))
defense$second <- as.numeric(str_sub(defense$TIME, start=18, end=19))

## Combine three time variables to create a "total minutes passed" variable
## Time started in the 17:00 hour 
defense$totalminutes <- (defense$hour - 17)*60 + defense$minute + defense$second/60

par(mar=c(5,4,3,2))
plot(defense$totalminutes, defense$HR, pch=16, col="grey95", axes=F,
    ylim=c(50,115), xlim=c(20, 202), ylab="Heartrate (bpm)", xlab="Time\n(Hours:Minutes)")

# Create custom axes, labels and tickmarks
x.axis.ticks <- c(14, 24, 44, 64, 84, 104, 124, 144, 164, 184, 194, 204)
x.axis.labels <- c(" ", "-0:40m", "-0:20m", "0:00", "0:20","0:40","1:00","1:20",
                   "1:40", "2:00", " ", "2:20")
y.axis.ticks <- c(50, 60, 70, 80, 90, 100, 110)
y.axis.labels <- c(50, 60, 70, 80, 90, 100, 110)
axis(side=2, pos=14, las=2,lty=1, lwd=.5, at=y.axis.ticks,
     labels=y.axis.labels, tck=-.02, cex.axis=1.0)
axis(side=1, pos=50, las=0,lty=1, lwd=.5, at=x.axis.ticks,
     labels=x.axis.labels, tck=-.02, cex.axis=1.0)

# Two vertical gridlines with labels
segments(x0=64, x1=64, y0=50, y1=110, lwd=2, col="grey60", lty=1)
text(x=64, y=113, "Defense\nstarts", cex=.9)
segments(x0=194, x1=194, y0=50, y1=110, lwd=2, col="grey60", lty=1)
text(x=194, y=113, "Defense\nends", cex=.9)

# Create horizontal gridlines
for (i in seq(60,110,10))
  {
    segments(x0=14, x1=204, y0=i, y1=i, lwd=1, col="grey70", lty=1)
  }

# Just for fun, bootstrap a loess smoother (using plsmo() from Hmisc library)
for (i in 1:1000)
  {
    myboot <- sample(1:nrow(defense), replace=TRUE)
    bootdata <- defense[myboot,]
    plsmo(bootdata$totalminutes, bootdata$HR, col="grey70", f=.05, add=TRUE)
  }

# Draw one overlapping thick red smoothed line
plsmo(x=defense$totalminutes, y=defense$HR, col="red3",lty=1,
        lwd=5, f=.05, add=TRUE, datadensity=F)

# Label various events

arrows(x0=34, y0=95, x1=15, y1=83, lwd=2, angle=20, length=.1)
text(x=34, y=97, "Arrive at office", cex=.7)

arrows(x0=36, y0=57, x1=36, y1=62, lwd=2, angle=20, length=.1)
text(x=36, y=54, "Projector\nbroken...", cex=.7