Heatmap of the hottest final digits of historical football scores.

I heard about a friendly office pool based on the outcome of the Super Important Internationally Televised Football Championship happening this weekend. Rather than betting on the actual winner of the game, everyone picks a handful of combinations of two single digits, like (2, 1) or (9, 3). No two people can pick the same combination, but if I choose (2, 1), then (1, 2) is still up for grabs. Each digit corresponds to the last digit of each team’s final score, so if I choose (2, 1) and the final score is 42-21, I win.

It’s reasonable to think that some combinations are more likely than others, but since I’m not very familiar with football, I didn’t have a good sense of what combinations would make the best bets. Naturally, I had to throw a bunch of data into R and make a graph! Below you can see the probabilities of all possible combinations. The data is from 14,417 professional football scores found on http://www.pro-football-reference.com.

Not all of the scores are from NFL games — there are some scores from the APFA, the AAFC, and the AFL in here, too, which may add a few odd outcomes if those leagues had different scoring systems. Still, the results look reasonable at first glance. Scores ending in 3, 4, and 7 seem to be the best bets. That (7, 0) is so hot right now. (7, 0). R code is below.

FinalDigitsPlot

 

library(stringr)
library(ggplot2)

data <- read.table("nflscores.csv", header=T, sep=",") # all scores grabbed from http://www.pro-football-reference.com

## This is gonna get ugly

# Split each "W-L" score string into two numeric columns (winning and losing points)
data <- cbind(data, t(matrix(as.numeric(unlist(str_split(data$Score, pattern="-"))), nrow=2, ncol=nrow(data))))

names(data) <- c("rank", "score", "ptsw", "ptsl", "pttot", "ptdif", "count", "lastdate", "score1", "score2")

data$d1 <- data$score1 %% 10 # get last digits
data$d2 <- data$score2 %% 10

scores.df <- expand.grid(lastdigit1=c(0:9), # create a grid of all combinations
                         lastdigit2=c(0:9))

scores.df$count <- NA

## I warned you!

for(i in 0:9){   # loop through 
  for(j in 0:9){
      scores.df$count[which(scores.df$lastdigit1==i & scores.df$lastdigit2==j)] <- sum(data$count[which(data$d1==i & data$d2==j)])
  }
}

scores.df$prob <- scores.df$count / sum(scores.df$count) # prob = freq of a given outcome/total number of games

svg("FinalDigitsPlot.svg", width=6, height=5.2) # svg() takes dimensions in inches, not pixels
score.plot <- ggplot(data=scores.df, aes(x=lastdigit1, y=lastdigit2, colour=prob))

score.plot + geom_point(shape=15, size=15) + # rather than use heatmap(), I plotted squares colored by prob
        theme_bw() +
        scale_colour_gradient(name="Probability", low="grey70", high="red") +
        scale_x_continuous(limits=c(0, 9), breaks=0:9) +
        scale_y_continuous(limits=c(0, 9), breaks=0:9) + 
        labs(x="Last digit (winner)", y="Last digit (loser)") + 
        ggtitle("Final Digits from 14,417 Pro Football Games") 
dev.off()


A miracle occurred (p < .10)

While reading an article about statistical procedures in American Psychologist, I came across the following in a footnote:

Interestingly, Enrico Fermi, the great physicist, thought p = .10 to be the wise operational definition of a “miracle” (Polanyi, 1961) …

Intrigued, I found the cited Polanyi article in a 1962 issue of the journal Philosophy. Discussing the notion of unreasonable chances, Polanyi notes that

The late Enrico Fermi is reported to have said that a miracle is an event the chances of which are less than one in ten. (page 3)

Without any source to follow, I tried Googling to find the original source of the “Fermi miracle” probability figure. This led to a footnote in a collection titled Cosmochemical Evolution and the Origins of Life, this time summarizing a comment by Harold Urey:

The evolution of life is not a miracle in the ordinary sense of the word, but it might be a ‘Fermi-Miracle’. A first order Fermi-Miracle is defined as an event that occurs with a probability of 1/10. A Fermi-Miracle n-th order is an event that occurs with the probability of (1/10)^n. (page 9)

I also found other passing references in forum discussions to the miracle (p < .10) idea, each time attributed to Fermi, and one reference even put the Fermi-Miracle threshold at .2. But just as multiple sources seemed to converge, I came across the following quote attributed to Fermi, allegedly sourced to Owen Chamberlain:

A three standard deviation is a statistical fluctuation; a five standard deviation effect is a miracle.

This is, of course, a much more stringent cutoff, and it seems unlikely that Fermi would have bounced between the p = .10 threshold and the “five standard deviation” one (roughly p = .0000003). Given that there are so many more references to the p = .10 cutoff, I find it more believable that this figure, rather than the “five standard deviation” rule, traces back to Fermi, but I’m surprised that I wasn’t able to nail it down to a sourced quote. Surely there is someone out there familiar enough with Fermi to clear this up?
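
For reference, the rough correspondence between standard deviations and one-tailed tail probabilities can be checked in a couple of lines of R:

pnorm(-5)       # ~2.9e-07, i.e., roughly the .0000003 five-standard-deviation "miracle"
pnorm(-3)       # ~0.00135, the three-standard-deviation "fluctuation"
qnorm(1 - .10)  # p = .10 corresponds to only about 1.28 standard deviations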

Kaggle Post-Mortem: Psychopathy Prediction Blowout

Can Twitter expose psychopathy?

Over the last month and a half, the Online Privacy Foundation hosted a Kaggle competition, in which competitors attempted to predict psychopathy scores based on abstracted Twitter activity from a couple thousand users. One of the goals of the competition is to determine how much information about one’s personality can be extracted from Twitter, and by hosting the competition on Kaggle, the Online Privacy Foundation can sit back and watch competitors squeeze every bit of predictive ability out of the data, trying to predict the psychopathy scores of 1,172 Twitter users. Competitors can submit two sets of predictions each day, and each submission is scored from 0 (worst) to 1 (best) using a metric known as “average precision”. Essentially, a submission that predicts the correct ranking of psychopathy scores across all Twitter accounts will receive a score of 1.

Over the course of the contest, I made 42 submissions, making it my biggest Kaggle effort yet. Each submission is instantly scored and competitors are ranked on a public leaderboard according to their best submission. However, the public leaderboard score isn’t actually the “true” score – it is only an estimate based on a small portion of the test data. When the competition ends, five submissions from each competitor are compared to the full set of test data (all 1,172 Twitter accounts), and the highest-scoring submission from each user is used to calculate the final score. By the end of the contest, I had slowly worked my way up to 2nd place on the public leaderboard, shown below.

 

Top of the public leaderboard. The public leaderboard scores are calculated during the competition by comparing users’ predictions to a small subset of the test data.

 

I held this spot for the last week and felt confident that I would maintain a decent spot on the private or “true” leaderboard. Soon after the competition closed, the private leaderboard was revealed. Here’s what I saw at the top:

Top of the private leaderboard. The private leaderboard is the “real” leaderboard, revealed after the contest is closed. Scores are calculated by comparing users’ predictions to the full set of test data.

 

Where’d I go? I scrolled down the leaderboard…. further… and further… and finally found my name:

 

My place on the private leaderboard. I dropped from 2nd place on the public leaderboard to 52nd on the private leaderboard. Notice I placed below the random forest benchmark!

 

Somehow I managed to fall from 2nd all the way down to 52nd! I wasn’t the only one who took a big fall: the top five users on the public leaderboard ended up in 64th, 52nd, 58th, 16th, and 57th on the private leaderboard, respectively. I even placed below the random forest benchmark, a solution publicly available from the start of the competition.

 

What happened?

After getting over the initial shock of dropping 50 places, I began sifting through the ashes to figure out what went so wrong. I think those with more experience already know the answer, but one clue is in the screenshot of the pack leaders on the public leaderboard. Notice that the top five users, myself included, have a lot of submissions. For context, the median number of submissions in this contest was six. Contrast this with the (real) leaders on the private leaderboard – most have fewer than 12 submissions. Below, I’ve plotted the number of entries from each user against their final standing on the public and private leaderboards and added a trend line to each plot.
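
The plots themselves are easy to reproduce; here is a minimal ggplot2 sketch using a made-up standings data frame (hypothetical column names) in place of the scraped leaderboard data:

library(ggplot2)

# Stand-in data: one row per competitor (the real values came from the
# public and private leaderboard pages)
set.seed(1)
standings <- data.frame(entries = rpois(100, 8) + 1,
                        public_rank = sample(100),
                        private_rank = sample(100))

ggplot(standings, aes(x = entries, y = public_rank)) +
  geom_point(alpha = .6) +
  geom_smooth() +     # smoothed trend line
  labs(x = "Number of entries", y = "Public leaderboard standing")

(Swapping private_rank in for public_rank gives the second plot.)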

 

 

On the public leaderboard, more submissions are consistently related to a better standing.  It could be that the public leaderboard actually reflects the amount of brute force from a competitor rather than predictive accuracy. If you throw enough mud at a wall, eventually some of it will start to stick. The problem is that submissions that score well using this approach probably will not generalize to the full set of test data when the competition closes. It’s possible to overfit the portion of test data used to calculate the public leaderboard, and it looks like that’s exactly what I did.
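
A toy simulation (not the contest data) shows the mechanism: generate many noisy submissions, keep whichever looks best on a small “public” subset, and then check it against the rest. The 25% public fraction and the correlation metric below are assumptions for illustration only.

set.seed(42)
n_test <- 1172
truth  <- rnorm(n_test)                        # stand-in for the true outcomes
public <- sample(n_test, round(.25 * n_test))  # assumed public-subset fraction

subs <- replicate(100, truth + rnorm(n_test, sd = 3))  # 100 noisy "submissions"

public_score  <- apply(subs, 2, function(p) cor(p[public], truth[public]))
private_score <- apply(subs, 2, function(p) cor(p[-public], truth[-public]))

best <- which.max(public_score)  # the submission the public leaderboard rewards
public_score[best]               # looks impressive on the public subset...
private_score[best]              # ...but is unremarkable on the rest
mean(private_score)              # ...compared with the average submission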

Compare that trend on the public leaderboard to the U-shaped curve in the plot of private leaderboard standing against number of submissions. After about 25 submissions, private leaderboard standings actually get worse as the number of submissions increases.

Poor judgments under uncertainty

Overfitting the public leaderboard is not unheard of, and I knew that it was a possibility all along. So why did I continue to hammer away at the competition with so many submissions, knowing that I could be slowly overfitting to the leaderboard?

Many competitors estimate the quality of their submissions with cross-validation prior to uploading them. Because the public leaderboard is based on only a small portion of the test data, it is just a rough estimate of the true quality of a submission, and cross-validation gives a sort of second opinion on a submission’s quality. For most of my submissions, I used 10-fold cross-validation to estimate the average precision. So throughout the contest, I could observe both the public leaderboard score and my own estimated score from cross-validation. After the contest closes, Kaggle reveals the private or “true” score of each submission. Below, I’ve plotted the public, private, and CV-estimated score of each submission by date.
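
In outline, each per-submission CV estimate came from something like the sketch below (a minimal version, not my exact code; fit_fun and score_fun are stand-ins for the model and the contest’s average-precision metric, and data is assumed to be the training frame with a psychopathy column, as in the benchmark code later in this post):

cv_estimate <- function(data, fit_fun, score_fun, k = 10) {
  folds  <- sample(rep(1:k, length.out = nrow(data)))  # random fold assignment
  scores <- numeric(k)
  for (i in 1:k) {
    fit       <- fit_fun(data[folds != i, ])                 # train on k - 1 folds
    preds     <- predict(fit, newdata = data[folds == i, ])  # predict the held-out fold
    scores[i] <- score_fun(data$psychopathy[folds == i], preds)
  }
  mean(scores)  # CV estimate of out-of-sample performance
}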

 

Scores of my 42 submissions to the Psychopathy Prediction contest over time. Each submission has three corresponding scores: a public score, a private score, and a score estimated using 10-fold cross-validation (CV). The public score of a submission is calculated instantly upon submission by comparing a subset of the predictions to a corresponding subset of the test data. The private score is the “true” score of a submission (based on the entire set of test data) and is not observable until the contest is finished. Every submission has a public and private score, but a few submissions are missing a CV-estimated score. The dotted line is the private score of the winning submission.

 

There are a few things worth pointing out here:

  • My cross-validation (CV) estimated scores (the orange line) gradually improve over time. So, as far as I know, my submissions are actually getting better as I go.
  • The private or “true” scores actually get worse over time. In fact, my first two submissions to the contest turned out to be my best (and I did not choose them as any of my final five submissions).
  • The public scores reach a peak and then slowly get worse at the end of the contest.
  • It is very difficult to see any relationship among these three sets of scores.

Below, I’ve replaced the choppy lines with smoothed lines to show the general trends.

 

An alternate plot of the submission scores over time using smoothers to depict some general trends.

 

Based on my experience with past contests, I knew that the public leaderboard could not be trusted fully, and this is why I used cross-validation. I assumed that there was a stronger relationship between the cross-validation estimates and the private leaderboard than between the public and private leaderboard. Below, I’ve created scatterplots to show the relationships between each pair of score types.

 

Scatterplots of three types of scores for each submission: 10-fold CV estimated score, public leaderboard score, and private or “true” score.
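
This kind of scatterplot matrix takes a single call to pairs(); here is a minimal sketch with made-up per-submission scores standing in for my actual submission log:

# Hypothetical per-submission scores; the real values came from my CV runs
# and the two leaderboards
scores <- data.frame(cv      = runif(42, .80, .87),
                     public  = runif(42, .78, .88),
                     private = runif(42, .80, .86))
pairs(scores, labels = c("CV estimate", "Public score", "Private score"))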

 

The scatterplots tell a different story. It turned out that my cross-validation estimates were not related to the private scores at all (notice the horizontal linear trends in those scatterplots), and the public leaderboard wasn’t any better.  I already guessed that the public leaderboard would be a poor estimate of the true score, but why didn’t cross-validation do any better?

I suspect this is because, as the competition went on, I began to use much more feature selection and preprocessing. However, I made the classic cross-validation mistake of leaving these steps outside of the cross-validation folds (for more on this mistake, see this short description or section 7.10.2 in The Elements of Statistical Learning). This led to increasingly optimistic cross-validation estimates. I should have known better, but under such uncertainty, I fooled myself into accepting the most self-serving description of my current state. Even worse, I knew not to trust the public leaderboard, but when I started to edge toward the top, I began to trust it again!

Lessons learned

In the end, my slow climb up the leaderboard was due mostly to luck. I chose my five final submissions based on cross-validation estimates, which turned out to be a poor predictor of the true score. Ultimately, I did not include my best submissions in the final five, though they would only have brought me up to 33rd place – not all that much better than 52nd. All told, this was my most educational Kaggle contest yet. Here are some things I’ll take into the next contest:

  • It is easier to overfit the public leaderboard than I thought. Be more selective with submissions.
  • On a related note, perform cross-validation the right way: include all training steps (feature selection, preprocessing, etc.) inside each fold. (See the sketch after this list.)
  • Try to ignore the public leaderboard, even when it is telling you nice things about yourself.
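
To illustrate that second point, here is a minimal sketch (not my actual contest code) of keeping the selection step inside each fold; the top-k correlation filter below is just a stand-in for whatever feature selection or preprocessing is being done:

library(gbm)

cv_with_selection <- function(data, outcome = "psychopathy", k = 10, top_k = 50) {
  numeric_cols <- names(data)[sapply(data, is.numeric)]
  predictors   <- setdiff(numeric_cols, outcome)
  folds  <- sample(rep(1:k, length.out = nrow(data)))
  scores <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]

    # Select features using the training rows only, so the held-out fold
    # never influences which predictors are chosen
    cors   <- sapply(predictors, function(p) abs(cor(train[[p]], train[[outcome]])))
    chosen <- names(sort(cors, decreasing = TRUE))[1:top_k]

    form  <- as.formula(paste(outcome, "~", paste(chosen, collapse = " + ")))
    fit   <- gbm(form, data = train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 4, verbose = FALSE)
    preds <- predict(fit, newdata = test, n.trees = 1000)
    scores[i] <- cor(test[[outcome]], preds)  # stand-in for average precision
  }
  mean(scores)
}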

Some sample code

One of my best submissions (average precision = .86294) was actually one of my own benchmarks that took very little thought. By stacking it with two other models (random forests and the elastic net), I was able to get it up to .86334. Since the single model is pretty simple, I’ve included the code. After imputing missing values in the training and test sets with medians, I used the gbm package in R to fit a boosting model using every column in the data as a predictor. The hyperparameters were not tuned at all; they are just some reasonable starting values. I used the internal cross-validation feature of gbm to choose the number of trees. The full code from start to finish is below:

library(gbm)

# impute.NA is a little function that fills in NAs with either means or medians
impute.NA <- function(x, fill="mean"){
  if (fill=="mean")
  {
    x.complete <- ifelse(is.na(x), mean(x, na.rm=TRUE), x)
  }

  if (fill=="median")
  {
    x.complete <- ifelse(is.na(x), median(x, na.rm=TRUE), x)
  }

  return(x.complete)
}

data <- read.table("Psychopath_Trainingset_v1.csv", header=T, sep=",")
testdata <- read.table("Psychopath_Testset_v1.csv", header=T, sep=",")

# Median impute all missing values
# Missing values are in columns 3-339
fulldata <- apply(data[,3:339], 2, FUN=impute.NA, fill="median")
data[,3:339] <- fulldata

fulltestdata <- apply(testdata[,3:339], 2, FUN=impute.NA, fill="median")
testdata[,3:339] <- fulltestdata

# Fit a generalized boosting model

# Create a formula that specifies that psychopathy is to be predicted using
# all other variables (columns 3-339) in the dataframe

gbm.psych.form <- as.formula(paste("psychopathy ~", 
                                   paste(names(data)[c(3:339)], collapse=" + ")))

# Fit the model by supplying gbm with the formula from above. 
# Including the train.fraction and cv.folds argument will perform 
# cross-validation 

gbm.psych.bm.1 <- gbm(gbm.psych.form, n.trees=5000, data=data,
                      distribution="gaussian", interaction.depth=6,
                      train.fraction=.8, cv.folds=5)

# gbm.perf will return the optimal number of trees to use based on 
# cross-validation. Although I grew 5,000 trees, cross-validation suggests that
# the optimal number of trees is about 4,332.

best.cv.iter <- gbm.perf(gbm.psych.bm.1, method="cv") # 4332

# Use the trained model to predict psychopathy from the test data. 

gbm.psych.1.preds <- predict(gbm.psych.bm.1, newdata=testdata, best.cv.iter)

# Package it in a dataframe and write it to a .csv file for uploading.

gbm.psych.1.bm.preds <- data.frame(myID=testdata$myID, 
                                   psychopathy=gbm.psych.1.preds)

write.table(gbm.psych.1.bm.preds, "gbmbm1.csv", sep=",", row.names=FALSE)


Highlights from my trip to useR! 2012

Frank Harrell giving opening comments to start off useR! 2012

 

This was the first year that I was able to attend useR!, the annual conference for R users, and I’m really glad I did. I had to make some tough choices when picking focus sessions, and I surely missed some of the good focus talks. Some of the cool things I did get to see included:

  • Max Kuhn held an afternoon tutorial session on the caret package (caret is short for Classification And REgression Training). The caret package is a life-saver if you do any kind of predictive modeling. I had already known a little about the package, but I learned a lot from the tutorial. It turns out that caret does a lot of things that my own little custom R functions do (plus a LOT more), and caret does everything much more efficiently and easily. I also learned that e1071 is a reference to a course name and not some top secret project.
  • David Kahle demonstrated some cool features of the ggmap package. The ggmap package takes a lot of the headache out of grabbing a map from various web sources and using them as backdrops for ggplot2 plots. It also has some great functions for getting all kinds of geographical information directly from the R command line.
  • Yihui Xie, who would most likely be voted the star of useR! 2012, was everywhere – giving a focus talk, a lightning talk, and an invited talk – talking about several different contributions to R and doing so with a great sense of humor. It was hard to go a couple hours at the conference without hearing someone mention knitr, Yihui’s awesome package for making beautiful, reproducible reports even easier in R (or any other language). His presentation of cranvas, which can quickly create interactive plots of millions of points, was stunning, too. Most importantly, he managed to get “ALL YOUR BASE ARE BELONG TO US” on the giant screen in Langford Auditorium.
  • The lightning talk session was one of my favorite parts of the conference. Each presenter gets 5 minutes to present 15 slides, each of which automatically advances after 20 seconds. Tim Hesterberg shared some details about his dataframe and aggregation packages in a lightning talk and elaborated further on them in an invited talk. Adopting the dataframe package seems like a no-brainer for most R users: simply install it and load it at the beginning of your R session and forget about it, all while enjoying a modest speed boost across a variety of R tasks. The performance increase comes from reducing the number of copies of data made by R during common operations, and you can read more about the magic behind the packages at Tim’s site.
  • Jim Harner presented the rknn package, an R implementation of a random k-nearest neighbors approach for regression and classification. From the benchmarks provided, rknn looked to be a solid contender with random forests in both speed and predictive performance. I’ll definitely be trying this out in the near future.
  • The special invited discussion on “What other languages should R users know about?” confirmed my nagging suspicions about Julia and Python: I really have to learn them. Douglas Bates walked the audience through some Julia code (only after a nice musical intro), showing off some very impressive speed, and Chris Fonnesbeck showed off Python via the IPython notebook. After seeing the dataframe-like structures from the Pandas library, the barrier to trying out Python looks a little lower than I thought. To R/Python users: what is the closest Python equivalent of RStudio?

Coming to useR! this year also gave me a chance to spend some time in Nashville, a place that I came to love while living there throughout grad school. Walking around the peaceful Vanderbilt campus at night was the perfect way to mellow out after several hours of R each day. It will be a little more difficult to make it to useR! 2013, which will be held at the University of Castilla-La Mancha in Albacete, Spain, but I’ll do my best to make it!

Vanderbilt’s Peabody College campus at sunset

Don’t use the formula interface with randomForest()

Recently I was using randomForest() in R with a dataset containing about 8000 predictors. With smaller datasets, I could get away with using the formula interface (e.g., y ~ x1 + x2 … + x5) when fitting a model, but in this case, I couldn’t even get randomForest() to finish after handing it a giant formula.

Some googling led me to a great post about randomForest() by Chris Raimondi on the Heritage Health Prize forums that advised against the formula interface. Some posts at StackOverflow gave similar admonishments. And, of course, the manual for randomForest() suggests against it in cases with many predictors.

The alternative to the formula interface is to supply your response and predictors separately using the ‘y=’ and ‘x=’ arguments and column indices. The toy example below demonstrates the gains from this alternative. With 1000 predictors, randomForest() finishes in about 28 seconds with the formula interface and in just under 6 seconds using the indexing alternative!

Try the code below to see for yourself:

# Create some data
library(randomForest)

# Create a dataframe with a categorical response and 1000 random continuous predictors
data <- data.frame(y=gl(2, 50, labels = 0:1),
                   matrix(rnorm(100*1000), nrow=100, ncol=1000))

# Add some names
names(data) <- c("y", paste("X", c(1:1000), sep=""))

# Create a formula to describe the model. my.rf.formula is 'y ~ X1 + X2 + .... X999 + X1000'
my.rf.formula <- as.formula(paste("y ~ ", paste(names(data)[-1], collapse=" + "), sep="")) 

# First, run randomForest with the formula interface
system.time(rf.1 <- randomForest(my.rf.formula, mtry=10, ntree=2000, data=data))
# elapsed time: 28.24 s

# Now try it using column indices 
system.time(rf.2 <- randomForest(y=data[,1], x=data[,-1], mtry=10, ntree=2000))
# elapsed time: 5.97 s


My magical Baader-Meinhof moment at the Fulton Street station

As I left the office this evening and headed towards the subway station, I turned on the most recent episode of one of my favorite new podcasts: Philosophy Bites. Very early in the discussion, the guest philosopher Alain de Botton introduces the Greek word akrasia:

The ancient Greeks have this wonderful word, akrasia – weakness of will. Arguments and positions can be very interesting and convincing, but we all suffer from weakness of will. And to combat weakness of will, you need training – mental training.

I hear this as I’m walking down the staircase into the subway station, half of my mind attending to the discussion and the other half trying to visualize the word akrasia in Greek letters. I pop through the turnstiles, peek down the tracks, then turn towards the wall behind me to see this:

Akrasia

Bizarre. This is easily my best personal example of the Baader-Meinhof Phenomenon.

How I Didn’t Do It: Finishing 36th in the Don’t Get Kicked! Competition.

My first run in a Kaggle competition finally ended recently with the closing of the “Don’t Get Kicked!” contest. It was an excellent learning experience and I had a lot of fun in the process (at the expense of sleep). I was pleasantly surprised to have ended up in 36th place out of 588 teams.

The task in this competition was to predict “bad buys” among thousands of cars purchased through online auctions. Contestants were given information on attributes such as purchase price, make and model of the car, trim level, odometer reading, date of purchase, state of origin, and so on. The original/training data included about 40 variables (including the “bad buy” outcome) on roughly 70,000 cars, and the test data had the same information on about 40,000 cars (minus the outcome). Contestants submit predictions, in the form of probabilities, about which of these 40,000 cars will turn out to be bad buys. Each set of predictions is instantly scored against reality using the Gini coefficient.
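
For reference, the normalized Gini metric as it is commonly implemented on the Kaggle forums looks roughly like the sketch below; the official scorer may differ in detail.

sum_gini <- function(actual, predicted) {
  df <- data.frame(actual, predicted)
  df <- df[order(df$predicted, decreasing = TRUE), ]  # rank cars by predicted risk
  lorentz <- cumsum(df$actual) / sum(df$actual)       # cumulative share of bad buys found
  random  <- (1:nrow(df)) / nrow(df)                  # share expected under random ordering
  sum(lorentz - random)
}

normalized_gini <- function(actual, predicted) {
  sum_gini(actual, predicted) / sum_gini(actual, actual)  # 1 = perfect ranking
}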

I submitted 30 sets of predictions over the course of the competition, using a variety of methods. Below I’ve plotted my results over time, along with the type of method behind each submission. The dashed grey line along the top is the score of the winning submission.


Notice the big jump around October 10th? That was due to a novel insight: reading the competition directions. I had been submitting predicted classes (0s and 1s), not predicted probabilities between 0 and 1. Here’s the same plot, excluding the erroneous submissions:


My best score resulted from a combination of a lot of boosted trees, random forests, and additive models. I used a weighted average of the predictions from each method, choosing the weights using cross-validation. Everything was done in R: boosting with the gbm package, random forests with the randomForest package, and additive models with the gam package.
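
The blending step itself is simple. Here is a hypothetical sketch of choosing the weights on held-out data; the prediction vectors, outcomes, and the rank-correlation scoring function below are all stand-ins rather than the contest’s actual data or metric.

set.seed(1)
y_holdout <- rbinom(500, 1, .12)  # held-out "bad buy" outcomes (made up)
preds_gbm <- runif(500); preds_rf <- runif(500); preds_gam <- runif(500)

score <- function(obs, pred) cor(obs, pred, method = "spearman")  # stand-in metric

# Coarse grid of weights that sum to 1
grid <- expand.grid(w_gbm = seq(0, 1, .1), w_rf = seq(0, 1, .1))
grid <- subset(grid, w_gbm + w_rf <= 1)
grid$w_gam <- 1 - grid$w_gbm - grid$w_rf

grid$holdout_score <- apply(grid, 1, function(w) {
  score(y_holdout, w[1]*preds_gbm + w[2]*preds_rf + w[3]*preds_gam)
})
grid[which.max(grid$holdout_score), ]  # best-scoring weight combination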

My basic strategy was to create lots of plots and descriptive statistics, then create useful transformations and combinations of variables that could be fed into models as predictors. My predictor set grew from about 10 to 51 over the course of the competition:


I had created about 150 predictors in the end but only about 50 proved to be useful. As you can see below, boosting was superior to most other methods given the same predictors.


While it was a lot of fun, I’m glad to take a break from Kaggling this weekend, and I look forward to seeing what the winners did!


Struck by Kaggle

In the last five months, I’ve done a few things that have made updating this blog a lot harder than anticipated: moved to New York, started a new job, got a dog, and discovered Kaggle. What’s Kaggle? Perhaps the best description comes from Kaggle’s own site:

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems – the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

Competing on Kaggle is as easy as registering and uploading a .csv. Feedback on submissions is instant, and a public leaderboard is constantly updated as participants submit. The available competitions provide a nice range of problems to tackle, including predicting insurance claims, mapping dark matter, and modeling edits to Wikipedia.

Before I had ever heard of Kaggle, back in early 2010, I participated in my first modeling/prediction contest through the now-defunct Analytics X competition. In that contest, the goal was to predict the spatial distribution of homicides in Philadelphia. I spent many hours working on a single submission for that competition, and I remember being horrified by how poorly it ranked. Participant submissions were ranked according to the amount of prediction error (RMSE, I think), and mine ranked abysmally low. My precious multilevel model was a total flop, and based on the relatively low error rates from everyone else on the public leaderboard, all of my fellow competitors were using some kind of voodoo to make their predictions. Confused and dismayed, I stopped working on the Analytics X competition and focused on finishing graduate school.

My Analytics X experience continued to haunt me. What happened? I did all of the right things according to my statistics textbooks! So, I started poking around for clues to my (and my model’s) failure. I began following discussions at r/MachineLearning and Cross Validated, picked up Berk’s gentle Statistical Learning from a Regression Perspective, and struggled through a lot of The Elements of Statistical Learning (free pdf here!). At some point, I stumbled upon a wonderful presentation (video link here) titled “Getting in Shape for the Sport of Data Science” by Jeremy Howard, Kaggle’s President and Chief Scientist, and things started to take shape. Armed with a few new tricks, I was determined to give the prediction game another go.

Since then, I’ve entered two Kaggle competitions (the Photo Quality Prediction contest, which ended recently, and Don’t Get Kicked!, which is still ongoing) and have had a lot of fun. I’m very far from even placing in a competition, but I’m faring much better thanks to the tools and techniques outside of the standard social science toolbox.

I can’t overstate the value of approaching statistical modeling with the goal of accurate prediction rather than hypothesis testing, even if hypothesis testing is your job. The weaknesses of the traditional social science methods (linear regression, generalized linear models) become obvious very quickly. You are forced to try new techniques, scour your data for patterns, and get creative with transforming variables. You can’t get too attached to any given model. Instead, you need to be critical with every aspect of your model, do lots of tinkering, and be prepared to trash it if you hit a wall.

Many Kaggle competitors submit 50+ entries throughout a given contest. I’ve already submitted about 25 attempts in a current competition, making small changes each time. This iterative approach encourages a deeper level of understanding of the data than, say, testing a model’s goodness-of-fit or  whether a certain coefficient was different from zero. I think this is counterintuitive to those who assume that the world of machine learning and statistical prediction is all black boxes without much concern for the underlying processes that generated the data.

My experience with Kaggle has left me wondering why predictive accuracy isn’t more important in social science. Sure, good prediction does not equal a good theory, but shouldn’t a good theory result in good predictions? Yet I can’t remember the last time I ever saw any kind of cross-validation in a paper. If there was an academic psychology journal that hosted Kaggle-style competitions between different theoretical camps, I’d read it every month. I’d even pay for it! Until then, I’ll continue reading No Free Hunch.

 

 

 

The Garmin Forerunner 305 – it’s not just for running.

I was lucky enough to receive the awesome gift of a Garmin Forerunner 305 GPS watch recently, and it quickly became an essential piece of running gear. It collects all sorts of fun data from each workout, but my favorite feature is the heartrate monitor, which wirelessly transmits heartrate information to the watch from an elastic band worn on the torso. Having such objective feedback about how hard the heart is working can be very useful info mid-run, but why limit the Garmin 305 to my runs?

I’ve been wanting to play around with the Garmin data in R anyway, so for a fun project, I tracked my heartrate during the oral defense of my dissertation last week. I started tracking about 50 minutes before the defense started and ended shortly after the defense, which lasted a little over two hours. The Garmin 305 saves all workout data as .tcx files, and I used the free software TCX Converter to convert my “workout” file to a .csv. From there, it was pretty straightforward to plot everything out (I used functions from Hadley Wickham’s stringr package and a smoother from Frank Harrell’s Hmisc package; R code below, formatted with Pretty R). Everything went pretty smoothly during the defense (the proposal is generally much tougher, I’m told), and it looks like I calmed down and settled in after the first 25 minutes or so. (Edit: My resting heartrate is just under 60 bpm.) Perhaps the most stressful part of the whole process was spending the 20 minutes prior to the presentation getting the projector to cooperate with my poor old work laptop. Subjectively, I felt relatively calm while talking, but I was surprised to see such a drop in my heartrate directly after it was finished.

I really wish I had this kind of data for all my past talks and presentations, because I’m sure my heartrate was skyrocketing during these kinds of things five years ago. Training works!

library(stringr)
library(Hmisc)

defense <- read.table("defense.csv", header=T, sep=",")

## Time is in a big string, so str_sub from the stringr library
## pulls out hours, minutes, and seconds
defense$hour <- as.numeric(str_sub(defense$TIME, start=12, end=13))
defense$minute <- as.numeric(str_sub(defense$TIME, start=15, end=16))
defense$second <- as.numeric(str_sub(defense$TIME, start=18, end=19))

## Combine three time variables to create a "total minutes passed" variable
## Time started in the 17:00 hour 
defense$totalminutes <- (defense$hour - 17)*60 + defense$minute + defense$second/60

par(mar=c(5,4,3,2))
plot(defense$totalminutes, defense$HR, pch=16, col="grey95", axes=F,
    ylim=c(50,115), xlim=c(20, 202), ylab="Heartrate (bpm)", xlab="Time\n(Hours:Minutes)")

# Create custom axes, labels and tickmarks
x.axis.ticks <- c(14, 24, 44, 64, 84, 104, 124, 144, 164, 184, 194, 204)
x.axis.labels <- c(" ", "-0:40m", "-0:20m", "0:00", "0:20","0:40","1:00","1:20",
                   "1:40", "2:00", " ", "2:20")
y.axis.ticks <- c(50, 60, 70, 80, 90, 100, 110)
y.axis.labels <- c(50, 60, 70, 80, 90, 100, 110)
axis(side=2, pos=14, las=2,lty=1, lwd=.5, at=y.axis.ticks,
     labels=y.axis.labels, tck=-.02, cex.axis=1.0)
axis(side=1, pos=50, las=0,lty=1, lwd=.5, at=x.axis.ticks,
     labels=x.axis.labels, tck=-.02, cex.axis=1.0)

# Two vertical gridlines with labels
segments(x0=64, x1=64, y0=50, y1=110, lwd=2, col="grey60", lty=1)
text(x=64, y=113, "Defense\nstarts", cex=.9)
segments(x0=194, x1=194, y0=50, y1=110, lwd=2, col="grey60", lty=1)
text(x=194, y=113, "Defense\nends", cex=.9)

# Create horizontal gridlines
for (i in seq(60,110,10))
  {
    segments(x0=14, x1=204, y0=i, y1=i, lwd=1, col="grey70", lty=1)
  }

# Just for fun, bootstrap a loess smoother (using plsmo() from Hmisc library)
for (i in 1:1000)
  {
    myboot <- sample(1:nrow(defense), replace=TRUE)
    bootdata <- defense[myboot,]
    plsmo(bootdata$totalminutes, bootdata$HR, col="grey70", f=.05, add=TRUE)
  }

# Draw one overlapping thick red smoothed line
plsmo(x=defense$totalminutes, y=defense$HR, col="red3",lty=1,
        lwd=5, f=.05, add=TRUE, datadensity=F)

# Label various events

arrows(x0=34, y0=95, x1=15, y1=83, lwd=2, angle=20, length=.1)
text(x=34, y=97, "Arrive at office", cex=.7)

arrows(x0=36, y0=57, x1=36, y1=62, lwd=2, angle=20, length=.1)
text(x=36, y=54, "Projector\nbroken...", cex=.7