Kaggle Post-Mortem: Psychopathy Prediction Blowout

Can Twitter expose psychopathy?

Over the last month and a half, the Online Privacy Foundation hosted a Kaggle competition in which competitors attempted to predict psychopathy scores from abstracted Twitter activity for a couple thousand users. One goal of the competition was to determine how much information about one's personality can be extracted from Twitter, and by hosting it on Kaggle, the Online Privacy Foundation could sit back and watch competitors squeeze every bit of predictive ability out of the data while trying to predict the psychopathy scores of 1,172 Twitter users. Competitors could submit two sets of predictions each day, and each submission was scored from 0 (worst) to 1 (best) using a metric known as “average precision”. Essentially, a submission that predicts the correct ranking of psychopathy scores across all Twitter accounts receives a score of 1.

Over the course of the contest, I made 42 submissions, making it my biggest Kaggle effort yet. Each submission is scored instantly, and competitors are ranked on a public leaderboard according to their best submission. However, the public leaderboard score isn't actually the “true” score – it is only an estimate, calculated by comparing a submission to a small portion of the test data. When the competition ends, five submissions selected by each competitor are compared to the full set of test data (all 1,172 Twitter accounts), and the highest-scoring of these determines that competitor's final standing. By the end of the contest, I had slowly worked my way up to 2nd place on the public leaderboard, shown below.

 

Top of the public leaderboard. The public leaderboard scores are calculated during the competition by comparing users’ predictions to a small subset of the test data.

 

I held this spot through the final week and felt confident that I would land in a decent position on the private or “true” leaderboard. Soon after the competition closed, the private leaderboard was revealed. Here's what I saw at the top:

Top of the private leaderboard. The private leaderboard is the “real” leaderboard, revealed after the contest is closed. Scores are calculated by comparing users’ predictions to the full set of test data.

 

Where’d I go? I scrolled down the leaderboard… further… and further… and finally found my name:

 

My place on the private leaderboard. I dropped from 2nd place on the public leaderboard to 52nd on the private leaderboard. Notice I placed below the random forest benchmark!

 

Somehow I managed to fall from 2nd all the way down to 52nd! I wasn’t the only one who took a big fall: the top five users on the public leaderboard ended up in 64th, 52nd, 58th, 16th, and 57th on the private leaderboard, respectively. I even placed below the random forest benchmark, a solution publicly available from the start of the competition.

 

What happened?

After getting over the initial shock of dropping 50 places, I began sifting through the ashes to figure out what went so wrong. Those with more experience probably already know the answer, but one clue is in the screenshot of the pack leaders on the public leaderboard. Notice that the top five users, including me, made a lot of submissions. For context, the median number of submissions in this contest was six. Contrast this with the (real) leaders on the private leaderboard – most made fewer than 12 submissions. Below, I've plotted the number of entries from each user against their final standing on the public and private leaderboards and added a trend line to each plot.

 

 

On the public leaderboard, more submissions are consistently related to a better standing. It could be that the public leaderboard reflects the amount of brute force a competitor applies rather than true predictive accuracy. If you throw enough mud at a wall, eventually some of it will start to stick. The problem is that submissions that score well through this approach probably will not generalize to the full set of test data when the competition closes. It's possible to overfit the portion of the test data used to calculate the public leaderboard, and it looks like that's exactly what I did.
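
To see how this can happen, consider a toy simulation (my own illustration, not the contest data) in which every submission has exactly the same true quality, and the public score simply adds noise because it is computed on a small subset of the test data. Keeping only your best public score out of many submissions then looks like steady improvement, even though nothing real has changed:

# Toy simulation of overfitting a public leaderboard through repeated submission.
# Every submission has the same true (private) quality; the public score adds
# noise because it is computed on only a small subset of the test data.
set.seed(1)

n.competitors <- 1000
true.score    <- 0.80   # identical "true" quality for every submission
public.noise  <- 0.05   # noise from scoring on a small public subset

best.public.score <- function(n.submissions) {
  # each simulated competitor keeps the best of their noisy public scores
  replicate(n.competitors,
            max(true.score + rnorm(n.submissions, sd = public.noise)))
}

mean(best.public.score(5))    # ~0.86 with a handful of submissions
mean(best.public.score(50))   # ~0.91 with many submissions...
                              # ...yet the private score is still 0.80 either way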

Compare that trend on the public leaderboard to the U-shaped curve in the plot of private leaderboard standing against number of submissions. After about 25 submissions, private leaderboard standings get worse as the number of submissions increases.

Poor judgments under uncertainty

Overfitting the public leaderboard is not unheard of, and I knew that it was a possibility all along. So why did I continue to hammer away at the competition with so many submissions, knowing that I could be slowly overfitting to the leaderboard?

Many competitors use cross-validation to estimate the quality of their submissions before uploading them. Because the public leaderboard is based on only a small portion of the test data, it is just a rough estimate of a submission's true quality, and cross-validation offers a second opinion. For most of my submissions, I used 10-fold cross-validation to estimate the average precision, so throughout the contest I could observe both the public leaderboard score and my own estimated score from cross-validation. After the contest closes, Kaggle reveals the private or “true” score of each submission. Below, I've plotted the public, private, and CV-estimated score of each submission by date.
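
For anyone unfamiliar with the idea, here is a minimal sketch of how a CV-estimated score is computed. It uses simulated data rather than the contest data, and a rank correlation stands in for the contest's average precision metric just to keep the example self-contained:

set.seed(1)
# Simulated stand-in data (not the contest data): two predictors, one of
# which actually carries some signal about the outcome
toy   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 0.5 * toy$x1 + rnorm(200)

# Stand-in scoring function; the contest scored submissions with average precision
score.fn <- function(actual, predicted) cor(actual, predicted, method = "spearman")

cv.estimate <- function(data, k = 10) {
  folds  <- sample(rep(1:k, length.out = nrow(data)))
  scores <- numeric(k)
  for (i in 1:k) {
    fit       <- lm(y ~ ., data = data[folds != i, ])         # train on the other 9 folds
    preds     <- predict(fit, newdata = data[folds == i, ])   # predict the held-out fold
    scores[i] <- score.fn(data$y[folds == i], preds)          # score the held-out fold
  }
  mean(scores)   # the CV-estimated score for this modeling approach
}

cv.estimate(toy)   # roughly 0.4 here; this is the kind of number I tracked per submission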

 

Scores of my 42 submissions to the Psychopathy Prediction contest over time. Each submission has three corresponding scores: a public score, a private score, and a score estimated using 10-fold cross-validation (CV). The public score of a submission is calculated instantly upon submission by comparing a subset of the predictions to the corresponding subset of the test data. The private score is the “true” score of a submission (based on the entire set of test data) and is not observable until the contest is finished. Every submission has a public and a private score, but a few submissions are missing a CV-estimated score. The dotted line is the private score of the winning submission.

 

There are a few things worth pointing out here:

  • My cross-validation (CV) estimated scores (the orange line) gradually improve over time. So, as far as I knew, my submissions were actually getting better as I went.
  • The private or “true” scores actually get worse over time. In fact, my first two submissions to the contest turned out to be my best (and I did not choose either of them among my final five submissions).
  • The public scores reach a peak and then slowly get worse toward the end of the contest.
  • It is very difficult to see any relationship among these trends.

Below, I’ve replaced the choppy lines with smoothed lines to show the general trends.

 

An alternate plot of the submission scores over time using smoothers to depict some general trends.

 

Based on my experience in past contests, I knew that the public leaderboard could not be fully trusted, which is why I used cross-validation. I assumed that the cross-validation estimates would track the private leaderboard more closely than the public leaderboard would. Below, I've created scatterplots to show the relationships between each pair of score types.

 

Scatterplots of three types of scores for each submission: 10-fold CV-estimated score, public leaderboard score, and private or “true” score.

 

The scatterplots tell a different story. It turned out that my cross-validation estimates were not related to the private scores at all (notice the flat trend lines in those scatterplots), and the public leaderboard wasn't any better. I had already guessed that the public leaderboard would be a poor estimate of the true score, but why didn't cross-validation do any better?

I suspect this is because, as the competition went on, I began to use much more feature selection and preprocessing. However, I made the classic cross-validation mistake of performing these steps on the full training set rather than within each cross-validation fold (for more on this mistake, see this short description or section 7.10.2 of The Elements of Statistical Learning). This led to increasingly optimistic cross-validation estimates. I should have known better, but under such uncertainty, I fooled myself into accepting the most self-serving description of my current state. Even worse, I knew not to trust the public leaderboard, but when I started to edge toward the top, I began to trust it again!
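
To see how costly this mistake can be, here is a small self-contained simulation in the spirit of the example in section 7.10.2 of The Elements of Statistical Learning (simulated noise data, not the contest data, with a simple correlation standing in for the scoring metric). The outcome is pure noise, so an honest cross-validation estimate should hover around zero; selecting features on the full data set before cross-validating paints a much rosier picture:

set.seed(2)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)   # 500 pure-noise predictors
y <- rnorm(n)                     # an outcome unrelated to any of them
folds <- sample(rep(1:10, length.out = n))

# Wrong: pick the 20 predictors most correlated with y using ALL rows,
# then cross-validate a linear model on just those predictors.
keep <- order(abs(cor(X, y)), decreasing = TRUE)[1:20]
wrong <- sapply(1:10, function(i) {
  fit   <- lm(y[folds != i] ~ X[folds != i, keep])
  preds <- cbind(1, X[folds == i, keep]) %*% coef(fit)
  cor(preds, y[folds == i])
})

# Right: redo the selection inside each fold, using only that fold's training rows.
right <- sapply(1:10, function(i) {
  keep.i <- order(abs(cor(X[folds != i, ], y[folds != i])), decreasing = TRUE)[1:20]
  fit    <- lm(y[folds != i] ~ X[folds != i, keep.i])
  preds  <- cbind(1, X[folds == i, keep.i]) %*% coef(fit)
  cor(preds, y[folds == i])
})

mean(wrong)   # comfortably positive -- an optimistic estimate caused by the leak
mean(right)   # around zero, as it should be for pure noise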

Lessons learned

In the end, my slow climb up the public leaderboard was due mostly to luck. I chose my five final submissions based on cross-validation estimates, which turned out to be a poor predictor of the true score. As a result, my best submissions were not among the final five; had they been, I would have finished in 33rd place – not all that much better than 52nd. All said, this was my most educational Kaggle contest yet. Here are some things I'll take into the next contest:

  • It is easier to overfit the public leaderboard than I had thought. Be more selective with submissions.
  • On a related note, perform cross-validation the right way: carry out every step of the training process (feature selection, preprocessing, etc.) within each fold.
  • Try to ignore the public leaderboard, even when it is telling you nice things about yourself.

Some sample code

One of my best submissions (average precision = .86294) was actually one of my own benchmarks that took very little thought. By stacking this with two other models (random forests and elastic net), I was able to get it up to .86334. Since the single model is pretty simple, I've included the code. After imputing missing values in the training and test sets with medians, I used the gbm package in R to fit a boosting model using every column in the data as a predictor. The hyperparameters were not tuned at all – I just used some reasonable starting values. I used the internal cross-validation feature of gbm to choose the number of trees. The full code from start to finish is below:

library(gbm)

# impute.NA is a little function that fills in NAs with either means or medians
impute.NA <- function(x, fill="mean"){
  if (fill=="mean")
  {
    x.complete <- ifelse(is.na(x), mean(x, na.rm=TRUE), x)
  }

  if (fill=="median")
  {
    x.complete <- ifelse(is.na(x), median(x, na.rm=TRUE), x)
  }

  return(x.complete)
}

data <- read.table("Psychopath_Trainingset_v1.csv", header=T, sep=",")
testdata <- read.table("Psychopath_Testset_v1.csv", header=T, sep=",")

# Median impute all missing values
# Missing values are in columns 3-339
fulldata <- apply(data[,3:339], 2, FUN=impute.NA, fill="median")
data[,3:339] <- fulldata

fulltestdata <- apply(testdata[,3:339], 2, FUN=impute.NA, fill="median")
testdata[,3:339] <- fulltestdata

# Fit a generalized boosting model

# Create a formula that specifies that psychopathy is to be predicted using
# all other variables (columns 3-339) in the dataframe

gbm.psych.form <- as.formula(paste("psychopathy ~", 
                                   paste(names(data)[c(3:339)], collapse=" + ")))

# Fit the model by supplying gbm with the formula from above. 
# Including the train.fraction and cv.folds argument will perform 
# cross-validation 

gbm.psych.bm.1 <- gbm(gbm.psych.form, n.trees=5000, data=data,
                      distribution="gaussian", interaction.depth=6,
                      train.fraction=.8, cv.folds=5)

# gbm.perf will return the optimal number of trees to use based on 
# cross-validation. Although I grew 5,000 trees, cross-validation suggests that
# the optimal number of trees is about 4,332.

best.cv.iter <- gbm.perf(gbm.psych.bm.1, method="cv") # 4332

# Use the trained model to predict psychopathy from the test data. 

gbm.psych.1.preds <- predict(gbm.psych.bm.1, newdata=testdata, best.cv.iter)

# Package it in a dataframe and write it to a .csv file for uploading.

gbm.psych.1.bm.preds <- data.frame(cbind(myID=testdata$myID, 
                                         psychopathy=gbm.psych.1.preds))

write.table(gbm.psych.1.bm.preds, "gbmbm1.csv", sep=",", row.names=FALSE)


11 Responses to “Kaggle Post-Mortem: Psychopathy Prediction Blowout”

  1. Great article Gregory!

    I found a significant source of error in my learners was the imputing part. Using my benchmark algorithm, on (almost) complete data I could score .861, while on imputed data I could only manage .854. In total, about 16% of the data was missing from the data set, not trivial stuff.

    btw, do you ever use Python or Matlab?

    • greg says:

      That’s really interesting to hear, and I’m curious how everyone else handled the missing data. Like you, I was surprised at the amount and patterns of missingness in the training and test sets. I spent some time trying to determine whether missingness itself was related to the outcome, and I never found solid evidence that it was. Still, my method of imputing the missing data was really quick and dirty, and I probably should have tried some kind of regression or nn-based approach. Next time…

      Also, I haven’t tried using anything but R for Kaggle yet. I think I’ll have to force myself to use Python exclusively for a future contest. No experience at all with Matlab!

  2. Zach Mayer says:

    You mentioned that you stacked a random forest and an elastic net model, and then provided code for a gbm model. Out of curiosity, how did you stack your two models? Would you be interested in posting the code you used to do so?

    I ask, because I’ve found that stacking is often prone to over-fitting and must be done very carefully, so I’m curious to see what your technique was.

    • greg says:

      Zach,

      I’d like to make a separate post in the future about my attempts at stacking, mostly because I’m still learning about it as well, and I have had inconsistent results with it so far. I only posted the gbm because it performed relatively well and the code is fairly simple compared to the submissions using stacking (I combined a gbm, glmnet, and random forest). I experimented with different “stacker” models, but combining the three models with a linear model (lm) performed best on the private leaderboard.

      • Zach Mayer says:

        What did you use as “training” data for your stacker? Simply the predictions from your gbm, rf, glmnet models?

        • greg says:

          Basically, but it is important to only use held-out predictions. I used 10-fold cross-validation for each (level-1) model individually and kept the predictions from each held-out fold. Repeat this for every model. Ultimately, you’ll have one column of held-out predictions for each model, and these columns can be used as predictors or “training” data in your level-2/stacker/blender model.

  3. Great analysis. I dropped 60 spots on the Biological Response contest, public to private. It makes me feel better knowing other people make the same mistake.

  4. Thanks SO much for this analysis. I’m yet to submit to any kaggle contests but as a beginner, this sort of insight into your seasoned experience of a contest is invaluable! Rock on and awesome write up.

  5. [...] This post by Gregory Park is a great example of how overfitting can hurt an analysis, and… Tagged Cross-Validation, Overfitting [...]

  6. [...] have since worked with us in writing up the results, including Gregory Park who wrote this blog post about his experiences in the [...]

  7. This is a really great post, Greg. How are you defining preprocessing? I ask because I’m a little surprised that your preprocessing steps should contribute to overfitting. Did you change your preprocessing pipeline in response to your place on the public leaderboard? In that case, it seems like your preprocessing is effectively another form of feature selection or even dynamic feature generation.
