Author Archives: greg

Heatmap of the hottest final digits of historical football scores.

I heard about a friendly office pool based on the outcome of the Super Important Internationally Televised Football Championship happening this weekend. Rather than betting on the actual winner of the game, everyone picks a handful of combination of two single digits, like (2, 1) or (9, 3). Nobody can have pick the same combinations, but if I choose (2, 1), then (1, 2) is still up for grabs. Each digit corresponds to the last digit in each team’s final score, so if I choose (2, 1) and the final score is 42-21, I win.

It’s reasonable to think that some combinations are more likely than others, but since I’m not very familiar with football, I didn’t have a good sense of what combinations would make the best bets. Naturally, I had to throw a bunch of data into R and make a graph! Below you can see the probabilities of all possible combinations. The data is from 14,417 professional football scores found on http://www.pro-football-reference.com.

Not all of the scores are from NFL games — there are some scores from AFPA, the AAFC, the AFL in here, too. Maybe this will add some funny outcomes in there if they had different scoring systems. Still, the results look reasonable at first glance. Scores ending in 3, 4, and 7 seem to be the best bets. That (7, 0) is so hot right now. (7, 0). R code is below.

FinalDigitsPlot

 

library(stringr)
library(ggplot2)

data <- read.table("nflscores.csv", header=T, sep=",") # I grabbed all scores from http://www.pro-football-reference.com, 

## This is gonna get ugly

data <- cbind(data, t(matrix(as.numeric(unlist(str_split(data$Score, pattern="-"))), nrow=2, ncol=997)))

names(data) <- c("rank", "score", "ptsw", "ptsl", "pttot", "ptdif", "count", "lastdate", "score1", "score2")

data$d1 <- data$score1 %% 10 # get last digits
data$d2 <- data$score2 %% 10

scores.df <- expand.grid(lastdigit1=c(0:9), # create a grid of all combinations
                         lastdigit2=c(0:9))

scores.df$count <- NA

## I warned you!

for(i in 0:9){   # loop through 
  for(j in 0:9){
      scores.df$count[which(scores.df$lastdigit1==i & scores.df$lastdigit2==j)] <- sum(data$count[which(data$d1==i & data$d2==j)])
  }
}

scores.df$prob <- scores.df$count / sum(scores.df$count) # prob = freq of a given outcome/total number of games

svg("FinalDigitsPlot.svg", width=450, height=390)
score.plot <- ggplot(data=scores.df, aes(x=lastdigit1, y=lastdigit2, colour=prob))

score.plot + geom_point(shape=15, size=15) + # rather than use heatmap(), I plotted squares colored by prob
        theme_set(theme_bw()) +              
        scale_colour_gradient(name="Probability", low="grey70", high="red") +
        scale_x_continuous(limits=c(0, 9), breaks=0:9) +
        scale_y_continuous(limits=c(0, 9), breaks=0:9) + 
        labs(x="Last digit (winner)", y="Last digit (loser)") + 
        ggtitle("Final Digits from 14,417 Pro Football Games") 
dev.off()

Created by Pretty R at inside-R.org

 

Pursuing the good life

If you were to ask a random stranger on the street to name a psychologist, it’s a safe bet that most would name Sigmund Freud. If you went a step further and asked what Freud was all about, they might respond with something like “All of your problems come from events that happened early in childhood” or “Problems are all rooted in past sexual frustrations”, or something along those lines. Unfortunately, this is what many people still think the entire field of psychology is about: our problems and their origins in the past.

But people are more than the sum of their past problems and weaknesses. People have strengths beyond physical and cognitive domains, like strengths of character: courage, empathy, and creativity. Rather than being strictly determined by the past, people can learn from experience, construct new goals, and use these as a guide to a better future. When tragedy strikes, most people recover naturally over time – some even grow from the experience. This side of people, the positive side, has received far less attention by psychologists during the last century.

Positive psychology is the small but quickly growing field of psychology that takes a scientific approach to understanding this side of us. Researchers in the field tackle broad questions about the nature of happiness (or “well-being”), how to measure it, and its causes and effects. Other questions include: are there reliable ways to become happier? (Spoiler: Yes.) Do optimism and a positive outlook really have an effect on our health? What are the connections between our sense of well-being and our experiences, possessions, relationships, and personality?

One of the pioneers in this field is psychologist Christopher Peterson, who defined the field as the scientific study of what goes right in life. Positive psychology, he argued, can’t just be about feel-good, gee-whiz fluff. It will be successful and useful only to the extent that it relies on scientific evidence. Peterson literally wrote the book on positive psychology and wrote extensively on happiness, character strengths, optimism, and the good life, both in academic journals and in his blog at Psychology Today.

The field suffered an immense loss when Peterson passed away suddenly in October. Fortunately, he left one final gift. Pursuing the Good Life: 100 Reflections on Positive Psychology is a collection of one hundred short pieces by Peterson, many of them reworked and updated since first appearing on his blog. It’s the perfect introduction to a fascinating field and a glimpse into the heart and mind of one of its founders.

 

 

P.S. I never actually got to meet Chris Peterson (he passed away just before I had the chance), but I had become familiar with much of his work as I learned about the field of positive psychology. One of his academic articles sticks out as a favorite of mine: a quick two-page piece on methodology, not positive psychology, titled “Minimally Sufficient Research”. In a nutshell, he reminds readers that many of the important findings in psychology came from extremely simple designs, and warns that knowing how to use complex methods is not a justification for using them. It’s a great little article that I’ve come back to several times when I needed some grounding. Read it here.

A miracle occurred (p < .10)

While reading an an article about statistical procedures from American Psychologist, I came across the following in a footnote

Interestingly, Enrico Fermi, the great physicist, thought p = .10 to be the wise operational definition of a “miracle” (Polanyi, 1961) …

Intrigued, I found the cited Polanyi article in a 1962 issue of journal Philosophy. Discussing the notion of unreasonable chances, Polyani notes that

The late Enrico Fermi is reported to have said that a miracle is an event the chances of which are less than one in ten. (page 3)

Without any source to follow, I tried Googling to find the original source of the “Fermi miracle” probability figure. This lead to a footnote in collection titled Cosmochemical Evolution and the Origins of Life, this time summarizing a comment by Harold Urey

The evolution of life is not a miracle in the ordinary sense of the word, but it might be a ‘Fermi-Miracle’. A first order Fermi-Miracle is defined as an event that occurs with a probability of 1/10. A Fermi-Miracle n-th order is an event that occurs with the probability of (1/10)^n. (page 9)

I also found other passing references in forum discussions to the miracle (p < .10) idea, each time attributed to Fermi, and one reference even lowered the threshold of a Fermi-Miracle to .2. But just as multiple sources seemed to converge, I came across the following quote attributed to Fermi, allegedly sourced to Owen Chamberlain,

A three standard deviation is a statistical fluctuation; a five standard deviation effect is a miracle.

This is, of course, a much more stringent cutoff, and it seems impossible that Fermi could have fluctuated between the p = .10 and the “five standard deviation” (or p = .0000003, roughly) threshold. Given that there are so many more references to the p = .10 cutoff, I find it more believably attributable to Fermi than the “five standard deviation” rule, but I’m surprised that I wasn’t able to nail it down to a sourced quote from Fermi. I suspect that there are many out there that are familiar enough with Fermi to clear this up, right?

Kaggle Post-Mortem: Psychopathy Prediction Blowout

Can Twitter expose psychopathy?

Over the last month and a half, the Online Privacy Foundation hosted a Kaggle competition, in which competitors attempted to predict psychopathy scores based on abstracted Twitter activity from a couple thousand users. One of the goals of the competition is to determine how much information about one’s personality can be extracted from Twitter, and by hosting the competition on Kaggle, the Online Privacy Foundation can sit back and watch competitors squeeze every bit of predictive ability out of the data, trying to predict the psychopathy scores of 1,172 Twitter users. Competitors can submit two sets of predictions each day, and each submission is scored from 0 (worst) to 1 (best) using a metric known as “average precision“. Essentially, a submission that predicts the correct ranking of psychopathy scores across all Twitter accounts will receive a score of 1.

Over the course of the contest, I made 42 submissions, making it my biggest Kaggle effort yet. Each submission is instantly scored and competitors are ranked on a public leaderboard according to their best submission. However, the public leaderboard score isn’t actually the “true” score – it is only an estimate based on a small portion of the submission. When the competition ends, five submissions from each competitor are compared to the full set of test data (all 1,172 Twitter accounts), and the highest scoring submission from each user is used to calculate the final score. By the end of the contest, I had slowly worked my way up to 2nd place on the public leaderboard, shown below.

 

Top of the public leaderboard. The public leaderboard scores are calculated during the competition by comparing users’ predictions to a small subset of the test data.

 

I held this spot for the last week and felt confident that I would maintain a decent spot on the private or “true” leaderboard. Soon after the competition closed, the private leaderboard was revealed. Here’s what I saw at the top:

Top of the private leaderboard. The private leaderboard is the “real” leaderboard, revealed after the contest is closed. Scores are calculated by comparing users’ predictions to the full set of test data.

 

Where’d I go? I scrolled down the leaderboard…. further… and further… and finally found my name:

 

My place on the private leaderboard. I dropped from 2nd place on the public leaderboard to 52nd on the private leaderboard. Notice I placed below the random forest benchmark!

 

Somehow I managed to fall from 2nd all the way down to 52nd! I wasn’t the only one who took a big fall: the top five users on the public leaderboard ended up in 64th, 52nd, 58th, 16th, and 57th on the private leaderboard, respectively. I even placed below the random forest benchmark, a solution publicly available from the start of the competition.

 

What happened?

After getting over the initial shock of dropping 50 places, I began sifting through the ashes to figure out what went so wrong. I think those with more experience already know, but one clue is in the screenshot of the pack leaders on the public leaderboard. Notice that the top five users, including myself, have a lot of submissions. For context, the median number of submissions in this contest was six. Contrast this with the (real) leaders on the private leaderboard – most have less than 12 submissions. Below, I’ve plotted the number of entries from each user against their final standing on the public and private leaderboards and added a trend line to each plot.

 

 

On the public leaderboard, more submissions are consistently related to a better standing.  It could be that the public leaderboard actually reflects the amount of brute force from a competitor rather than predictive accuracy. If you throw enough mud at a wall, eventually some of it will start to stick. The problem is that submissions that score well using this approach probably will not generalize to the full set of test data when the competition closes. It’s possible to overfit the portion of test data used to calculate the public leaderboard, and it looks like that’s exactly what I did.

Compare that trend in the public leaderboard to the U-shaped curve in the plot of the private leaderboard and number of submissions. After about 25 submissions or so, private leaderboard standings get worse with the number of submissions.

Poor judgments under uncertainty

Overfitting the public leaderboard is not unheard of, and I knew that it was a possibility all along. So why did I continue to hammer away at competition with so many submissions, knowing that I could be slowly overfitting to the leaderboard?

Many competitors with estimate the quality of their submissions prior to uploading them using cross-validation. Because the public leaderboard is only based on a small portion of the test data, it is only a rough estimate of the true quality of a submission, and cross-validation gives a sort of second opinion of a submission’s quality. For most of my submissions, I used 10-fold cross-validation to estimate the average precision. So throughout the contest, I could observe both the public leaderboard score and my own estimated score from cross-validation. After the contest closes, Kaggle reveals the private or “true” score of each submission. Below, I’ve plotted the public, private, and cv-estimated score of each submission by date.

 

Scores of my 42 submissions to the Psychopathy Prediction contest over time. Each submission has three corresponding scores: a public score, a private score, and a score estimated using 10-fold cross validation (cv). The public score of a submission is calculated instantly upon submission by comparing a subset of the predictions to a corresponding subset of the test data. The private score is the “true” score of a submission (based on the entire set of test data) and is not observable until the contest is finished. Every submission has a public and private score, but a few submissions are missing an CV-estimated score. The dotted line is the private score of the winning submission.

 

There are a few things worth pointing out here:

  • My cross-validation (CV) estimated scores (the orange line) gradually improve over time. So, as far as I know, my submissions are actually getting better as I go.
  • The private or “true” scores actually get worse over time. In fact, my first two submissions to the contest turned out to be my best (and I did not choose them as any of my final five submissions)
  • The public scores reach a peak and then slowly get worse at the end of the contest.
  • It is very difficult to see a relationship between any of these trends.

Below, I’ve replaced the choppy lines with smoothed lines to show the general trends.

 

An alternate plot of the submission scores over time using smoothers to depict some general trends.

 

Based on my experience with past contests, I knew that the public leaderboard could not be trusted fully, and this is why I used cross-validation. I assumed that there was a stronger relationship between the cross-validation estimates and the private leaderboard than between the public and private leaderboard. Below, I’ve created scatterplots to show the relationships between each pair of score types.

 

Scatterplots of three types of scores for each submission: 10-fold CV estimated score, public leaderboard score, and private or “true” score.

 

The scatterplots tell a different story. It turned out that my cross-validation estimates were not related to the private scores at all (notice the horizontal linear trends in those scatterplots), and the public leaderboard wasn’t any better.  I already guessed that the public leaderboard would be a poor estimate of the true score, but why didn’t cross-validation do any better?

I suspect this is because as the competition went on, I began to use much more feature selection and preprocessing. However, I made the classic mistake in my cross-validation method by not including this in the cross-validation folds (for more on this mistake, see this short description or section 7.10.2 in The Elements of Statistical Learning).  This lead to increasingly optimistic cross-validation estimates. I should have known better, but under such uncertainty, I fooled myself into accepting the most self-serving description of my current state. Even worse, I knew not to trust the public leaderboard, but when I started to edge towards to the top, I began to trust it again!

Lessons learned

In the end, my slow climb up the leaderboard was due mostly to luck. I chose my five final submissions based on cross-validation estimates, which turned out to be a poor predictor of true score. Ultimately, I did not include my best submissions in the final five, which would have brought me up to 33rd place – not all that much better than 52nd. All said, this was my most educational Kaggle contest yet. Here are some things I’ll take into the next contest:

  • It is easier to overfit the public leaderboard than previously thought. Be more selective with submissions.
  • On a related note, perform cross-validation the right way: include all training (feature selection, preprocessing, etc.) in each fold.
  • Try to ignore the public leaderboard, even when it is telling you nice things about yourself.

Some sample code

One of my best submissions (average precision = .86294) was actually one of my own benchmarks that took very little thought. By stacking this with two other models (random forests and elastic net), I was able to get it up to .86334. Since the single model is pretty simple, I’ve included the code. After imputing missing values in the training and test set with medians, I used the gbm package in R to fit a boosting model using every column in the data as predictor. The hyperparameters were not tuned at all, just some reasonable starting values. I used the internal cross-validation feature of gbm to choose the number of trees. The full code from start to finish is below:

library(gbm)

# impute.NA is a little function that fills in NAs with either means or medians
impute.NA <- function(x, fill="mean"){
  if (fill=="mean")
  {
    x.complete <- ifelse(is.na(x), mean(x, na.rm=TRUE), x)
  }

  if (fill=="median")
  {
    x.complete <- ifelse(is.na(x), median(x, na.rm=TRUE), x)
  }

  return(x.complete)
}

data <- read.table("Psychopath_Trainingset_v1.csv", header=T, sep=",")
testdata <- read.table("Psychopath_Testset_v1.csv", header=T, sep=",")

# Median impute all missing values
# Missing values are in columns 3-339
fulldata <- apply(data[,3:339], 2, FUN=impute.NA, fill="median")
data[,3:339] <- fulldata

fulltestdata <- apply(testdata[,3:339], 2, FUN=impute.NA, fill="median")
testdata[,3:339] <- fulltestdata

# Fit a generalized boosting model

# Create a formula that specifies that psychopathy is to be predicted using
# all other variables (columns 3-339) in the dataframe

gbm.psych.form <- as.formula(paste("psychopathy ~", 
                                   paste(names(data)[c(3:339)], collapse=" + ")))

# Fit the model by supplying gbm with the formula from above. 
# Including the train.fraction and cv.folds argument will perform 
# cross-validation 

gbm.psych.bm.1 <- gbm(gbm.psych.form, n.trees=5000, data=data,
                      distribution="gaussian", interaction.depth=6,
                      train.fraction=.8, cv.folds=5)

# gbm.perf will return the optimal number of trees to use based on 
# cross-validation. Although I grew 5,000 trees, cross-validation suggests that
# the optimal number of trees is about 4,332.

best.cv.iter <- gbm.perf(gbm.psych.bm.1, method="cv") # 4332

# Use the trained model to predict psychopathy from the test data. 

gbm.psych.1.preds <- predict(gbm.psych.bm.1, newdata=testdata, best.cv.iter)

# Package it in a dataframe and write it to a .csv file for uploading.

gbm.psych.1.bm.preds <- data.frame(cbind(myID=testdata$myID, 
                                         psychopathy=gbm.psych.1.preds))

write.table(gbm.psych.1.bm.preds, "gbmbm1.csv", sep=",", row.names=FALSE)

Created by Pretty R at inside-R.org

 

 

Highlights from my trip to useR! 2012

Frank Harrell giving opening comments to start off useR! 2012

 

This was the first year that I was able to attend useR!, the annual conference for R users, and I’m really glad I did. I had to make some tough choices when picking focus sessions, and I surely missed some of the good focus talks. Some of cool things I did get to see included:

  • Max Kuhn held an afternoon tutorial session on the caret package (caret is short for Classification And REgression Training). The caret package is a life-saver if you do any kind of predictive modeling. I had already known a little about the package, but I learned a lot form the tutorial. It turns out that caret does a lot of things that my own little custom R functions do (plus a LOT more), and caret does everything much more efficiently and easily. I also learned that e1071 is a reference to a course name and not some top secret project.
  • David Kahle demonstrated some cool features of the ggmap package. The ggmap package takes a lot of the headache out of grabbing a map from various web sources and using them as backdrops for ggplot2 plots. It also has some great functions for getting all kinds of geographical information directly from the R command line.
  • Yihui Xie, who would most likely to be voted the star of UseR! 2012, was everywhere – giving a focus talk, a lightning talk, and an invited talk – talking about several different contributions to R and doing so with a great sense of humor. It was hard to go a couple hours at the conference without hearing someone mentioning knitr, Yihui’s awesome package for making beautiful, reproducible reports even easier in R (or any other language). His presentation of cranvas, which can quickly create interactive plots of millions of points, was stunning, too. Most importantly, he managed to get “ALL YOUR BASE ARE BELONG TO US” on the giant screen in Langford Auditorium.
  • The lightning talk session was one of my favorite parts of the conference. Each presenter gets 5 minutes to present 15 slides, each which automatically advances every 20 seconds. Tim Hesterberg shared some details about his dataframe and aggregation packages in a lightning talk and elaborated further on them in an invited talk. Adopting the dataframe package seems like a no-brainer for most R users: simply install it and load it at the beginning of your R session and forget about it, all while enjoying a modest speed boost across a variety of R tasks. The performance increase comes from reducing the amount of copies of data made by R during common operations, and you can read more about the magic behind the packages at Tim’s site.
  • Jim Harner presented the rknn package, an R implementation of a random k-nearest neighbors approach for regression and classification. From the benchmarks provided, rknn looked to be a solid contender with random forests in both speed and predictive performance. I’ll definitely be trying this out in the near future.
  • The special invited discussion on “What other languages should R users know about?” confirmed my nagging suspicions about Julie and Python: I really have to learn them. Douglas Bates walked the audience through some Julia code (only after a nice musical intro), showing off some very impressive speed, and Chris Fonnesbeck showed off Python via the IPython notebook. After seeing the dataframe-like structures from the Pandas library, the barrier to trying out Python looks a little lower than I thought. To R/Python users: what is the closest Python equivalent of RStudio?

Coming to useR! this year also gave me a chance to spend some time in Nashville, a place that I came to love while living there throughout grad school. Walking around the peaceful Vanderbilt campus at night was the perfect way to mellow out after several hours of R each day. It will be a little more difficult to make it to useR! 2013, which will be held at the University of Castilla-La Mancha in Albacete, Spain, but I’ll do my best to make it!

Vanderbilt’s Peabody College campus at sunset

Don’t use the formula interface with randomForest()

Recently I was using randomForest() in R with a dataset containing about 8000 predictors. With smaller datasets, I could get away with using the formula interface (e.g., y ~ x1 + x2 … + x5) when fitting a model, but in this case, I couldn’t even get randomForest() to finish after handing it a giant formula.

Some googling led me to a great post about randomForest() by Chris Raimondi on the Heritage Health Prize forums that advised against the formula interface. Some posts at StackOverflow gave similar admonishments. And, of course, the manual for randomForest() suggests against it in cases with many predictors.

The alternative to the formula interface is to supply your response and predictors separately using the’ y=’ and ‘x=’ arguments and column indices. The toy example below demonstrates the gains from using this alternative. With 1000 predictors, randomForest() finishes in about 28 seconds with the formula interface and just under 6 seconds using the indexing alternative!

Try the code below to see for yourself:

# Create some data
library(randomForest)

# Create a dataframe with a categorical response and 1000 random continuous predictors
data <- data.frame(y=gl(2, 50, labels = 0:1),
                   matrix(rnorm(1e05), nrow=length(y), 
                   ncol=1000)) 

# Add some names
names(data) <- c("y", paste("X", c(1:1000), sep=""))

# Create a formula to describe the model. my.rf.formula is 'y ~ X1 + X2 + .... X999 + X1000'
my.rf.formula <- as.formula(paste("y ~ ", paste(names(data)[-1], collapse=" + "), sep="")) 

# First, run randomForest with the formula interface
system.time(rf.1 <- randomForest(my.rf.formula, mtry=10, ntree=2000, data=data, type="classification"))
# elapsed time: 28.24 s

# Now try it using column indicies 
system.time(rf.2 <- randomForest(y=data[,1], x=data[,-1], mtry=10, ntree=2000, type="classification"))
# elapsed time: 5.97 s

Created by Pretty R at inside-R.org

My magical Baader-Meinhof moment at the Fulton Street station

As I left the office this evening and headed towards the subway station, I turned on the most recent episode of one of my favorite new podcasts: Philosophy Bites. Very early in the discussion between the hosts, the guest philosopher Alain de Botton introduces the Greek word akrasia:

The ancient greeks have this wonderful word, akrasia – weakness of will. Arguments and positions can be very interesting and convincing, but we all suffer from weakness of will. And to combat weakness of will, you need training – mental training.

I hear this as I’m walking down the staircase into the subway station, half of my mind attending to the discussion and the other half trying to visualize the word akrasia in Greek letters. I pop through the turnstiles, peek down the tracks, then turn towards the wall behind me to see this:

Akrasia

Bizarre. This is easily my best personal example of the Baader-Meinhof Phenomenon.

How I Didn’t Do It: Finishing 36th in the Don’t Get Kicked! Competition.

My first run in a Kaggle competition finally ended recently with the closing of the “Don’t Get Kicked!” contest. It was an excellent learning experience and I had a lot of fun in the process (at the expense of sleep). I was pleasantly surprised to have ended up in 36th place out of 588 teams.

The task in this competition was to predict “bad buys” among thousands of cars purchased through online auctions. Contestants were given information on attributes such as purchase price, make and model of the car, trim level, odometer reading, date of purchase, state of origin, and so on. The original/training data included about 40 variables (including the “bad buy” outcome) on roughly 70,000 cars, and the test data had the same information on about 40,000 cars (minus the outcome). Contestants submit predictions, in the form of probabilities, about which of these 40,000 cars will turn out to be bad buys. Predictions are instantly scored against reality, and each set of predictions is scored using the Gini coefficient.

I submitted 30 sets of predictions over the course of the competition, using a variety of methods. Below I’ve plotted my results over time, along with the type of method behind each submission. The dashed grey line along the top is the score of the winning submission.

full size

Notice the big jump around October 10th? That was due to a novel insight: reading the competition directions. I had been submitting predicted classes (0s and 1s), not predicted probabilities between 0 and 1. Here’s the same plot, excluding the erroneous submissions:

full size

My best score resulted from a combination of a lot of boosted trees, random forests, and additive models. I used a weighted average of the predictions from each method, choosing the weights using cross-validation. Everything was done in R: boosting with the gbm package, random forests with the randomForests package, and additive models with the gam package.

My basic strategy was to create lots of plots and descriptive statistics, then create useful transformations and combinations of variables that could be fed into models as predictors. My predictor set grew from about 10 to 51 over the course of the competition:

full size

I had created about 150 predictors in the end but only about 50 proved to be useful. As you can see below, boosting was superior to most other methods given the same predictors.

full size

While a lot of fun, I’m glad to take a break from Kaggling this weekend and look forward to seeing what the winners did!

Struck by Kaggle

In the last five months, I’ve done few things that have made updating this blog a lot harder than anticipated: moved to New York, started a new job, got a dog, and discovered Kaggle. What’s Kaggle? Perhaps the best description comes from the Kaggle’s own site:

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems – the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

Competing on Kaggle is as easy as registering and uploading a .csv. Feedback on submissions is instant, and a public leaderboard constantly updated as participants submit. The available competitions provide nice range of problems to tackle, including predicting insurance claims, mapping dark matter, and modeling edits to Wikipedia.

Before I had ever heard of Kaggle, back in early 2010, I participated in my first modeling/prediction contest through the now-defunct Analytics X competition. In that contest, the goal was to predict the spatial distribution of homicides in Philadelphia. I spent many hours working on a single submission for that competition, and I remember being horrified by how poorly it ranked. Participant submissions were ranked according to the amount of prediction error (RMSE, I think), and mine was abysmally low.  My precious mutlilevel model was a total flop, and based on the relatively low error rates from everyone else on the public leaderboard, all of my fellow competitors were using some kind of voodoo to make their predictions. Confused and dismayed, I stopped working on the Analytics X competition and focused on finishing graduate school.

My Analytics X experience continued to haunt me. What happened? I did all of the right things according to my statistics textbooks! So, I started poking around for clues to my (and my model’s) failure. I began following discussions at r/MachingLearning and Cross-Validated, picked up Berk’s gentle Statistical Learning from a Regression Perspective, struggled through a lot of The Elements of Statistical Learning (free pdf here!).  At some point, I stumbled upon a wonderful presentation (video link here) titled “Getting in Shape for the Sport of Data Science” by Jeremy Howard, Kaggle’s President and Chief Scientist, and things started to take shape. Armed with a few new tricks, I was determined to give the prediction game another go.

Since then, I’ve entered two Kaggle competitions (the Photo Quality Prediction, which ended recently, and Don’t Get Kicked!, which is still ongoing) and have had a lot of fun. I’m very far from even placing a competition, but I’m fairing much better thanks to the tools and techniques outside of the standard social science toolbox.

I can’t overstate the value of approaching statistical modeling with the goal of accurate prediction rather than hypothesis testing, even if hypothesis testing is your job. The weaknesses of the traditional social science methods (linear regression, generalized linear models) become obvious very quickly. You are forced to try new techniques, scour your data for patterns, and get creative with transforming variables. You can’t get too attached to any given model. Instead, you need to be critical with every aspect of your model, do lots of tinkering, and be prepared to trash it if you hit a wall.

Many Kaggle competitors submit 50+ entries throughout a given contest. I’ve already submitted about 25 attempts in a current competition, making small changes each time. This iterative approach encourages a deeper level of understanding of the data than, say, testing a model’s goodness-of-fit or  whether a certain coefficient was different from zero. I think this is counterintuitive to those who assume that the world of machine learning and statistical prediction is all black boxes without much concern for the underlying processes that generated the data.

My experience with Kaggle has left me wondering why predictive accuracy isn’t more important in social science. Sure, good prediction does not equal a good theory, but shouldn’t a good theory result in good predictions? Yet I can’t remember the last time I ever saw any kind of cross-validation in a paper. If there was an academic psychology journal that hosted Kaggle-style competitions between different theoretical camps, I’d read it every month. I’d even pay for it! Until then, I’ll continue reading No Free Hunch.

 

 

 

Another look at crime in Chicago

*** The original post and video has been edited to emphasize that the data is actually reports of crimes and not actual crimes. I also added a background music track by Alastair Cameron to the final video, available from Vimeo. ***

About a month ago, Drew Conway posted a fascinating animation based on one year’s worth of crime report data from the City of Chicago. An important feature of the animation is that crime reports are plotted in 10-day windows over the course of a year. I wanted to see how differently things would look if crime reports could accumulate over time, reflecting the idea that crimes can have long-lasting effects on neighborhoods. Luckily for me, Drew posted all of the code used to create the animation and used all open source tools (R, ggplot2, ImageMagick, and ffmpeg), so recreating a modified version of the video only required a few small changes. Try viewing the HD version on full-screen to see all the details.

 

 

Crime reports are placed into 33 different categories in the original dataset, so I simplified things by putting most of the reports into one of three categories: violent, property, and “vice”. For example, violent crime includes (but is not limited to) assault and battery, property crime includes robbery and theft, and vice crime includes narcotics, prostitution, liquor law violations, and gambling. The running daily crime report rate for each of these categories is shown along the bottom of the video. A couple interesting points:

  • There are very few places that are completely untouched by crime reports, but some areas are almost totally free of violent crime reports.
  • As Drew pointed out, crimes drop during the cold season. However, this drop seems to be limited to property and violent crime reports. Cold weather doesn’t slow down vice as much. On the other hand, vice crimes have the lowest rates over the course of the year, so small changes may be hard to visual detect in these plots, even if they are in proportion to the drops in violent and property crimes.
  • Plotting daily crime report counts revealed a few special days. All crimes, especially property crimes, drop on Christmas, but violent crime reports explode on New Year’s Eve. The big drop on February 2nd is most likely due to the 2011 Groundhog Day blizzard that began burying the city on February 1st. A mysterious outlier is April 10th, which was the most violent day of the year (71 assaults, 348 battery reports, and 3 homicides). My searches of Chicago newspapers and blogs from around this date were fruitless. I’d be interested to hear if there is any significance to this date.

Looking at the accumulated crimes, I was curious about the empty spaces and lines that remained. I’ve only been to Chicago a few times and don’t know much about it, so the original boundaries in the map (police districts) are fairly meaningless to me. I needed to get a better sense how the data mapped on to a more human geography of the city. So, to explore the data further, I imported the plot into Google Earth and overlaid the image over satellite imagery of the city. After a bit of resizing, the image fit quite well. Next, by adjusting the transparency of the plot, I was able to see the streets, houses, parks, and neighborhoods that lied under the giant red and green blotches. This revealed a few more interesting bits:

  • The empty rectangular spots in the crime reports map are almost always the locations of public parks and airports. I doubt that these places are actually crime free (in fact, a small number of crimes are located in the heart of some parks), so perhaps crime reports in these areas is recorded differently by the City.
  • Long empty lines in the crime map are generally the location of highways.
  • Abrupt changes in the crime density are associated with geographic barriers. For example, on the southwestern side of the map, there is a sudden drop in crime reports as one goes westward. A closer look in Google Earth shows that this shift maps perfectly with a barrier created by Major Taylor Trail/South Beverly Avenue and I-57.
  • Wrigley Field is filled with blue.

I had a lot of fun looking around Chicago this way, so I put together a second, longer video that combines the animation with a short “tour” of the city and some of the neighborhoods. Of course, there are a lot of happy, fun, and non-criminal things going on in Chicago, but until the City starts releasing data on puppies and hugs in the City, I’m stuck with this. R code is below.

 

 

# This code is a modified version of Drew Conway's code found at
# https://github.com/drewconway/ZIA/tree/master/R/Chicago

# Load libraries
library(ggplot2)
library(maptools)

# Read in Crimes.csv
crimes <- read.csv("Crimes.csv", stringsAsFactors=FALSE)

# # Get count of crimes by day, type, and location.
# # Notice the typo in the header for this data :) 

crimes.by.day <- ddply(crimes, .(DATE..OF.OCCURRENCE, PRIMARY.DECSRIPTION, X.COORDINATE, Y.COORDINATE),
       			summarise, COUNT=length(ARREST))
# 
# # Fix date and sort
crimes.by.day$DATE <- as.Date(crimes.by.day$DATE..OF.OCCURRENCE, format="%m/%d/%Y")
crimes.by.day <- crimes.by.day[with(crimes.by.day, order(DATE)),]
write.csv(crimes.by.day, "crimes_by_day.csv", row.names=FALSE)

# Load in clean data
crimes.by.day <- read.csv("crimes_by_day.csv", stringsAsFactors=FALSE)
crimes.by.day$DATE <- as.Date(crimes.by.day$DATE)

# Find the daily count of each crime type for each day 
# cumul.crimes has the daily cumulative crimes for each primary crime type

cumul.crimes <- ddply(crimes.by.day, .(PRIMARY.DECSRIPTION, DATE), summarise, DAYCOUNT=sum(COUNT),
                      .progress="text")
cumul.crimes <- cumul.crimes[with(cumul.crimes, order(DATE)),]

# Creater broader categories of crimes 

# Violent Crimes only
vio_crimes <- c("BATTERY", "ASSAULT", "CRIM SEXUAL ASSAULT", "HOMICIDE", "INTIMIDATION")

# Property Crimes only
prop_crimes <- c("THEFT","CRIMINAL DAMAGE", "BURGLARY", "MOTOR VEHICLE THEFT", "ROBBERY", "CRIMINAL TRESPASS")

# "Vice" crimes only
vice_crimes <- c("NARCOTICS", "PROSTITUTION", "GAMBLING", "LIQUOR LAW VIOLATION")

cumul.crimes$MAJORTYPE <- NA

cumul.crimes$MAJORTYPE <- with(cumul.crimes, ifelse(match(PRIMARY.DECSRIPTION, vio_crimes) & is.na(MAJORTYPE),
                                                    "VIOLENT", MAJORTYPE))

cumul.crimes$MAJORTYPE <- with(cumul.crimes, ifelse(match(PRIMARY.DECSRIPTION, prop_crimes) & is.na(MAJORTYPE),
                                                    "PROPERTY", MAJORTYPE))

cumul.crimes$MAJORTYPE <- with(cumul.crimes, ifelse(match(PRIMARY.DECSRIPTION, vice_crimes) & is.na(MAJORTYPE),
                                                    "VICE", MAJORTYPE))

cumul.crimes <- subset(cumul.crimes, !is.na(MAJORTYPE))

# Total crimes is used for line plot
total.crimes <- ddply(cumul.crimes, .(DATE, MAJORTYPE),
                      summarise, CUMDAYCOUNT=sum(DAYCOUNT), .progress="text")
total.crimes <- total.crimes[with(total.crimes, order(DATE)),]

total.crimes$CUMYEARCOUNT <- NA

total.crimes$CUMYEARCOUNT[which(total.crimes$MAJORTYPE=="PROPERTY")] <- with(total.crimes,
                                                                             cumsum(CUMDAYCOUNT[which(MAJORTYPE=="PROPERTY")]))

total.crimes$CUMYEARCOUNT[which(total.crimes$MAJORTYPE=="VIOLENT")] <- with(total.crimes,
                                                                             cumsum(CUMDAYCOUNT[which(MAJORTYPE=="VIOLENT")]))

total.crimes$CUMYEARCOUNT[which(total.crimes$MAJORTYPE=="VICE")] <- with(total.crimes,
                                                                         cumsum(CUMDAYCOUNT[which(MAJORTYPE=="VICE")]))

total.crimes$MAJORTYPE <- as.factor(total.crimes$MAJORTYPE)

# read in shapefile of Chicago Police Department's (CPD) Beats
cpd.shp <- readShapePoly("cpd_beats.shp")	

## fortify.SpatialPolygons() will convert a Spatial Polygon into a regular dataframe of coordinates

cpd.df <- fortify.SpatialPolygons(cpd.shp)

# major.crimes.by.day is used for new geospatial plot

crimes.by.day$MAJORTYPE <- NA

crimes.by.day$MAJORTYPE <- with(crimes.by.day, ifelse(match(PRIMARY.DECSRIPTION, vio_crimes) & is.na(MAJORTYPE),
                                                    "VIOLENT", MAJORTYPE))

crimes.by.day$MAJORTYPE <- with(crimes.by.day, ifelse(match(PRIMARY.DECSRIPTION, prop_crimes) & is.na(MAJORTYPE),
                                                    "PROPERTY", MAJORTYPE))

crimes.by.day$MAJORTYPE <- with(crimes.by.day, ifelse(match(PRIMARY.DECSRIPTION, vice_crimes) & is.na(MAJORTYPE),
                                                    "VICE", MAJORTYPE))

major.crimes.by.day <- subset(crimes.by.day, !is.na(MAJORTYPE))

# This loop will create the ggplots for the map and line plots for each day in date.range

date.range <- seq.Date(as.Date(min(major.crimes.by.day$DATE)), as.Date(max(major.crimes.by.day$DATE)), "days")

day <- 1

mycolors <- c("PROPERTY"= "darkolivegreen3","VICE"="dodgerblue", "VIOLENT"="firebrick1")

for(day in 1:length(date.range))
{

  line.plot  <- ggplot(subset(total.crimes, DATE <= date.range[day]), aes(x=DATE, y=CUMDAYCOUNT, group=MAJORTYPE)) +
                  geom_line(aes(colour=MAJORTYPE), lwd=1)+
                  theme_bw()+
                  scale_colour_manual(name="Type of Crime", values=mycolors,
                                     breaks=c(1,3,2), labels=c("Property", "Violent", "Vice"),
                                     legend=FALSE) +
                  xlab("")+ ylab("Crimes Per Day") +
                  opts(axis.title.y = theme_text(colour="black", size = 10, hjust=1, angle=90)) +
                  scale_y_continuous(limits=c(0, 700))+
                  scale_x_date(limits=c(as.Date("2010-06-01", format="%Y-%m-%d"),
                                            as.Date("2011-07-01", format="%Y-%m-%d")),
                                            format="%b\n%Y", major="1 month")

    day.plot <- ggplot(cpd.df, aes(x=long, y=lat)) +geom_path(aes(group=group))  

		day.plot <- day.plot+
                theme_bw()+
                opts(title=paste("Cumulative Crime in Chicago\n",
							    	strftime(min(major.crimes.by.day$DATE), format="%b %d, %Y"),"to",
                                 strftime(date.range[day], format="%b %d, %Y")),
							    	panel.grid.major=theme_blank(),
                    panel.grid.minor=theme_blank(),
								    axis.text.x=theme_blank(),
                    axis.text.y=theme_blank(),
                    axis.ticks=theme_blank())

	if(nrow(cpd.df) > 0)  {

    day.plot <- day.plot+
                  geom_point(data=subset(major.crimes.by.day, DATE <= date.range[day]),
                               aes(x=X.COORDINATE, y=Y.COORDINATE,
                               color=MAJORTYPE, size=5, alpha=.01))+
                  theme_bw()+
  			          scale_size(legend=FALSE, to=c(.7))+
                  scale_alpha(legend=FALSE, to=c(.01, .8))+
				          scale_colour_manual(name="Type of Crime", values=mycolors,
                                    breaks=c("PROPERTY","VIOLENT", "VICE"),
                                    labels=c("Property", "Violent", "Vice"))+
                  opts(panel.grid.major = theme_blank())+
                  opts(panel.grid.minor = theme_blank())+
  								opts(axis.text.x = theme_blank())+
                  opts(axis.text.y = theme_blank())+
                  opts(axis.ticks = theme_blank()) +
                  opts(legend.position = c(.87,.78))+
                  opts(legend.text = theme_text(size = 14)) +
                  opts(legend.title = theme_text(size = 16, face = "bold", hjust = 0))+
                  opts(legend.key.size = unit(2, "lines"))+
                  xlab("") + ylab("") 

    grob <- ggplotGrob(day.plot)

    grob <- geditGrob(grob, size=unit(8, "mm"), "key.points", grep=T) 

    grid.newpage()

    grid.draw(grob)
	}

	# Save image

  png(filename=paste("maps/", day, ".png", sep=""), units="in",
      res=300, height=9, width=10)
  grid.draw(grob)
  dev.off()

	ggsave(plot=line.plot,	filename=paste("timelines/", day, "_time.png", sep=""),
			height=2, width=10)

	day <- day + 1

  print(day)
}

## standalone plots

## Daily Crime Rate plot

 line.plot  <- ggplot(subset(total.crimes, DATE <= date.range[358]), aes(x=DATE, y=CUMDAYCOUNT, group=MAJORTYPE)) +
                  geom_line(aes(colour=MAJORTYPE), lwd=1)+
                  theme_bw()+
                  scale_colour_manual(name="Type of Crime", values=mycolors,
                                     breaks=c(1,3,2), labels=c("Property", "Violent", "Vice"),
                                     legend=FALSE) +
                  xlab("")+ ylab("Crimes Per Day") +
                  opts(axis.title.y = theme_text(colour="black", size = 10, hjust=.5, angle=90)) +
                  scale_y_continuous(limits=c(0, 700))+
                  scale_x_date(limits=c(as.Date("2010-06-01", format="%Y-%m-%d"),
                                            as.Date("2011-07-01", format="%Y-%m-%d")),
                                            format="%b\n%Y", major="1 month")

## Map of crimes plot

    day.plot <- ggplot(cpd.df, aes(x=long, y=lat)) +geom_path(aes(group=group))  

		day.plot <- day.plot+
                theme_bw()+
                opts(title=paste("Crime in Chicago\n June 20, 2010 to June 13, 2011"),
							    	panel.grid.major=theme_blank(),
                    panel.grid.minor=theme_blank(),
								    axis.text.x=theme_blank(),
                    axis.text.y=theme_blank(),
                    axis.ticks=theme_blank())

    day.plot <- day.plot+
                  geom_point(data=major.crimes.by.day,
                               aes(x=X.COORDINATE, y=Y.COORDINATE,
                               color=MAJORTYPE, size=5, alpha=.01))+
                  theme_bw()+
  			          scale_size(legend=FALSE, to=c(.7))+
                  scale_alpha(legend=FALSE, to=c(.01, .8))+
				          scale_colour_manual(name="Type of Crime", values=mycolors,
                                    breaks=c("PROPERTY","VIOLENT", "VICE"),
                                    labels=c("Property", "Violent", "Vice"))+
                  opts(panel.grid.major = theme_blank())+
                  opts(panel.grid.minor = theme_blank())+
  								opts(axis.text.x = theme_blank())+
                  opts(axis.text.y = theme_blank())+
                  opts(axis.ticks = theme_blank()) +
                  opts(legend.position = c(.87,.78))+
                  opts(legend.text = theme_text(size = 14)) +
                  opts(legend.title = theme_text(size = 16, face = "bold", hjust = 0))+
                  opts(legend.key.size = unit(2, "lines"))+
                  xlab("") + ylab("") 

    grob <- ggplotGrob(day.plot)

    grob <- geditGrob(grob, size=unit(8, "mm"), "key.points", grep=T) 

    grid.newpage()

    grid.draw(grob)

Created by Pretty R at inside-R.org