Don’t use the formula interface with randomForest()
Recently I was using randomForest() in R with a dataset containing about 8000 predictors. With smaller datasets, I could get away with using the formula interface (e.g., y ~ x1 + x2 … + x5) when fitting a model, but in this case, I couldn’t even get randomForest() to finish after handing it a giant formula.
Some googling led me to a great post about randomForest() by Chris Raimondi on the Heritage Health Prize forums that advised against the formula interface. Several posts on Stack Overflow gave similar warnings. And, of course, the manual for randomForest() advises against it when there are many predictors.
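To get a feel for the cost of handling a large formula, separate from the forest itself, here is a rough sketch. It rests on my understanding that the formula method builds a model frame before fitting, so it times model.frame() on a wide formula against plain column selection. The names wide, n.pred, and wide.formula are made up for this illustration, and the timings will vary by machine.

```r
# Hypothetical illustration: time the formula handling by itself, with no forest involved.
set.seed(1)
n.pred <- 200  # assumed size, kept small so this runs quickly

# 100 rows: a two-level factor response plus n.pred random continuous predictors
wide <- data.frame(y = gl(2, 50, labels = 0:1),
                   matrix(rnorm(100 * n.pred), nrow = 100, ncol = n.pred))
names(wide) <- c("y", paste0("X", seq_len(n.pred)))

# Build 'y ~ X1 + X2 + ... + X200'
wide.formula <- as.formula(paste("y ~", paste(names(wide)[-1], collapse = " + ")))

# Cost of turning the formula into a model frame
formula.time <- system.time(mf <- model.frame(wide.formula, data = wide))

# Cost of selecting the predictor columns directly
index.time <- system.time(x <- wide[, -1])

print(formula.time["elapsed"])
print(index.time["elapsed"])
```

Raising n.pred should make the gap between the two timings easier to see.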
The alternative to the formula interface is to supply your response and predictors separately using the ‘y=’ and ‘x=’ arguments and column indices. The toy example below demonstrates the gains from using this alternative. With 1000 predictors, randomForest() finished in about 28 seconds with the formula interface and just under 6 seconds with the indexing alternative in my test!
Try the code below to see for yourself:
# Create some data
library(randomForest)

# Create a data frame with a categorical response and 1000 random continuous predictors
# (100 rows, matching the 100 values of y)
data <- data.frame(y = gl(2, 50, labels = 0:1),
                   matrix(rnorm(1e05), nrow = 100, ncol = 1000))

# Add some names
names(data) <- c("y", paste("X", 1:1000, sep = ""))

# Create a formula to describe the model.
# my.rf.formula is 'y ~ X1 + X2 + ... + X999 + X1000'
my.rf.formula <- as.formula(paste("y ~ ", paste(names(data)[-1], collapse = " + "), sep = ""))

# First, run randomForest with the formula interface
system.time(rf.1 <- randomForest(my.rf.formula, mtry = 10, ntree = 2000, data = data, type = "classification"))
# elapsed time: 28.24 s

# Now try it using column indices
system.time(rf.2 <- randomForest(y = data[, 1], x = data[, -1], mtry = 10, ntree = 2000, type = "classification"))
# elapsed time: 5.97 s
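If you would rather pick predictors by name than by position, the same ‘x=’ and ‘y=’ call works with a character vector of column names. The sketch below is a minimal, self-contained variation on a smaller dataset. The names small, predictor.names, and rf.small are my own for this illustration, and the mtry and ntree values are arbitrary.

```r
library(randomForest)

set.seed(42)

# Small stand-in dataset: 100 rows, a two-level factor response, 20 random predictors
small <- data.frame(y = gl(2, 50, labels = 0:1),
                    matrix(rnorm(100 * 20), nrow = 100, ncol = 20))
names(small) <- c("y", paste0("X", 1:20))

# Every column except the response, selected by name rather than by index
predictor.names <- setdiff(names(small), "y")

# y is a factor, so randomForest() fits a classification forest
rf.small <- randomForest(x = small[, predictor.names], y = small$y,
                         mtry = 4, ntree = 200)
print(rf.small)
```

Selecting by name keeps working if the response column moves or extra columns get added to the data frame later.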