Random forest regression: drop-column importance

nikos_geo · April 9, 2024, 11:32am

I am running a random forest regression (RFR) task and want to apply the drop-column importance strategy. The basic idea of this strategy is:

to get a baseline performance score as with permutation importance, but then drop a column entirely, retrain the model, and recompute the performance score. The importance value of a feature is then the difference between the baseline and the score from the model missing that feature.

I found this strategy here and here.

Using the ranger package, how can I implement this strategy so that I end up with a final model that contains only the most important predictors (as ranked by drop-column importance), and how can I print those variables?

library(ranger)

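set.seed(42)  # make the train/test split reproducible

# Hold out one third of the rows for testing
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]

# Species is a factor, so this call fits a classification forest
rg.iris <- ranger(Species ~ .,
                  data = iris.train,
                  num.trees = 101,
                  importance = "permutation")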

Windows 11, R 4.3.3, RStudio 2023.12.1 Build 402.

Max · April 9, 2024, 1:45pm

You absolutely have to cross-validate/resample the elimination process itself. If you select features on the same data you use to measure performance, the resulting estimates will be overly optimistic (this is called "selection bias" in the context of feature selection).

We've written a ton about this and there are a lot of references that back that up, the best being Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. It's relevant even if you are not using the same type of data.
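To make the idea concrete, here is a minimal sketch of drop-column importance computed on held-out folds with ranger. It is one way to do it, not a reference implementation. The regression target (`Sepal.Length`), the number of folds, and the helper `fit_rmse()` are assumptions made for illustration. Importance is taken as the increase in held-out RMSE when a column is dropped and the forest is retrained.

```r
library(ranger)

set.seed(42)

# Assumption: Sepal.Length stands in for the real regression target.
dat <- iris
response <- "Sepal.Length"
features <- setdiff(names(dat), response)

# Hypothetical helper: train on `train` using only `vars`, return RMSE on `test`.
fit_rmse <- function(vars, train, test) {
  fit <- ranger(
    dependent.variable.name = response,
    data = train[, c(response, vars), drop = FALSE],
    num.trees = 101
  )
  pred <- predict(fit, data = test[, vars, drop = FALSE])$predictions
  sqrt(mean((test[[response]] - pred)^2))
}

# Assign every row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Rows are folds, columns are features.
imp <- matrix(NA_real_, nrow = k, ncol = length(features),
              dimnames = list(NULL, features))

for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  baseline <- fit_rmse(features, train, test)
  for (p in features) {
    # Positive value: held-out error went up when `p` was dropped.
    imp[i, p] <- fit_rmse(setdiff(features, p), train, test) - baseline
  }
}

# Average over folds and rank from most to least important.
drop_col_imp <- sort(colMeans(imp), decreasing = TRUE)
print(drop_col_imp)
```

Each forest is refit with fresh randomness, so small or slightly negative values are best read as noise. If you then pick features from these scores, the whole select-then-refit procedure still needs to sit inside an outer resampling loop for the final performance estimate to be honest.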

caret probably has the best interface for this right now (but we are working on it with tidymodels).
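As a rough sketch of what that looks like, caret's `rfe()` wraps the elimination in resampling. Note the assumptions: this uses the `rfFuncs` helper set, which fits models with the randomForest package (not ranger) and ranks predictors by the forest's own importance measure rather than by drop-column importance. The outcome (`Sepal.Length`), the candidate subset sizes, and the fold count are illustrative choices.

```r
library(caret)  # rfFuncs also needs the randomForest package installed

set.seed(42)

# Assumption: Sepal.Length is the regression target, numeric columns are predictors.
x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Sepal.Length

# Resample the whole elimination process with 5-fold cross-validation.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

rfe_fit <- rfe(x = x, y = y, sizes = 1:2, rfeControl = ctrl)

# Resampled performance for each subset size that was tried.
print(rfe_fit$results)

# Names of the predictors kept in the selected subset.
selected <- predictors(rfe_fit)
print(selected)
```

The resampled results table is the performance estimate to report, because the selection step was repeated inside every fold.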
