Caret - recursive feature elimination (with upsampling?)

I haven't been able to find the answer in Applied Predictive Modeling or the caret documentation, but maybe you guys can help.

What is the right way to do RFE on a highly imbalanced classification problem so that the procedure learns the patterns describing the minority class well? The problem I'm facing right now is that, with a 5:95 target class ratio, the outcome of RFE is not really representative of the patterns I'm trying to discover. I couldn't find any way to use upsampling in the rfe function itself. Is there another way of doing that?


Have you considered doing the upsampling before the RFE? Upsample the data, and then pass it to RFE?

That wouldn't really make sense in the resampling context, since model performance would then be estimated at least partly on the same observations the model was built on, right? So the upsampling would need to happen within the resampling itself, if I'm not mistaken.
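To make the leak concrete, here is a minimal base R sketch on toy data. The helper `upsample_minority` is a hand-rolled, hypothetical stand-in for an up-sampling function (it is not caret's implementation), and the 80/20 hold-out is only an illustration of one resample.

```r
# Toy data: 95 majority rows and 5 minority rows, each with a unique id.
set.seed(1)
dat <- data.frame(id = 1:100,
                  Class = factor(rep(c("maj", "min"), times = c(95, 5))))

# Hypothetical helper: replicate minority rows (with replacement)
# until both classes have the same number of rows.
upsample_minority <- function(d) {
  n_maj    <- sum(d$Class == "maj")
  min_rows <- which(d$Class == "min")
  extra    <- min_rows[sample.int(length(min_rows),
                                  n_maj - length(min_rows),
                                  replace = TRUE)]
  d[c(seq_len(nrow(d)), extra), ]
}

# Wrong order: up-sample first, then hold out 20% of the rows.
up_first <- upsample_minority(dat)
hold     <- sample(nrow(up_first), size = round(0.2 * nrow(up_first)))
leaked   <- intersect(up_first$id[hold], up_first$id[-hold])

# Right order: hold out first (19 majority + 1 minority row),
# then up-sample only the remaining training rows.
hold2    <- c(sample(which(dat$Class == "maj"), 19),
              sample(which(dat$Class == "min"), 1))
train_up <- upsample_minority(dat[-hold2, ])
leaked2  <- intersect(dat$id[hold2], train_up$id)

length(leaked)   # ids present on both sides of the split (minority copies)
length(leaked2)  # 0: no held-out row appears in the up-sampled training set
```

In the first case, copies of the same minority observation end up in both the training and the held-out part, so the performance estimate is optimistic. In the second case the held-out rows stay untouched.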


Yes. This is shown pretty well on the caret page for subsampling.

All of the RFE methods in caret are based on function modules such as lmFuncs. You would have to make a copy of that and edit the fit part to run one of the sampling functions (like caret:::downSample) on the data just prior to the fit.



I tried something like this:

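rf_fit <- function(x, y, first, last, ...){

  # Up-sample the data handed to this fit; the outcome comes back as column "Class"
  df_up <- caret::upSample(x, y)

  randomForest::randomForest(
    dplyr::select(df_up, -Class),
    df_up$Class,
    importance = (first | last),
    ...)
}

new_rf <- rfFuncs

new_rf$summary <- rf_stats
new_rf$fit <- rf_fit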

but from the model performance point of view (sensitivity vs. specificity) I don't see any difference. Am I doing something wrong here?

No, that looks fine to me. I've had more luck with down-sampling than up-sampling.

Also, since it is random forest, you can have the model internally down-sample the data for each tree.
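A minimal sketch of both down-sampling options as RFE fit functions, assuming the caret and randomForest packages are installed. The names `rf_down` and `rf_strata` are made up for illustration. The first variant mirrors the up-sampling code above with `caret::downSample`; the second keeps all rows and uses randomForest's `strata` and `sampsize` arguments so that each tree draws a balanced sample.

```r
library(caret)

# Variant 1: down-sample the rows handed to each RFE fit, then fit as usual.
rf_down <- rfFuncs
rf_down$fit <- function(x, y, first, last, ...) {
  loadNamespace("randomForest")
  df_down <- caret::downSample(x, y)  # outcome is returned in column "Class"
  randomForest::randomForest(
    df_down[, names(df_down) != "Class", drop = FALSE],
    df_down$Class,
    importance = (first | last),
    ...
  )
}

# Variant 2: keep all rows, but have every tree sample the same number
# of observations from each class (the size of the smallest class).
rf_strata <- rfFuncs
rf_strata$fit <- function(x, y, first, last, ...) {
  loadNamespace("randomForest")
  n_min <- min(table(y))
  randomForest::randomForest(
    x, y,
    strata = y,
    sampsize = rep(n_min, nlevels(y)),
    importance = (first | last),
    ...
  )
}
```

Either list can then be passed to `rfeControl(functions = ...)` in place of `rfFuncs`. Because the sampling happens inside `fit`, it is repeated separately within every resample.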


I can confirm - it worked pretty well with down-sampling. Thanks again for your great help!


Dear konradino,

Great! You seem to be the only one who has asked about this. I am also trying to change rfFuncs to do the sampling inside rfe.
However, I do not know what your function rf_stats does, so I cannot include it in new_rf$summary.
In the end, rfe does not work for me because it does not recognize -Class.
Sorry for asking, but I am a total noob.
What's your code for rf_stats?

Dear konradino,
I would really love to hear from you regarding my problem. It would be extremely helpful for my data set.

He didn't show that function, but it is a function that calculates model performance. See the documentation for details on what it should be.
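As a rough idea of what such a function looks like, here is a hypothetical stand-in written in base R (this is not konradino's actual rf_stats, which was never posted). It assumes the usual shape of a caret summary function: it receives a data frame with the columns `obs` and `pred`, plus the class levels in `lev`, and returns a named numeric vector. It also assumes the first factor level is the class of interest.

```r
# Hypothetical replacement for the unseen rf_stats.
rf_stats_sketch <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  pos    <- lev[1]              # first level is treated as the event class
  is_pos <- data$obs == pos
  c(Sens     = mean(data$pred[is_pos] == pos),
    Spec     = mean(data$pred[!is_pos] != pos),
    Accuracy = mean(data$pred == data$obs))
}

# Small hand-checkable example: 2 "yes" and 3 "no" observations.
toy <- data.frame(
  obs  = factor(c("yes", "yes", "no", "no", "no"), levels = c("yes", "no")),
  pred = factor(c("yes", "no", "no", "no", "yes"), levels = c("yes", "no"))
)
rf_stats_sketch(toy, lev = levels(toy$obs))
# Sens = 0.5, Spec = 2/3, Accuracy = 0.6
```

The names of the returned vector are what the `metric` argument of rfe has to match. caret's own `twoClassSummary` (ROC, Sens, Spec) and `defaultSummary` are ready-made alternatives.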

Dear Max,

First of all, thank you for this brilliant package and your effort to answer all the questions around it. I love caret! It is built so that even I (a surgeon by trade) am able to understand and use it.
Following konradino's code, I changed rfFuncs to:

#up rfe
new_rf_up <- rfFuncs
new_rf_up$summary <- twoClassSummary
new_rf_up$fit <- function(x, y, first, last, ...){
  df_up <- caret::upSample(x, y)
  randomForest::randomForest(
    df_up[ , names(df_up) != "Class"],
    df_up$Class,
    importance = (first | last),
    ...)
}

control.up <- rfeControl(functions=new_rf_up, method="cv", repeats=10, verbose = FALSE,seeds=NULL)
rfe.unbal.silver.up <- rfe(unbal_train_silver[,c(1:41,43:57)], unbal_train_silver[,42], sizes=c(1:57), metric = "Spec", summary=twoClassSummary, rfeControl=control.up, trainControl = fitControl)

As mentioned by you here:
I can change the function to use ROC as the performance measure. I then wanted to use metric="Spec", which should work when twoClassSummary is specified.

However, I get the following error message:
Error in { : task 1 failed - "argument is of length zero"

What is wrong?

Thank you again!
Markus Schoenberg

Do you possibly have a solution for my problem here?
Thank you!
Markus Schoenberg