You have not provided your data, so I have created an example using the mtcars dataset that ships with R. I have also used the modern {tidymodels} framework to replicate the {caret} approach shown above; {caret} has largely been superseded by {tidymodels}, although it has not been formally deprecated.
library(tidymodels)
library(themis)
set.seed(12345)
# Prepare some example data
data <-
  mtcars %>%
  mutate(vs = factor(vs))
# Split the data 70/30 for training/testing
train_test_split <- initial_split(data, prop = 0.7, strata = 'vs')
# Extract training data
train <- training(train_test_split)
# Create 5-fold 3-repeat CV splits
cv_folds <- vfold_cv(train, v = 5, repeats = 3, strata = 'vs')
# Simple recipe with ROSE
rec <- recipe(vs ~ ., data = train) %>%
  step_rose(vs)
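# Optional sanity check (not required for the workflow itself): prep and bake
# the recipe to confirm that ROSE yields a balanced training set
rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(vs)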
# Specify model
model <- rand_forest(
  mode = 'classification',
  engine = 'randomForest',
  mtry = tune(),
  trees = tune(),
  min_n = tune()
)
# Define workflow
wflow <- workflow(
  preprocessor = rec,
  spec = model
)
# Perform CV
res <-
  tune_grid(
    object = wflow,
    resamples = cv_folds,
    control = control_grid(save_pred = TRUE)
  )
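# Optional: inspect resampled performance for each candidate combination of
# hyperparameters before selecting one
collect_metrics(res)
show_best(res, metric = 'roc_auc')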
# Find the best hyperparameters for the randomForest model
best_hyperparams <- select_best(res, metric = 'roc_auc')
# Add those hyperparams to the workflow
final_wflow <- finalize_workflow(wflow, best_hyperparams)
# Fit the updated workflow to the whole train data
# Evaluate performance in the held-out test data
final_fit <- last_fit(final_wflow, train_test_split)
# Confusion matrix in the test data
collect_predictions(final_fit) %>%
  conf_mat(vs, .pred_class)
#>           Truth
#> Prediction 0 1
#>          0 4 2
#>          1 2 3
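If you want more than the confusion matrix, `last_fit()` also computes default test-set classification metrics (accuracy and ROC AUC in the versions I have used), which you can retrieve directly:
# Default test-set metrics stored by last_fit()
collect_metrics(final_fit)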
This should provide a strong starting point to modify for your project.
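If you later need predictions on new observations, you can extract the fitted workflow from `final_fit` and call `predict()` on it; note that `new_cars` below is just a placeholder for whatever new data you have:
# Pull out the workflow that was trained on the full training set
fitted_wflow <- extract_workflow(final_fit)
# Predict classes for new observations ('new_cars' is a hypothetical data frame)
predict(fitted_wflow, new_data = new_cars)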