Understanding step_smote in tidymodels

fiorepalombina · March 13, 2024, 6:50pm

Dear Community,
I would appreciate your help understanding how to use step_smote in a binary classification problem.

I will describe my case study here, and I hope it will be useful to other people.

Before starting, I import the dataset as follows:

library(tidyverse)
library(tidymodels)
library(readxl)
library(themis)

dati <- read.csv("dataset.csv")

Then I check whether the dataset is unbalanced, which it is in my case:

dati %>%
  dplyr::group_by(Man) %>%
  count()

# A tibble: 2 × 2
# Groups:   Man [2]
  Man       n
  <fct> <int>
1 Bio    2481
2 Conv   6561

As you can see, the dataset is unbalanced.

So I will build a recipe to use later in my workflows.

In this case I have used:

step_impute_bag to impute NA values
step_smote to address the class imbalance
step_zv to remove columns with zero variance
step_scale to scale the predictors to unit standard deviation

recipe <- recipe(formula(paste("Man", "~ .")),
                 data = train_data) %>%
  step_impute_bag(all_predictors()) %>%
  themis::step_smote(Man, over_ratio = 0.5) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_predictors())

Now I want to see the results of my preprocessing operations.

preprocess_train <- prep(recipe, training = train_data)
train_data_processed <- bake(preprocess_train, new_data = train_data)

train_data_processed %>%
  dplyr::group_by(Man) %>%
  count()

# A tibble: 2 × 2
# Groups:   Man [2]
  Man       n
  <fct> <int>
1 Bio    2481
2 Conv   6561

It seems that step_smote is not working properly.

Am I doing something wrong?

Below is my entire workflow. Please check it, because the recipe is only one part of the whole tidymodels workflow.

split <- initial_split(dati, prop = 0.8)

training <- training(split)
test <- testing(split)

set.seed(123)

dataset_cv <- vfold_cv(training, v=5)

set.seed(123)

recipe <- recipe(formula(paste("Man", "~ .")),
                     data = training) %>%
      step_impute_bag(all_predictors()) %>%
      themis::step_smote(Man) %>%
      step_zv(all_predictors()) %>%
      step_scale(all_predictors())

random_forest <-rand_forest(trees = tune(),
                                mtry = tune(),
                                min_n = tune()) %>%
      set_mode("classification") %>%
      set_engine("randomForest")

workflow_random_forest <- workflow() %>%
      add_recipe(recipe) %>%
      add_model(random_forest)

rf_grid <- grid_latin_hypercube(finalize(mtry(), training),
                                    trees(),
                                    min_n(),
                                    size = 5)

rf_tuned <- tune_grid(object = workflow_random_forest,
                          resamples = dataset_cv,
                          grid = rf_grid,
                          metrics = metric_set(accuracy),
                          control = control_grid(verbose = TRUE))

best_params_rand_forest <- rf_tuned %>%
      select_best(metric = "accuracy")

rf_workflow <- workflow_random_forest %>%
      finalize_workflow(best_params_rand_forest)

final_model_rf <- fit(rf_workflow, training)

Thanks to everyone who can help me understand where I am making mistakes.

marcelo_carvalho · March 15, 2024, 4:05am

You could try new_data = NULL instead of new_data = train_data.

Example:

library(tidymodels)
library(themis)


circle_example %>%
  count(class)

rec_circle_example <-
  recipe(class ~ x + y, data = circle_example) %>%
  step_smote(class) %>%
  prep()

bake(rec_circle_example, new_data = NULL) %>% 
  count(class)
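
The reason new_data = NULL behaves differently is that step_smote() is created with skip = TRUE by default: it runs when the recipe is prepped on the training set, but it is skipped whenever bake() is given new data, so synthetic rows are never generated for data you predict on. A minimal sketch with the same circle_example data (the object names n_original, n_new_data and n_training are only illustrative):

```r
library(tidymodels)
library(themis)

set.seed(123)

rec <- recipe(class ~ x + y, data = circle_example) %>%
  step_smote(class) %>%
  prep()

# Class counts in the raw data
n_original <- circle_example %>% count(class)

# Passing a data frame: skipped steps (including SMOTE) are not applied,
# so the counts match the raw data
n_new_data <- bake(rec, new_data = circle_example) %>% count(class)

# new_data = NULL returns the retained, fully processed training set,
# which includes the rows synthesized by SMOTE
n_training <- bake(rec, new_data = NULL) %>% count(class)

n_original
n_new_data
n_training
```

With the default over_ratio = 1, the last table should show both classes at the size of the majority class, while the first two tables should be identical.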

fiorepalombina · March 15, 2024, 7:00am

Thanks @marcelo_carvalho,

So, to train the model while dealing with the unbalanced dataset, should I use the following code?

library(tidyverse)
library(tidymodels)
library(readxl)
library(themis)

set.seed(123)

dati <- read.csv("dataset.csv") 

dati_split <- initial_split(dati, prop = 0.8, strata = "Man")

train_data <- training(dati_split)
test_data <- testing(dati_split)

recipe <- recipe(formula(paste("Man", "~ .")),
                 data = dati) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_smote(Man)%>%
  prep()

dataset_cv <- vfold_cv(train_data, v=5, strata = "Man")

random_forest <-rand_forest(trees = tune(),
                            mtry = tune(),
                            min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest")

workflow_random_forest <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(random_forest)

rf_grid <- grid_latin_hypercube(finalize(mtry(), train_data),
                                trees(),
                                min_n(),
                                size = 10)

rf_tuned <- tune_grid(object = workflow_random_forest,
                      resamples = dataset_cv,
                      grid = rf_grid,
                      metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
                      control = control_grid(verbose = TRUE))

collect_metrics(rf_tuned)

best_params_rand_forest <- rf_tuned %>%
 select_best(metric = "sensitivity")

rf_workflow <- workflow_random_forest %>%
 finalize_workflow(best_params_rand_forest)

final_model_rf <- fit(rf_workflow, train_data)
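
Two details in the code above are worth checking. As far as I know, add_recipe() expects an untrained recipe and rejects one that has already been through prep(), because the workflow preps the recipe itself on each resample. The recipe is also built with data = dati, the full dataset, rather than train_data. The sketch below shows one way this could be arranged. It is not the original pipeline: circle_example stands in for dataset.csv, a plain logistic_reg() stands in for the tuned random forest, and the names rec, wf and fitted_wf are illustrative.

```r
library(tidymodels)
library(themis)

set.seed(123)

# Stand-in data (assumption): circle_example replaces the thread's dataset.csv
split <- initial_split(circle_example, prop = 0.8, strata = class)
train_data <- training(split)
test_data <- testing(split)

# Untrained recipe defined on the training set only; note there is no prep().
# Imputation comes before step_smote(), which needs complete numeric
# predictors (circle_example has no NAs, so the imputation is a no-op here).
rec <- recipe(class ~ x + y, data = train_data) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_smote(class)

# Stand-in model (assumption): logistic regression instead of rand_forest()
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg())

# fit() preps the recipe on train_data, so SMOTE is applied here
fitted_wf <- fit(wf, data = train_data)

# The processed training outcomes seen by the model should be balanced
balanced <- extract_mold(fitted_wf)$outcomes %>% count(class)

# At prediction time SMOTE is skipped: one prediction per test row
preds <- predict(fitted_wf, new_data = test_data)

balanced
nrow(preds) == nrow(test_data)
```

The same untrained recipe can be passed to tune_grid() with the cross-validation folds, in which case SMOTE is applied only to the analysis portion of each fold and the held-out portion keeps its original class balance.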
