Dear Community,
I kindly ask you to help me understand the usage of step_smote in a binary classification problem.
I will report my case study here, and I hope it can be useful for other people.
Before starting, I import the dataset as follows:
library(tidyverse)
library(tidymodels)
library(readxl)
library(themis)
dati <- read.csv("dataset.csv")
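One note: if I understand the themis documentation correctly, step_smote needs the outcome to be a factor, so I make sure Man is a factor right after import (read.csv gives me a character column by default):

dati <- dati %>%
  dplyr::mutate(Man = as.factor(Man))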
Then I check whether I have an unbalanced dataset, which is indeed my case:

dati %>%
  dplyr::group_by(Man) %>%
  count()
# A tibble: 2 × 2
# Groups:   Man [2]
  Man       n
  <fct> <int>
1 Bio    2481
2 Conv   6561
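To quantify the imbalance, the class proportions can also be computed like this (roughly 27% Bio vs 73% Conv):

dati %>%
  dplyr::count(Man) %>%
  dplyr::mutate(prop = n / sum(n))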
As you can see, I have an unbalanced dataset, so I will use the recipes package to build a recipe that I can then use in my workflow.
In this case I have used:
- step_impute_bag to impute NA values
- step_smote to address the class imbalance
- step_zv to remove columns with zero variance
- step_scale to standardize the predictors
recipe <- recipe(Man ~ ., data = train_data) %>%
  step_impute_bag(all_predictors()) %>%
  themis::step_smote(Man, over_ratio = 0.5) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_predictors())
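My expectation (assuming I read the themis documentation correctly) is that over_ratio = 0.5 oversamples the minority class until it reaches about half the size of the majority class:

# expected counts after step_smote with over_ratio = 0.5 (my understanding, please correct me if wrong):
# Conv (majority): 6561 rows, unchanged
# Bio  (minority): synthetic rows added up to ~0.5 * 6561 ≈ 3280 rows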
Now I want to see the result of my preprocessing operations:

preprocess_train <- prep(recipe, training = train_data)
train_data_processed <- bake(preprocess_train, new_data = train_data)

train_data_processed %>%
  dplyr::group_by(Man) %>%
  count()
# A tibble: 2 × 2
# Groups:   Man [2]
  Man       n
  <fct> <int>
1 Bio    2481
2 Conv   6561
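I also checked that the SMOTE step is really part of the prepped recipe, in case it helps:

tidy(preprocess_train)
# this should list every step with its type, whether it was trained,
# and whether it is flagged to be skipped at bake time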
It seems that step_smote is not working properly: the class counts are identical before and after baking.
Am I doing something wrong?
I will report my entire workflow below. Please check it, because the recipe above is part of a full tidymodels workflow:
split <- initial_split(dati, prop = 0.8)
training <- training(split)
test <- testing(split)

set.seed(123)
dataset_cv <- vfold_cv(training, v = 5)

set.seed(123)
recipe <- recipe(Man ~ ., data = training) %>%
  step_impute_bag(all_predictors()) %>%
  themis::step_smote(Man) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_predictors())
random_forest <- rand_forest(trees = tune(),
                             mtry = tune(),
                             min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("randomForest")

workflow_random_forest <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(random_forest)
rf_grid <- grid_latin_hypercube(finalize(mtry(), training),
                                trees(),
                                min_n(),
                                size = 5)

rf_tuned <- tune_grid(object = workflow_random_forest,
                      resamples = dataset_cv,
                      grid = rf_grid,
                      metrics = metric_set(accuracy),
                      control = control_grid(verbose = TRUE))
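In case it is useful, this is how I inspect the tuning results before selecting the best configuration:

rf_tuned %>% collect_metrics()
rf_tuned %>% show_best(metric = "accuracy", n = 5)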
best_params_rand_forest <- rf_tuned %>%
  select_best(metric = "accuracy")

rf_workflow <- workflow_random_forest %>%
  finalize_workflow(best_params_rand_forest)

final_model_rf <- fit(rf_workflow, training)
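And this is roughly how I plan to evaluate the finalized model on the test set (just a sketch, using yardstick's conf_mat), since my worry is that the imbalance hurts predictions for the Bio class:

final_model_rf %>%
  predict(new_data = test) %>%
  dplyr::bind_cols(test %>% dplyr::select(Man)) %>%
  yardstick::conf_mat(truth = Man, estimate = .pred_class)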
Thanks to everyone who can help me understand where I am making a mistake.