Help with SMOTE and cross-validation

Vincenzo99 · January 22, 2022, 3:36pm

I'm trying to do cross-validation for a logistic regression model.
I applied SMOTE to the training set and left the test set from the initial split untouched.
When I make the final estimate of the error, what argument do I pass to last_fit()? I cannot pass the initial split, because last_fit() would then use the original training set without SMOTE. My code is below:

#initial split

pok_split=initial_split(pokemon,prop=3/4)
pok_train=training(pok_split)
pok_test=testing(pok_split)
pok_train_fold=vfold_cv(pok_train,v=5)

#apply SMOTE to the training set, then build the folds

pok_smote_train=smote(be_legendary~.,pok_train,perc.over=2,perc.under=5.65) #2 and 5.65
frequenze_smote_train=count(pok_smote_train,be_legendary)
pok_test
pok_smote_fold=vfold_cv(pok_smote_train,v=5)

#logistic regression with cross-validation

logistica_recipe1=recipe(be_legendary~. ,data=pok_smote_train)
specifica_logistica1=logistic_reg()%>%set_engine("glm")    

work_logistica1=workflow()%>%
  add_model(specifica_logistica1)%>%
  add_recipe(logistica_recipe1)

logistica_fit1=work_logistica1%>%fit_resamples(resamples=pok_smote_fold,control=control_resamples(save_pred = TRUE))

metriche_log_smote= collect_metrics(logistica_fit1,metric="accuracy")

si_logistica1= collect_predictions(logistica_fit1)

#########
migliore_metrica_log_smote= select_best(logistica_fit1,metric="accuracy")

work_finale_logistica1= work_logistica1%>%finalize_workflow(migliore_metrica_log_smote)

log_smote_fit_finale= work_finale_logistica1%>%last_fit(pok_split)

metriche_finali_log_smote= log_smote_fit_finale%>%collect_metrics()

predizioni_finali=collect_predictions(log_smote_fit_finale)

matrice_confusione_log_smote=table(predizioni_finali$.pred_class,pok_test$be_legendary)

rgregg · January 28, 2022, 5:13pm

I think using the themis package's step_smote() function would fix this. I would replace your recipe with:

logistica_recipe1 = recipe(be_legendary~. ,data=pok_train) %>% step_smote(be_legendary)

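Now, whenever you call fit_resamples() or last_fit(), SMOTE is applied as part of the recipe each time the model is fitted, so you can build the folds from pok_train and pass pok_split to last_fit(). By default, step_smote() only affects the data the model is trained on; the assessment folds and the test set are left unchanged.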
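
To make the suggestion concrete, here is a minimal sketch of the whole pipeline with SMOTE moved into the recipe. It is not the original poster's exact setup: the `pokemon` data frame below is a simulated stand-in (the columns `attack` and `speed` are hypothetical), and the `strata` arguments are an added assumption to keep the rare class present in every fold.

```r
library(tidymodels)
library(themis)

set.seed(123)

# Simulated stand-in for the real pokemon data (hypothetical columns):
# two numeric predictors and an imbalanced two-level factor outcome.
n <- 400
pokemon <- tibble(
  attack = rnorm(n, mean = 75, sd = 20),
  speed  = rnorm(n, mean = 65, sd = 15)
) %>%
  mutate(
    be_legendary = factor(
      if_else(attack + speed + rnorm(n, sd = 15) > 180, "yes", "no")
    )
  )

# Split and fold the ORIGINAL data; no manual SMOTE step is needed.
# strata = be_legendary is an assumption added for this sketch.
pok_split <- initial_split(pokemon, prop = 3/4, strata = be_legendary)
pok_train <- training(pok_split)
pok_test <- testing(pok_split)
pok_train_fold <- vfold_cv(pok_train, v = 5, strata = be_legendary)

# SMOTE lives inside the recipe, so it is re-run on each analysis set.
logistica_recipe1 <- recipe(be_legendary ~ ., data = pok_train) %>%
  step_smote(be_legendary)

work_logistica1 <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_recipe(logistica_recipe1)

# Cross-validated performance on the training data.
logistica_fit1 <- work_logistica1 %>%
  fit_resamples(
    resamples = pok_train_fold,
    control = control_resamples(save_pred = TRUE)
  )
collect_metrics(logistica_fit1)

# Final fit: trained on the SMOTE-augmented training set,
# evaluated on the untouched test set.
log_smote_fit_finale <- work_logistica1 %>% last_fit(pok_split)
collect_metrics(log_smote_fit_finale)

# Check what the recipe does: the training data come out balanced,
# while new data passed to bake() are not resampled.
smote_train_counts <- logistica_recipe1 %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(be_legendary)
baked_test <- logistica_recipe1 %>%
  prep() %>%
  bake(new_data = pok_test)
```

Because there is nothing to tune here, select_best() and finalize_workflow() are not needed; the workflow can go straight to last_fit().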
