Should I apply the recipe to new data in a tidymodels workflow?

rdataforge · June 10, 2021, 11:49am

Judging by my results, I have not fully understood the tidymodels workflow.

I have trained a successful xgboost model for binary classification. But when I call the predict function on new data, an error is raised asking for the target variable (disease, not disease):

predict(disease_wf_model, new_incoming_data[1,] )
Error: Can't subset columns that don't exist.
x Column `disease` doesn't exist.

The new data has no such variable, so I am asking:

Should I call prep() when defining the recipe or not? (Some examples do, others do not.)
Should I apply the recipe to the new data before predicting?

Thanks in advance

Max · June 10, 2021, 7:53pm

It would help a lot to see all of the code.

The object name has "wf" in it. Is it a workflow object?

rdataforge · June 10, 2021, 9:56pm

Found the issue. The recipe had a string2factor step for the target column, so when the recipe was applied to new data, the target variable was not found.

Something like this:

recipe <- recipes::recipe(disease ~ ., data = train) %>%
  recipes::step_naomit(everything(), skip = TRUE) %>%
  # this step touches the target column, which new data does not have
  recipes::step_string2factor(disease) %>%
  recipes::step_dummy(all_nominal_predictors())

disease_wf_model <- my_fit$.workflow[[1]]

I assumed the recipe was smart enough to detect when new data is being used for predictions. My bad.
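
For anyone who lands here later, here is a minimal sketch of one way around it: mark the step that touches the target column with skip = TRUE, so it runs when the recipe is trained but is skipped when new data is baked. The data below is made up, and I swapped xgboost for a plain logistic regression (glm engine) only to keep the example small; both are assumptions, not my real setup. With a workflow, fit() preps the recipe and predict() applies it to the new data, so neither prep() nor bake() has to be called by hand.

```r
library(recipes)
library(parsnip)
library(workflows)

set.seed(1)

# Hypothetical training data; the outcome `disease` is stored as character
train <- data.frame(
  age = round(rnorm(200, mean = 50, sd = 10)),
  sex = factor(sample(c("f", "m"), 200, replace = TRUE))
)
train$disease <- ifelse(
  train$age + rnorm(200, sd = 10) > 50, "disease", "not_disease"
)

rec <- recipe(disease ~ ., data = train) %>%
  step_naomit(everything(), skip = TRUE) %>%
  # skip = TRUE: applied during training, skipped when predicting on new data
  step_string2factor(disease, skip = TRUE) %>%
  step_dummy(all_nominal_predictors())

# Stand-in model (assumption): logistic regression instead of xgboost
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# fit() trains the recipe and the model together; no manual prep()
disease_wf_model <- fit(wf, data = train)

# New data without the target column
new_incoming_data <- train[1:3, c("age", "sex")]

# predict() applies the trained recipe to the new data; no manual bake()
preds <- predict(disease_wf_model, new_incoming_data)
preds
```

Another option would be to convert the target to a factor before the data ever reaches the recipe, so the recipe only deals with predictors.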

I have not had much success finding documentation on moving a tidymodels model to production. I will keep searching.

Thanks Max
