In the training data, there is a column named fruit whose categories are apple, banana, and orange.
In the test data, the same column contains apple, banana, and blueberry.
Question 1
If I convert the fruit column to dummy variables, the test row with the unseen level (blueberry) gets NA in every dummy column.
Is there any way to deal with this?
Does the training argument of prep() have anything to do with it?
Sample code:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
df_train <- tibble(id = c(1, 2, 3), fruit = c("apple", "orange", "banana"), num = c(100, 200, 300))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))
# Estimate the recipe on the training data only
rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()
# Baked training data
rec %>%
  bake(new_data = NULL)
#> # A tibble: 3 x 5
#>      id   num fruit_apple fruit_banana fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>        <dbl>
#> 1     1   100           1            0            0
#> 2     2   200           0            0            1
#> 3     3   300           0            1            0
rec %>%
  bake(new_data = df_test)
#> Warning: There are new levels in a factor: blueberry
#> # A tibble: 3 x 5
#>      id   num fruit_apple fruit_banana fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>        <dbl>
#> 1     1   100           1            0            0
#> 2     2   500          NA           NA           NA
#> 3     3   300           0            1            0
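One thing that might be worth trying here is step_novel() from recipes, placed before step_dummy(). This is only a sketch under that assumption, reusing the data from the example above: step_novel() adds a placeholder level (named "new" by default) during prep(), so a value such as blueberry that appears only at bake time is mapped to that level instead of becoming NA. The name rec_novel is made up for illustration. As far as I understand it, the training argument of prep() only selects which data the steps are estimated from, so on its own it would not make an unseen level known.

```r
library(tidymodels)

df_train <- tibble(id = c(1, 2, 3), fruit = c("apple", "orange", "banana"), num = c(100, 200, 300))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))

rec_novel <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  # Assumption: reserve a placeholder level ("new") for values not seen in training
  step_novel(fruit) %>%
  step_dummy(fruit, one_hot = TRUE) %>%
  # The placeholder dummy column is all zeros in training, so step_zv() drops it
  step_zv(all_predictors()) %>%
  prep()

# The blueberry row should now contain 0 in the remaining dummy columns, not NA
baked_test <- bake(rec_novel, new_data = df_test)
baked_test
```

Whether an all-zero row is an acceptable encoding for an unseen fruit depends on the model, so this is a workaround to evaluate rather than a guaranteed fix.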
Question 2
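In addition, there are cases where step_zv() does not seem to work on df_test: in the example below, fruit_orange is all zeros in the baked test data but is not removed.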
Please advise what to do in this case.
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
df_train <- tibble(id = c(1, 2, 3, 4), fruit = c("apple", "orange", "banana", "blueberry"), num = c(100, 200, 300, 450))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))
# Estimate the recipe on the training data only
rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()
# Baked training data
rec %>%
  bake(new_data = NULL)
#> # A tibble: 4 x 6
#>      id   num fruit_apple fruit_banana fruit_blueberry fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>           <dbl>        <dbl>
#> 1     1   100           1            0               0            0
#> 2     2   200           0            0               0            1
#> 3     3   300           0            1               0            0
#> 4     4   450           0            0               1            0
rec %>%
  bake(new_data = df_test)
#> # A tibble: 3 x 6
#>      id   num fruit_apple fruit_banana fruit_blueberry fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>           <dbl>        <dbl>
#> 1     1   100           1            0               0            0
#> 2     2   500           0            0               1            0
#> 3     3   300           0            1               0            0
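My understanding is that step_zv() decides which columns to drop when prep() runs on the training data, and bake() then applies that same decision to new data, so a column that is constant only in the test set is kept. The sketch below, using the data from the second example, is one way to check this: tidy() on the second step lists what step_zv() removed during prep(), and a small base R check finds columns that are constant in the baked test data. The names baked_test and constant_in_test are made up for illustration.

```r
library(tidymodels)

df_train <- tibble(id = c(1, 2, 3, 4), fruit = c("apple", "orange", "banana", "blueberry"), num = c(100, 200, 300, 450))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))

rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()

# Columns step_zv() chose to remove, based on the training data
# (expected to be empty here, since no dummy column is constant in df_train)
tidy(rec, number = 2)

baked_test <- bake(rec, new_data = df_test)

# Columns that happen to be constant in the test data only
is_constant <- vapply(baked_test, function(x) dplyr::n_distinct(x) == 1, logical(1))
constant_in_test <- names(baked_test)[is_constant]
constant_in_test
```

Here constant_in_test should contain only fruit_orange. Dropping it from the test set by hand would leave the test data with different columns from the training data, which most models will not accept, so keeping the column is probably the safer choice.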
Thanks for reading.