How does tidymodels (recipes) handle categories that only exist in train?

Rsky · July 7, 2021, 2:44am

In the training data, there is a column named fruit with the categories apple, banana, and orange.
In the test data, the same column contains apple, banana, and blueberry.

Question 1

If we turn the fruit column into dummy variables, the test row with the unseen category (blueberry) gets NA in every dummy column.
Is there any way to deal with this?

Does the training argument of prep() have anything to do with it?

sample code

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip


df_train <- tibble(id = c(1, 2, 3), fruit = c("apple", "orange", "banana"), num = c(100, 200, 300))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))


rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()


rec %>%
  bake(new_data = NULL)
#> # A tibble: 3 x 5
#>      id   num fruit_apple fruit_banana fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>        <dbl>
#> 1     1   100           1            0            0
#> 2     2   200           0            0            1
#> 3     3   300           0            1            0

rec %>%
  bake(new_data = df_test)
#> Warning: There are new levels in a factor: blueberry
#> # A tibble: 3 x 5
#>      id   num fruit_apple fruit_banana fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>        <dbl>
#> 1     1   100           1            0            0
#> 2     2   500          NA           NA           NA
#> 3     3   300           0            1            0

Question 2

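In addition, there are cases where step_zv() appears to have no effect on df_test. In the example below, fruit_orange is all zeros in the baked test data but is not removed.
Please advise what to do in this case.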

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

df_train <- tibble(id = c(1, 2, 3, 4), fruit = c("apple", "orange", "banana", "blueberry"), num = c(100, 200, 300, 450))
df_test <- tibble(id = c(1, 2, 3), fruit = c("apple", "blueberry", "banana"), num = c(100, 500, 300))
rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()


rec %>%
  bake(new_data = NULL)
#> # A tibble: 4 x 6
#>      id   num fruit_apple fruit_banana fruit_blueberry fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>           <dbl>        <dbl>
#> 1     1   100           1            0               0            0
#> 2     2   200           0            0               0            1
#> 3     3   300           0            1               0            0
#> 4     4   450           0            0               1            0

rec %>%
  bake(new_data = df_test)
#> # A tibble: 3 x 6
#>      id   num fruit_apple fruit_banana fruit_blueberry fruit_orange
#>   <dbl> <dbl>       <dbl>        <dbl>           <dbl>        <dbl>
#> 1     1   100           1            0               0            0
#> 2     2   500           0            0               1            0
#> 3     3   300           0            1               0            0

Thanks for reading.

pathos · July 7, 2021, 4:57am

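For question 2, step_zv() removes zero-variance columns, that is, columns containing a single unique value (step_nzv() is the near-zero-variance version). As far as I can tell, the columns to drop are chosen from the training data when the recipe is prepped. fruit_orange is not constant in df_train, so it is kept, and bake() then returns the same set of columns for df_test.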
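To make the step_zv() behaviour concrete, here is a minimal sketch that reuses the data from question 2. It assumes that tidy() on the prepped recipe lists the columns step_zv() selected for removal, and that step_zv() is step number 2 because update_role() is not counted as a step. The object names zv_removed, baked_train and baked_test are only illustrative.

```r
library(recipes)
library(tibble)

df_train <- tibble(
  id = c(1, 2, 3, 4),
  fruit = c("apple", "orange", "banana", "blueberry"),
  num = c(100, 200, 300, 450)
)
df_test <- tibble(
  id = c(1, 2, 3),
  fruit = c("apple", "blueberry", "banana"),
  num = c(100, 500, 300)
)

rec <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  step_dummy(fruit, one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  prep()

# Columns that step_zv() chose to remove; this is decided from df_train only
zv_removed <- tidy(rec, number = 2)

baked_train <- bake(rec, new_data = NULL)
baked_test <- bake(rec, new_data = df_test)

# The test output keeps the training column layout, even though
# fruit_orange happens to be constant within df_test
identical(names(baked_train), names(baked_test))
```

Keeping the column set fixed after prep() looks like the intended design, since a model fitted on the baked training data expects the same predictors at prediction time. On that reading, a column that is constant only in the test set is not something step_zv() is meant to drop.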

Sorry, I don't understand question 1.
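If question 1 is about the NA row produced by the unseen level blueberry, one possible approach is to add step_novel() before step_dummy(). The sketch below reuses the data from question 1. It relies on step_novel() mapping levels not seen during prep() to an extra level (named "new" by default), and the names rec_novel and baked_test are only illustrative. This is not necessarily the only or the recommended fix; step_other() or step_unknown() may suit other situations better.

```r
library(recipes)
library(tibble)

df_train <- tibble(
  id = c(1, 2, 3),
  fruit = c("apple", "orange", "banana"),
  num = c(100, 200, 300)
)
df_test <- tibble(
  id = c(1, 2, 3),
  fruit = c("apple", "blueberry", "banana"),
  num = c(100, 500, 300)
)

rec_novel <- recipe(num ~ ., data = df_train) %>%
  update_role(id, new_role = "ID") %>%
  # Reserve an extra factor level for categories unseen in training
  step_novel(fruit) %>%
  step_dummy(fruit, one_hot = TRUE) %>%
  # The dummy column for the reserved level is all zeros in training,
  # so step_zv() drops it here
  step_zv(all_predictors()) %>%
  prep()

baked_test <- bake(rec_novel, new_data = df_test)

# Check whether any NA values remain in the baked test data
anyNA(baked_test)
```

With this recipe, the blueberry row should come out as zeros in the remaining fruit_* columns instead of NA. If the reserved column should be kept so the model can see "unseen category" as its own indicator, step_zv() could be left out. As for the training argument of prep(), it only selects which data the steps are estimated from, so on its own it does not change how unseen levels are handled.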
