Tidymodels: Cross-Validated Target-Encoding

MattM · December 30, 2021, 8:08pm

Hello,

I have a dataset with a categorical variable of 699 levels. I am predicting a binary response. I would like to encode that variable as a numeric predictor holding the mean of the binary outcome for each category level. What is the best way to accomplish this? If possible, I would like to cross-validate the predictor. Please see the following "attempt" with play data. A vignette or blog about this topic would be helpful!

library(tidyverse)
library(tidymodels)

set.seed(1)
dat<-data.frame(col_a=sample(letters,size = 10000,replace = TRUE),
               col_b=sample(letters,size = 10000,replace = TRUE))

dat<-dat %>% 
  mutate(concat=paste(col_a,col_b,sep="-"))

set.seed(2)
y<-rbinom(n = 10000,size = 1,prob = .2)
dat$y<-y

concat_mean<-dat %>% 
  group_by(concat) %>% 
  summarise(concat_mean=mean(y))

dat<-left_join(dat,concat_mean,by="concat")
dat$y<-as.factor(dat$y)
dat$imputed_mean<-NA

imputed_dat <-
  recipe(y ~ ., data = dat) %>%
  step_impute_linear(
    imputed_mean,
    impute_with = imp_vars(concat)
  )

prep(imputed_dat)

Max · January 4, 2022, 1:56am

You can do this using the embed package. You definitely should not compute the encoding prior to resampling; use one of the step_lencode_*() functions to do it for you (see example below).

There is a vignette on these methods too.
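To make the resampling point concrete, here is a rough by-hand sketch of what "estimating the encoding inside a resample" means for a single fold. It is only an illustration: it uses plain per-level means rather than the model-based estimates that the step_lencode_*() functions compute, and the toy data, the object names, and the fallback to the overall mean for unseen levels are all assumptions made for this example.

```r
library(dplyr)
library(rsample)

# Small toy data set (hypothetical, just for illustration)
set.seed(1)
toy <- data.frame(
  concat = sample(letters[1:5], size = 200, replace = TRUE),
  y = rbinom(n = 200, size = 1, prob = 0.2)
)

set.seed(2)
folds <- vfold_cv(toy, v = 5)

# Look at the first resample only: analysis() returns the rows used for
# fitting, assessment() returns the held-out rows.
fit_rows <- analysis(folds$splits[[1]])
holdout_rows <- assessment(folds$splits[[1]])

# Estimate the per-level outcome means from the analysis rows only
level_means <- fit_rows %>%
  group_by(concat) %>%
  summarise(concat_mean = mean(y), .groups = "drop")

# Apply those estimates to the held-out rows. A level never seen in the
# analysis rows falls back to the overall analysis mean (an assumption).
holdout_encoded <- holdout_rows %>%
  left_join(level_means, by = "concat") %>%
  mutate(concat_mean = coalesce(concat_mean, mean(fit_rows$y)))
```

The held-out outcomes never contribute to their own encoding here, which is the property that is lost when the means are computed on the full data set before resampling.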

library(tidymodels)
library(embed)

set.seed(1)
dat <- 
  data.frame(
    col_a = sample(letters, size = 10000, replace = TRUE),
    col_b = sample(letters, size = 10000, replace = TRUE),
    col_c = rnorm(10000),
    # Only 1000 values are drawn here; data.frame() recycles them to 10000 rows
    y = factor(sample(LETTERS[1:2], 1000, replace = TRUE))
  )  %>%
  mutate(concat = paste(col_a, col_b, sep = "-"))

rec <- 
  recipe(y ~ concat + col_c, data = dat) %>% 
  # See functions named step_lencode_* in the embed package
  step_lencode_mixed(concat, outcome = vars(y))

lr_spec <- logistic_reg()

set.seed(2)
resamples <- vfold_cv(dat)

lr_res <- 
  lr_spec %>% 
  fit_resamples(rec, resamples = resamples)

collect_metrics(lr_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.516    10 0.00479 Preprocessor1_Model1
#> 2 roc_auc  binary     0.517    10 0.00636 Preprocessor1_Model1

Created on 2022-01-03 by the reprex package (v2.0.0)
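If you want to look at the encoded values themselves, one option is to prep() the recipe outside of resampling and inspect it. The sketch below is only for exploration, not for model assessment. It swaps in step_lencode_glm() (a no-pooling variant from the same family) so that it runs without the extra modeling packages the mixed-model step relies on; the toy data and object names are made up for this example.

```r
library(tidymodels)
library(embed)

# Small toy data set (hypothetical, just for illustration)
set.seed(1)
toy <- data.frame(
  concat = sample(letters[1:5], size = 500, replace = TRUE),
  col_c = rnorm(500),
  y = factor(sample(LETTERS[1:2], size = 500, replace = TRUE))
)

rec_glm <-
  recipe(y ~ concat + col_c, data = toy) %>%
  step_lencode_glm(concat, outcome = vars(y))

# Estimate the encoding on the toy data
prepped <- prep(rec_glm, training = toy)

# Table of estimated values, one row per level seen in training
# (the package may add an extra row that is used for unseen levels)
encodings <- tidy(prepped, number = 1)

# The training data with concat replaced by its numeric encoding
encoded <- bake(prepped, new_data = NULL)
```

For a two-class outcome the values are on the log-odds scale, so they will not look like the raw per-level proportions.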

MattM · January 4, 2022, 2:55am

Thank you for the illustration. This makes sense and I think I can work from here.

MattM · January 4, 2022, 5:18pm

I have a related question: Should I down-sample my training dataset before or after likelihood encoding?

Max · January 4, 2022, 7:56pm

I would say before, but you should try it both ways and see if there is a difference.
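One way to try both orders is to build two recipes that differ only in where the down-sampling step sits and resample each of them. This is a sketch under some assumptions: it uses step_downsample() from the themis package (not mentioned earlier in this thread), step_lencode_glm() as a lightweight stand-in for the mixed-model step, and toy data and helper names invented for the example.

```r
library(tidymodels)
library(embed)
library(themis) # assumed to be installed; provides step_downsample()

# Imbalanced toy data (hypothetical, just for illustration)
set.seed(1)
toy <- data.frame(
  concat = sample(letters[1:5], size = 1000, replace = TRUE),
  col_c = rnorm(1000),
  y = factor(rbinom(n = 1000, size = 1, prob = 0.2))
)

# Order 1: down-sample, then estimate the encoding on the reduced data
rec_down_first <-
  recipe(y ~ concat + col_c, data = toy) %>%
  step_downsample(y) %>%
  step_lencode_glm(concat, outcome = vars(y))

# Order 2: estimate the encoding on all rows, then down-sample
rec_encode_first <-
  recipe(y ~ concat + col_c, data = toy) %>%
  step_lencode_glm(concat, outcome = vars(y)) %>%
  step_downsample(y)

set.seed(2)
folds <- vfold_cv(toy, v = 5)

# Hypothetical helper: resample one recipe and label its metrics
fit_one <- function(rec, label) {
  logistic_reg() %>%
    fit_resamples(rec, resamples = folds) %>%
    collect_metrics() %>%
    mutate(order = label)
}

comparison <- bind_rows(
  fit_one(rec_down_first, "downsample, then encode"),
  fit_one(rec_encode_first, "encode, then downsample")
)
```

Because both recipes are evaluated on the same folds, the rows of comparison can be read side by side. On purely random toy data like this, no meaningful difference should be expected; the point is only the mechanics.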
