MattM
December 30, 2021, 8:08pm
1
Hello,
I have a dataset with a categorical variable of 699 levels, and I am predicting a binary response. I would like to encode this variable as a numeric predictor holding the mean of the binary outcome for each category level. What is the best way to accomplish this? If possible, I would like to cross-validate the predictor. Please see the following "attempt" with play data. A vignette or blog about this topic would be helpful!
library(tidyverse)
library(tidymodels)
set.seed(1)
dat <- data.frame(
  col_a = sample(letters, size = 10000, replace = TRUE),
  col_b = sample(letters, size = 10000, replace = TRUE)
)
dat <- dat %>%
  mutate(concat = paste(col_a, col_b, sep = "-"))
set.seed(2)
y <- rbinom(n = 10000, size = 1, prob = .2)
dat$y <- y
concat_mean <- dat %>%
  group_by(concat) %>%
  summarise(concat_mean = mean(y))
dat <- left_join(dat, concat_mean)
dat$y <- as.factor(dat$y)
dat$imputed_mean <- NA
imputed_dat <-
  recipe(y ~ ., data = dat) %>%
  step_impute_linear(
    imputed_mean,
    impute_with = imp_vars(concat)
  )
prep(imputed_dat)
Max
January 4, 2022, 1:56am
2
You can do this using the embed package. You definitely should not do it prior to resampling; use one of the step_lencode_*() functions to do it for you (see example below).
There is a vignette on these methods too.
library(tidymodels)
library(embed)
set.seed(1)
dat <-
  data.frame(
    col_a = sample(letters, size = 10000, replace = TRUE),
    col_b = sample(letters, size = 10000, replace = TRUE),
    col_c = rnorm(10000),
    # only 1000 values are drawn here; data.frame() recycles them to 10000 rows
    y = factor(sample(LETTERS[1:2], 1000, replace = TRUE))
  ) %>%
  mutate(concat = paste(col_a, col_b, sep = "-"))
rec <-
  recipe(y ~ concat + col_c, data = dat) %>%
  # See functions named step_lencode_* in the embed package
  step_lencode_mixed(concat, outcome = vars(y))
lr_spec <- logistic_reg()
set.seed(2)
resamples <- vfold_cv(dat)
# The recipe is re-estimated within each resample, so the encoding
# is learned only from that resample's analysis set
lr_res <-
  lr_spec %>%
  fit_resamples(rec, resamples = resamples)
collect_metrics(lr_res)
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
#> 1 accuracy binary     0.516    10 0.00479 Preprocessor1_Model1
#> 2 roc_auc  binary     0.517    10 0.00636 Preprocessor1_Model1
Created on 2022-01-03 by the reprex package (v2.0.0)
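To illustrate why the encoding should not be computed prior to resampling, here is a minimal base R sketch. It is not the embed implementation: it uses plain per-level means, a single train/held-out split, and hypothetical play data with a pure-noise outcome, so an honest accuracy estimate should sit near 0.5. The variable names (leaky_means, honest_means, and so on) are made up for this illustration.

```r
# Minimal sketch (base R only): target encoding learned on all rows
# versus on training rows only. The outcome is pure noise.
set.seed(1)
n <- 10000
# Hypothetical play data mirroring the thread: two letters pasted together
concat <- paste(sample(letters, n, replace = TRUE),
                sample(letters, n, replace = TRUE), sep = "-")
y <- rbinom(n, size = 1, prob = 0.5)

# Split once into training rows and held-out rows
held_out <- sample(n, size = n / 10)
train <- setdiff(seq_len(n), held_out)

# Per-level mean of the outcome, returned as a named numeric vector
encode <- function(y, level) vapply(split(y, level), mean, numeric(1))

# Leaky: per-level means computed from ALL rows, including held-out ones
leaky_means <- encode(y, concat)
leaky_pred <- as.integer(leaky_means[concat[held_out]] > 0.5)

# Honest: per-level means computed from training rows only
honest_means <- encode(y[train], concat[train])
honest_enc <- honest_means[concat[held_out]]
# Levels never seen in training fall back to the overall training mean
honest_enc[is.na(honest_enc)] <- mean(y[train])
honest_pred <- as.integer(honest_enc > 0.5)

leaky_acc <- mean(leaky_pred == y[held_out])
honest_acc <- mean(honest_pred == y[held_out])
c(leaky = leaky_acc, honest = honest_acc)
```

The leaky version looks better than chance only because each held-out row's own outcome contributed to the mean it is then scored with. Putting the step_lencode_*() call inside the recipe avoids this, since the recipe is re-estimated within each resample.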
MattM
January 4, 2022, 2:55am
3
Thank you for the illustration. This makes sense and I think I can work from here.
MattM
January 4, 2022, 5:18pm
4
I have a related question: Should I down-sample my training dataset before or after likelihood encoding?
Max
January 4, 2022, 7:56pm
5
I would say before, but you should try it both ways and see if there is a difference.
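As a rough way to see what "both ways" changes, the base R sketch below compares plain per-level outcome means computed on the full data with the same means computed after down-sampling the majority class. Everything here is an assumption for illustration: the 26-level predictor, the 20% event rate, and the encode() helper are hypothetical, and the step_lencode_*() functions estimate their encodings differently. In a recipe, the "before" ordering would presumably correspond to placing a down-sampling step such as themis::step_downsample() ahead of the step_lencode_*() step.

```r
# Minimal sketch (base R only): how down-sampling changes target encodings.
set.seed(2)
n <- 10000
level <- sample(letters, n, replace = TRUE)   # hypothetical 26-level predictor
y <- rbinom(n, size = 1, prob = 0.2)          # imbalanced binary outcome

# Per-level mean of the outcome, returned as a named numeric vector
encode <- function(y, level) vapply(split(y, level), mean, numeric(1))

# Down-sample the majority class (y == 0) to the size of the minority class
minority <- which(y == 1)
majority <- sample(which(y == 0), size = length(minority))
keep <- c(minority, majority)

enc_full <- encode(y, level)               # encodings learned before down-sampling
enc_down <- encode(y[keep], level[keep])   # encodings learned after down-sampling

# Average encoded value under each ordering
c(encode_first = mean(enc_full), downsample_first = mean(enc_down))
```

The encodings track the event rate of whichever rows they are computed from, so the two orderings put the encoded predictor on different scales. Whether that matters for the final model is something to check with resampling.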