I want to remove columns whose values are all close to zero, as well as sparse columns.
I've learned that step_nzv() can be used for this:
library(tidymodels)

df <- tibble(
  y = rnorm(100000, 10, 10),
  x = rnorm(100000, 5, 2),
  col_1 = rep(0, 100000),
  col_2 = rnorm(100000, 0, 0.2),
  col_3 = c(rep("A1", 45000),
            rep("A2", 45000),
            rep("B", 9995),
            rep("C", 2),
            rep("D", 3))
)

rec <-
  recipe(y ~ ., df) %>%
  step_dummy(all_nominal()) %>%
  step_nzv(all_predictors())

rec %>%
  prep() %>%
  bake(new_data = NULL)
# A tibble: 100,000 x 4
x col_2 y col_3_B
<dbl> <dbl> <dbl> <dbl>
1 3.44 0.0264 1.15 0
2 1.56 0.0180 -8.34 0
3 2.75 -0.00626 18.4 0
However, I don't understand how to use the arguments freq_cut and unique_cut.
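For context, near-zero-variance filters of this kind are generally based on two per-column statistics: the ratio of the most common value's count to the second most common value's count (compared against freq_cut, default 95/5), and the percentage of distinct values (compared against unique_cut, default 10). The base R sketch below computes both for the dummy columns in this example. The helper nzv_stats() is hypothetical and is not part of recipes; it only illustrates the idea.

```r
# Hypothetical helper (not part of recipes): the two statistics that
# near-zero-variance filters are generally based on.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # most common count divided by second most common count
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # distinct values as a percentage of the number of rows
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

col_3 <- c(rep("A1", 45000), rep("A2", 45000), rep("B", 9995),
           rep("C", 2), rep("D", 3))

# 0/1 indicators like the ones step_dummy() creates (A1 is the reference level)
dummies <- sapply(c("A2", "B", "C", "D"), function(lv) as.numeric(col_3 == lv))
colnames(dummies) <- paste0("col_3_", colnames(dummies))

stats <- t(apply(dummies, 2, nzv_stats))
print(round(stats, 3))
```

Under this reading, a column is dropped when its frequency ratio is above freq_cut and its percentage of distinct values is at or below unique_cut; zero-variance columns such as col_1 are dropped regardless.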
Q1
How can I eliminate col_2, which has a small variance?
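One possible direction, sketched here under assumptions: since step_nzv() looks at value frequencies and distinct-value counts rather than at the size of the variance, a continuous column like col_2 (almost every value distinct) would not be flagged by it. A plain variance threshold could be computed separately instead. The threshold var_cut, the seed, and the smaller sample size below are all hypothetical choices made for illustration.

```r
set.seed(1)  # seed and sample size chosen only for this illustration
n <- 1000
num_cols <- data.frame(
  x     = rnorm(n, 5, 2),
  col_1 = rep(0, n),
  col_2 = rnorm(n, 0, 0.2)
)

var_cut <- 0.1  # hypothetical threshold, pick one that suits the data

# variance of each numeric column, then the names that fall below the cut
col_vars <- sapply(num_cols, var)
low_var <- names(col_vars)[col_vars < var_cut]
print(low_var)
```

The resulting names could then be removed in the recipe, for example by passing them to step_rm() with all_of(low_var). Note that a raw variance threshold depends on the scale of each column, so it may need to be chosen per dataset.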
Q2
How do I remove only the sparsest dummy column, col_3_C, while keeping col_3_D?
And how do I remove both col_3_C and col_3_D, but nothing else?
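A rough way to reason about this is to compare each dummy column's frequency ratio (count of zeros divided by count of ones here) with candidate freq_cut values. The sketch below does that in base R using the counts from the data above; the helper flagged() is hypothetical, and it assumes every dummy column's percentage of distinct values (2 out of 100000 rows) is already below unique_cut.

```r
# zeros / ones for each dummy column, taken from the level counts in col_3
freq_ratio <- c(
  col_3_A2 = 55000 / 45000,
  col_3_B  = 90005 / 9995,
  col_3_C  = 99998 / 2,
  col_3_D  = 99997 / 3
)

# Hypothetical helper: columns whose frequency ratio exceeds a given cut
flagged <- function(freq_cut) names(freq_ratio)[freq_ratio > freq_cut]

default_cut <- flagged(95 / 5)  # col_3_C and col_3_D
high_cut    <- flagged(40000)   # col_3_C only
print(default_cut)
print(high_cut)
```

If that reasoning holds, the default freq_cut should already drop both col_3_C and col_3_D, while something like step_nzv(all_predictors(), freq_cut = 40000) would be expected to drop col_3_C but keep col_3_D. The all-zero col_1 would still be removed in both cases.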
Thank you.