How to adjust the thresholds of step_nzv()

Rsky · November 9, 2021, 1:26am

I want to remove sparse columns and columns whose values are close to zero.
I've learned that step_nzv() can be used for this:

library(tidymodels)

df <- tibble(
  y = rnorm(100000,10,10),
  x = rnorm(100000,5,2),
  col_1 = rep(0,100000),
  col_2 = rnorm(100000,0,0.2),
  col_3 = c(rep("A1",45000),
            rep("A2",45000),
            rep("B",9995),
            rep("C",2),
            rep("D",3))
)

rec <- 
  recipe(y~.,df) %>% 
  step_dummy(all_nominal()) %>% 
  step_nzv(all_predictors())

rec %>% 
  prep() %>% 
  bake(new_data=NULL)
  
  
# A tibble: 100,000 x 4
       x     col_2     y col_3_B
   <dbl>     <dbl> <dbl>   <dbl>
 1  3.44  0.0264    1.15       0
 2  1.56  0.0180   -8.34       0
 3  2.75 -0.00626  18.4        0

However, I don't understand how to use the arguments freq_cut and unique_cut.
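
For reference, the recipes documentation describes two statistics behind these arguments: the frequency ratio (count of the most common value divided by the count of the second most common value), which must be above freq_cut, and the percentage of unique values, which must be below unique_cut. The sketch below recomputes both in base R for indicator columns shaped like the dummies in the example. `nzv_stats` is a hypothetical helper written for illustration, not a recipes function, and its handling of single-valued columns is an assumption of this sketch.

```r
# Hypothetical helper (not part of recipes): the two statistics that
# step_nzv() is documented to compare against freq_cut and unique_cut.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  c(
    # count of the most common value / count of the second most common value;
    # a single-valued column is treated as infinitely unbalanced (assumption)
    freq_ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf,
    # distinct values as a percentage of the number of rows
    pct_unique = 100 * length(counts) / length(x)
  )
}

col_3 <- c(rep("A1", 45000), rep("A2", 45000), rep("B", 9995),
           rep("C", 2), rep("D", 3))

# 0/1 indicator columns like the ones step_dummy() creates
# (A1 is the reference level, so it gets no column)
dummies <- sapply(c("A2", "B", "C", "D"), function(lv) as.numeric(col_3 == lv))
colnames(dummies) <- paste0("col_3_", colnames(dummies))

# One row per dummy column, one column per statistic
stats <- t(apply(dummies, 2, nzv_stats))
round(stats, 3)
```

With the documented defaults (freq_cut = 95/5, unique_cut = 10), a column is flagged only when both conditions hold, so comparing this table against candidate thresholds shows which columns a given setting would drop.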

Q1
How can I remove col_2, which has a small variance?
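
A note on Q1: as far as the documentation goes, step_nzv() looks at how often values repeat, not at the numeric size of the variance. A continuous column such as col_2 has almost all distinct values, so its frequency ratio is about 1 and its percentage of unique values is about 100, and no freq_cut / unique_cut setting would single it out from x. One possible workaround, sketched below in base R, is to compute the variances directly and drop columns below a threshold. The threshold `var_cut` and the smaller sample size are assumptions made for illustration.

```r
set.seed(1)
n <- 1000  # smaller than the original 100000, just to keep the sketch fast
df <- data.frame(
  y     = rnorm(n, 10, 10),
  x     = rnorm(n, 5, 2),
  col_2 = rnorm(n, 0, 0.2)
)

var_cut <- 0.1  # hypothetical threshold; choose one that suits the data

# Variance of each predictor (everything except the outcome y)
predictors <- setdiff(names(df), "y")
variances  <- sapply(df[predictors], var)

# Names of the predictors whose variance falls below the threshold
low_var <- names(variances)[variances < var_cut]
low_var

# Data with the low-variance predictors removed
df_reduced <- df[setdiff(names(df), low_var)]
```

Inside a recipe, the same names could be passed to step_rm(all_of(low_var)). Keep in mind that variance depends on the scale of each column, so a single threshold is only meaningful when the columns are on comparable scales.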

Q2
How do I remove only the sparsest dummy column, col_3_C, while keeping col_3_D?
And how do I remove both col_3_C and col_3_D, but nothing else?
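
A possible approach to Q2, assuming step_nzv() flags a column when its frequency ratio is above freq_cut (as the documentation describes): col_3_C has a ratio of 99998 / 2 = 49999 and col_3_D has 99997 / 3 (about 33332), so a freq_cut between those two values should drop only col_3_C, while the default of 95/5 should drop both. The sketch below tries both settings; `bake_with_freq_cut` is a hypothetical helper, and col_1 and col_2 are left out to keep it short.

```r
library(recipes)

set.seed(1)
n <- 100000
df <- data.frame(
  y = rnorm(n, 10, 10),
  x = rnorm(n, 5, 2),
  col_3 = c(rep("A1", 45000), rep("A2", 45000), rep("B", 9995),
            rep("C", 2), rep("D", 3))
)

# Hypothetical helper: column names that survive step_nzv() for a given freq_cut
bake_with_freq_cut <- function(freq_cut) {
  recipe(y ~ ., data = df) %>%
    step_dummy(all_nominal_predictors()) %>%
    step_nzv(all_predictors(), freq_cut = freq_cut) %>%
    prep() %>%
    bake(new_data = NULL) %>%
    names()
}

# A cut between the two frequency ratios: intended to drop col_3_C only
drop_c_only <- bake_with_freq_cut(40000)

# The documented default: intended to drop both col_3_C and col_3_D
drop_c_and_d <- bake_with_freq_cut(95 / 5)

drop_c_only
drop_c_and_d
```

unique_cut is left at its default here because every dummy column has only two distinct values (0.002% of the rows), far below the default of 10.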

Thank you.
