tidymodels variables after step_XXX() functions

lisalendway · May 23, 2023, 6:28pm

Is there a function that returns the list/vector/etc. of variables that "survive" the step functions? I have a simple reproducible example below. I'd like to be able to get a list of variables that are left at the end - in this case, only x2. This seems somewhat related to this post.

library(tidymodels)

set.seed(123)

samp_size <- 1000

# Creating sample data where x2 is highly correlated with x and x3 has near-zero variance

sample_data <- tibble(x = rnorm(samp_size)) %>% 
  mutate(
    y = 3 + 2*x + rnorm(samp_size, 0, .5),
    x2 = x + rnorm(samp_size, 0,.1),
    x3 = c(rep(0, samp_size - 1), 1)
  )

# Didn't break out training and testing since it's not needed for this simple example.

simple_recipe <- recipe(y ~ ., sample_data) %>% 
  step_nzv(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) 

# When I run the steps, I'm only left with y and x2

simple_recipe %>% 
  prep() %>% 
  juice()

Emilhvitfeldt · May 23, 2023, 7:21pm

Good question!

For now, you can use tidy() on the prepped recipe to see which variables each step removed.

library(tidymodels)

set.seed(123)

samp_size <- 1000

# Creating sample data where x2 is highly correlated with x and x3 has near-zero variance

sample_data <- tibble(x = rnorm(samp_size)) %>% 
  mutate(
    y = 3 + 2*x + rnorm(samp_size, 0, .5),
    x2 = x + rnorm(samp_size, 0,.1),
    x3 = c(rep(0, samp_size - 1), 1)
  )

# Didn't break out training and testing since it's not needed for this simple example.

simple_recipe <- recipe(y ~ ., sample_data) %>% 
  step_nzv(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) 

# When I run the steps, I'm only left with y and x2

prepped_recipe <- simple_recipe %>% 
  prep()

prepped_recipe %>%
  tidy(1)
#> # A tibble: 1 × 2
#>   terms id       
#>   <chr> <chr>    
#> 1 x3    nzv_dOXnH

prepped_recipe %>%
  tidy(2)
#> # A tibble: 1 × 2
#>   terms id        
#>   <chr> <chr>     
#> 1 x     corr_07DFJ

prepped_recipe %>%
  bake(new_data = NULL)
#> # A tibble: 1,000 × 2
#>        x2     y
#>     <dbl> <dbl>
#>  1 -0.612  1.38
#>  2 -0.206  2.02
#>  3  1.50   6.11
#>  4  0.192  3.07
#>  5  0.147  1.98
#>  6  1.65   6.95
#>  7  0.280  4.05
#>  8 -1.33   1.68
#>  9 -0.482  1.97
#> 10 -0.502  1.89
#> # ℹ 990 more rows

I have been thinking more deeply about this type of problem and have outlined some ideas in tidymodels/recipes GitHub issue #1137, "Feature: Extracting names of variable input and output".
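As a complement to tidy(), which reports what each step removed, here is a minimal sketch of two ways to list what is left. It assumes the same sample data and recipe as above; the object names surviving and surviving_baked are only illustrative. It relies on summary() for a prepped recipe describing the current set of variables, and on bake() accepting a selector such as all_predictors().

```r
library(tidymodels)

set.seed(123)
samp_size <- 1000

# Same sample data as in the question
sample_data <- tibble(x = rnorm(samp_size)) %>%
  mutate(
    y  = 3 + 2 * x + rnorm(samp_size, 0, .5),
    x2 = x + rnorm(samp_size, 0, .1),
    x3 = c(rep(0, samp_size - 1), 1)
  )

prepped_recipe <- recipe(y ~ ., sample_data) %>%
  step_nzv(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  prep()

# Option 1: summary() on a prepped recipe lists the variables that
# exist after all steps have run, along with their roles.
surviving <- summary(prepped_recipe) %>%
  filter(role == "predictor") %>%
  pull(variable)

# Option 2: bake only the predictors of the training data and read
# the column names off the result.
surviving_baked <- names(
  bake(prepped_recipe, new_data = NULL, all_predictors())
)

surviving
surviving_baked
```

With this data both vectors should contain only x2, matching the bake() output above. Option 1 avoids materializing the processed data, which may matter for wide data sets.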

lisalendway · May 25, 2023, 2:50pm

I'll watch your development closely. I knew the tidymodels team would be on top of this! And, just to give a use case, I'm often asked how the predictor variables differ across certain groups. If I start with maybe 500 variables but only 300 make it past the step_XXX() phase, it would be nice to report only on those 300 rather than on all 500.

I appreciate you answering the question so quickly.
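For the reporting use case described here, one possible sketch is to compare the predictors the recipe started with against those that remain after prep(), then restrict the report to the ones that were kept. This assumes the sample data and recipe from the original question; vars_before, vars_after, dropped, and kept are illustrative names, and the approach only covers steps that remove columns (steps that create new columns, such as dummy encoding, would need separate handling).

```r
library(tidymodels)

set.seed(123)
samp_size <- 1000

# Same sample data as in the question
sample_data <- tibble(x = rnorm(samp_size)) %>%
  mutate(
    y  = 3 + 2 * x + rnorm(samp_size, 0, .5),
    x2 = x + rnorm(samp_size, 0, .1),
    x3 = c(rep(0, samp_size - 1), 1)
  )

prepped_recipe <- recipe(y ~ ., sample_data) %>%
  step_nzv(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  prep()

# Predictors as declared when the recipe was defined
vars_before <- summary(prepped_recipe, original = TRUE) %>%
  filter(role == "predictor") %>%
  pull(variable)

# Predictors that remain after all steps have been applied
vars_after <- summary(prepped_recipe) %>%
  filter(role == "predictor") %>%
  pull(variable)

dropped <- setdiff(vars_before, vars_after)
kept    <- intersect(vars_before, vars_after)

# Report only on the predictors that survived the steps
sample_data %>%
  select(all_of(kept)) %>%
  summary()
```

In this example, kept should contain x2 and dropped should contain x and x3.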
