AJF
October 24, 2019, 6:31pm
1
Hi,
Does anyone know of a good way to exclude certain calculated variables from later steps in recipes? My specific use case is that I create a dummy variable out of a character variable, and then I want to center all numeric variables. However, I don't want to center the dummy variable. Look at the example below:
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)
library(nycflights13)
small_df <- nycflights13::flights %>%
  select(dep_delay, arr_delay, air_time, origin)
head(small_df)
#> # A tibble: 6 x 4
#> dep_delay arr_delay air_time origin
#> <dbl> <dbl> <dbl> <chr>
#> 1 2 11 227 EWR
#> 2 4 20 227 LGA
#> 3 2 33 160 JFK
#> 4 -1 -18 183 JFK
#> 5 -6 -25 116 LGA
#> 6 -4 12 150 EWR
rec <- recipe(air_time ~ ., data = small_df)
rec2 <- rec %>%
  step_dummy(origin) %>%
  step_center(all_predictors())
prepped_small <- prep(rec2, small_df) %>% juice()
head(prepped_small)
#> # A tibble: 6 x 5
#> dep_delay arr_delay air_time origin_JFK origin_LGA
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -10.6 4.10 227 -0.330 -0.311
#> 2 -8.64 13.1 227 -0.330 0.689
#> 3 -10.6 26.1 160 0.670 -0.311
#> 4 -13.6 -24.9 183 0.670 -0.311
#> 5 -18.6 -31.9 116 -0.330 0.689
#> 6 -16.6 5.10 150 -0.330 -0.311
Created on 2019-10-24 by the reprex package (v0.3.0)
origin_JFK should have values 0 and 1, not -0.33 and 0.67.
Is there a direct way to do it in recipes?
Thanks,
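Hi @AJF,
Yes, you can do this in recipes. One way is to flip the order of your steps and center only the numeric columns, excluding the outcome (drop -all_outcomes() if you also want the outcome centered). Because origin is still a character column when step_center() runs, it is not selected, and the dummy variables created afterwards keep their 0/1 values.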
rec2 <- rec %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_dummy(origin)
prepped_small <- prep(rec2, small_df) %>% juice()
head(prepped_small)
#> # A tibble: 6 x 5
#> dep_delay arr_delay air_time origin_JFK origin_LGA
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -10.6 4.10 227 0 0
#> 2 -8.64 13.1 227 0 1
#> 3 -10.6 26.1 160 1 0
#> 4 -13.6 -24.9 183 1 0
#> 5 -18.6 -31.9 116 0 1
#> 6 -16.6 5.10 150 0 0
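If you are on a newer version of recipes, the same idea can be written a bit more compactly. The sketch below is only an illustration on a made-up toy data frame (toy_df and its values are my assumptions, not the flights data). It uses all_numeric_predictors(), which selects numeric predictors and leaves the outcome alone, and bake(new_data = NULL), which newer recipes versions recommend in place of juice().

```r
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)

# Toy stand-in for the flights subset; the values are invented for illustration
toy_df <- data.frame(
  dep_delay = c(2, 4, 2, -1),
  air_time  = c(227, 227, 160, 183),
  origin    = factor(c("EWR", "LGA", "JFK", "JFK"))
)

rec_toy <- recipe(air_time ~ ., data = toy_df) %>%
  # Center first: origin is not numeric yet, so it is not selected here
  step_center(all_numeric_predictors()) %>%
  # Create the dummies afterwards so they are never centered
  step_dummy(origin)

baked_toy <- prep(rec_toy, training = toy_df) %>%
  bake(new_data = NULL)

# Sanity check: the dummy columns should contain only 0s and 1s
dummy_is_binary <- all(baked_toy$origin_JFK %in% c(0, 1)) &&
  all(baked_toy$origin_LGA %in% c(0, 1))
dummy_is_binary
```

The check at the end is just a quick way to confirm that the dummy columns were left alone; it is not something recipes requires.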
Max
October 24, 2019, 7:48pm
3
You can also exclude the dummy variables from the centering step by name if you have to create them before normalization:
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)
library(nycflights13)
small_df <- nycflights13::flights %>%
  select(dep_delay, arr_delay, air_time, origin)
head(small_df)
#> # A tibble: 6 x 4
#> dep_delay arr_delay air_time origin
#> <dbl> <dbl> <dbl> <chr>
#> 1 2 11 227 EWR
#> 2 4 20 227 LGA
#> 3 2 33 160 JFK
#> 4 -1 -18 183 JFK
#> 5 -6 -25 116 LGA
#> 6 -4 12 150 EWR
rec <- recipe(air_time ~ ., data = small_df)
rec2 <- rec %>%
  step_dummy(origin) %>%
  step_center(all_predictors(), -starts_with("origin"))
prepped_small <- prep(rec2, small_df) %>% juice()
head(prepped_small)
#> # A tibble: 6 x 5
#> dep_delay arr_delay air_time origin_JFK origin_LGA
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -10.6 4.10 227 0 0
#> 2 -8.64 13.1 227 0 1
#> 3 -10.6 26.1 160 1 0
#> 4 -13.6 -24.9 183 1 0
#> 5 -18.6 -31.9 116 0 1
#> 6 -16.6 5.10 150 0 0
Created on 2019-10-24 by the reprex package (v0.3.0)
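If you want to confirm which columns a centering step actually picked up, one option is to call tidy() on the prepped recipe. This is a rough sketch on a made-up toy data frame (toy_df and its values are assumptions for illustration only); number = 2 refers to the second step in this particular recipe, which is step_center().

```r
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)

# Toy stand-in for the flights subset; the values are invented for illustration
toy_df <- data.frame(
  dep_delay = c(2, 4, 2, -1),
  arr_delay = c(11, 20, 33, -18),
  air_time  = c(227, 227, 160, 183),
  origin    = factor(c("EWR", "LGA", "JFK", "JFK"))
)

prepped_toy <- recipe(air_time ~ ., data = toy_df) %>%
  step_dummy(origin) %>%
  # Exclude the dummy columns by their shared name prefix
  step_center(all_predictors(), -starts_with("origin")) %>%
  prep(training = toy_df)

# tidy() on step 2 returns one row per column that step_center() selected
centered_terms <- tidy(prepped_toy, number = 2)$terms
centered_terms
```

If any name starting with origin shows up in centered_terms, the exclusion did not work as intended.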
AJF
October 24, 2019, 9:06pm
4
Thanks @Max and @mattwarkentin! I appreciate your help! I had gotten so caught up in trying to use the role = argument in step_dummy() that I lost sight of simpler methods.