When building a recipe, how can I replace missing values by imputing the mode in numeric variables that contain only 0 or 1?

emman · December 23, 2021, 2:58pm

I'm building a recipe, and I need to address missing values in binary variables. Those variables contain either 1, 0, or NA. I want to use imputation to replace the NA values, and found step_impute_mode().

However, step_impute_mode() accepts only nominal variables (i.e., of class factor or character). I could first use step_num2factor() and then step_impute_mode(), but that leaves me with variables of class factor, whereas the model engine requires them to be numeric. As far as I can see, the recipes package doesn't have a step_*() function that converts from factor back to numeric.

So my question is: how can I replace NA by imputing the mode in numeric variables that have values 0 and 1?

Thanks!

EDIT

Here's some toy data to demonstrate the situation. I would like to write a recipe for the formula y ~ ., and I want to replace the missing values in x1 and x2 with the mode of each column.
Furthermore, I want x1 and x2 to remain numeric after the imputation. How can I do this using the recipes package?

set.seed(123)
x1 <- rbinom(100, 1, runif(1))
x2 <- rbinom(100, 1, runif(1))
y  <- rbinom(100, 1, runif(1))

# sprinkle some NAs
my_df <- data.frame(y, x1, x2) 
my_df[c("x1", "x2")] <-
  lapply(my_df[c("x1", "x2")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })

head(my_df)
#>   y x1 x2
#> 1 1  1  0
#> 2 1  0 NA
#> 3 0  1  0
#> 4 1 NA  1
#> 5 1 NA  1
#> 6 1 NA NA

Created on 2021-12-23 by the reprex package (v2.0.1.9000)

joshua31 · December 24, 2021, 2:22am

Could you just use a custom mode function to impute the NA values, along these lines?

my_df$x1[is.na(my_df$x1)] <- mymodefunction(my_df$x1[!is.na(my_df$x1)])
my_df$x2[is.na(my_df$x2)] <- mymodefunction(my_df$x2[!is.na(my_df$x2)])
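
For reference, mymodefunction() is not defined anywhere in this thread and is not part of base R or recipes. A minimal sketch of what such a helper might look like, assuming ties should go to the smaller value:

```r
# Hypothetical helper: returns the most frequent non-missing value of x.
# table() orders its entries by value and which.max() takes the first
# maximum, so ties resolve to the smaller value.
mymodefunction <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

x <- c(1, 0, NA, 1, NA)
x[is.na(x)] <- mymodefunction(x)
x
#> [1] 1 0 1 1 1
```

Note that this imputes outside of any recipe, so the mode is computed once on whatever data you pass in rather than being estimated during prep().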

emman · December 24, 2021, 10:02am

@joshua31 , thanks, but I'm looking for a solution within a recipe.

Gus · December 24, 2021, 4:15pm

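Try this: convert the 0/1 columns to factors, impute the mode, then convert them back to numeric with step_mutate_at().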

set.seed(123)
x1 <- rbinom(100, 1, runif(1))
x2 <- rbinom(100, 1, runif(1))
y  <- rbinom(100, 1, runif(1))

# sprinkle some NAs
my_df <- data.frame(y, x1, x2)
my_df[c("x1", "x2")] <-
    lapply(my_df[c("x1", "x2")], function(x) {
        x[sample(seq_along(x), 0.25 * length(x))] <- NA
        x
    })

head(my_df)
#>   y x1 x2
#> 1 1  1  0
#> 2 1  0 NA
#> 3 0  1  0
#> 4 1 NA  1
#> 5 1 NA  1
#> 6 1 NA NA

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

imp <- recipe(my_df, y ~ .) %>%
    step_num2factor(all_numeric_predictors(),
                    transform = function(x) x + 1,
                    levels = c("0", "1")) %>%
    step_impute_mode(all_nominal_predictors()) %>%
    step_mutate_at(starts_with("x"), fn = ~ as.numeric(.) - 1)

imp %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 100 x 3
#>       x1    x2     y
#>    <dbl> <dbl> <int>
#>  1     1     0     1
#>  2     0     0     1
#>  3     1     0     0
#>  4     0     1     1
#>  5     0     1     1
#>  6     0     0     1
#>  7     0     0     1
#>  8     0     0     1
#>  9     0     1     1
#> 10     1     0     1
#> # ... with 90 more rows

Created on 2021-12-24 by the reprex package (v2.0.1)
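
A possible variant, in case it helps: as far as I know, more recent releases of recipes mark step_mutate_at() as superseded in favour of step_mutate() with dplyr::across(). The sketch below assumes that, and uses a small stand-in data frame (toy, with made-up values) instead of my_df so it runs on its own; imp2 and baked are names introduced here. The stopifnot() calls check that the imputed columns come back numeric and free of NAs.

```r
library(recipes)

# Small stand-in for my_df (hypothetical values, for illustration only)
toy <- data.frame(
  y  = c(1L, 0L, 1L, 1L, 0L, 1L),
  x1 = c(1L, 0L, NA, 1L, NA, 1L),
  x2 = c(0L, NA, 0L, 1L, 0L, NA)
)

imp2 <- recipe(y ~ ., data = toy) %>%
  # 0/1 -> factor with levels "0"/"1"; the + 1 maps values to level positions
  step_num2factor(all_numeric_predictors(),
                  transform = function(x) x + 1,
                  levels = c("0", "1")) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  # factor codes are 1/2, so subtract 1 to get back to 0/1
  step_mutate(dplyr::across(c(x1, x2), ~ as.numeric(.x) - 1))

baked <- imp2 %>% prep() %>% bake(new_data = NULL)

# Both predictors should be numeric again, with no missing values left
stopifnot(is.numeric(baked$x1), is.numeric(baked$x2))
stopifnot(!anyNA(baked))
```

The conversion back relies on the factor levels being exactly c("0", "1") in that order, so it would need adjusting for any other coding.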

emman · December 25, 2021, 7:23am

Thanks @Gus ! That's the solution I was looking for.
