When a factor has too many levels, step_dummy(one_hot = TRUE) creates only one dummy level. What counts as 'too many'?

suppressWarnings(suppressMessages({
  library(readr)
  #library(dplyr)
  library(lubridate)
  library(tidytable)
  library(tidymodels)
}))

dff = data.frame(yearr = sample(2015:2021, 2000, replace = TRUE),
                 monthh = sample(1:12, 2000, replace = TRUE),
                 dayy = sample(1:29, 2000, replace = TRUE)) |>
  mutate.(datee = ymd(paste(yearr, monthh, dayy)),
          weekk = week(datee),
          quarterr = quarter(datee),
          semesterr = semester(datee),
          doyy = yday(datee),
          y = sample(0:100, 2000, replace = TRUE) + (130 * yearr) + (2 * monthh) + (2 * weekk),
          dummyy = round(sample(0:1, 2000, replace = TRUE))) |>
  filter.(!is.na(datee)) |>
  arrange.(-desc(datee)) |>
  mutate.(ii = row_number()) |>
  select.(-datee)

columns_to_factor = c('yearr', 'monthh', 'quarterr', 'doyy')

dfff = dff |>
  mutate.(across.(.cols = all_of(columns_to_factor),
                  .fns = as.factor,
                  .names = 'factorr_{.col}'))

dffff = dfff |>
  recipe() |>
  step_nzv(all_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep() |>
  bake(NULL)

Main question: I created the above example for testing purposes. As you can see, doyy has a large number of unique values. What is the threshold at which step_dummy(one_hot = TRUE) decides to create only one level?

Additional question: shouldn't one_hot = TRUE create 12 dummies for monthh, etc.? Why doesn't it do that?

There are no predictors in the recipe (every role in the summary below is NA), so all_nominal_predictors() has nothing to select:

suppressWarnings(suppressMessages({
  library(readr)
  #library(dplyr)
  library(lubridate)
  library(tidytable)
  library(tidymodels)
}))

dff = data.frame(yearr = sample(2015:2021, 2000, replace = TRUE),
                 monthh = sample(1:12, 2000, replace = TRUE),
                 dayy = sample(1:29, 2000, replace = TRUE)) |>
  mutate.(datee = ymd(paste(yearr, monthh, dayy)),
          weekk = week(datee),
          quarterr = quarter(datee),
          semesterr = semester(datee),
          doyy = yday(datee),
          y = sample(0:100, 2000, replace = TRUE) + (130 * yearr) + (2 * monthh) + (2 * weekk),
          dummyy = round(sample(0:1, 2000, replace = TRUE))) |>
  filter.(!is.na(datee)) |>
  arrange.(-desc(datee)) |>
  mutate.(ii = row_number()) |>
  select.(-datee)
#> Warning: 2 failed to parse.

columns_to_factor = c('yearr', 'monthh', 'quarterr', 'doyy')

dfff = dff |>
  mutate.(across.(.cols = all_of(columns_to_factor),
                  .fns = as.factor,
                  .names = 'factorr_{.col}'))

dfff |>
  recipe() |>  # <- set roles here
  step_nzv(all_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep() %>% 
  summary()
#> # A tibble: 14 × 4
#>    variable         type    role  source  
#>    <chr>            <chr>   <chr> <chr>   
#>  1 yearr            numeric <NA>  original
#>  2 monthh           numeric <NA>  original
#>  3 dayy             numeric <NA>  original
#>  4 weekk            numeric <NA>  original
#>  5 quarterr         numeric <NA>  original
#>  6 semesterr        numeric <NA>  original
#>  7 doyy             numeric <NA>  original
#>  8 y                numeric <NA>  original
#>  9 dummyy           numeric <NA>  original
#> 10 ii               numeric <NA>  original
#> 11 factorr_yearr    nominal <NA>  original
#> 12 factorr_monthh   nominal <NA>  original
#> 13 factorr_quarterr nominal <NA>  original
#> 14 factorr_doyy     nominal <NA>  original

Created on 2022-03-07 by the reprex package (v2.0.1)
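
To make the "set roles here" comment concrete, here is a minimal sketch of one way to assign roles: passing a formula to recipe(). The toy data frame and its column names (toy, y, month) are hypothetical stand-ins for dfff, and a formula is only one option; update_role() is another.

```r
library(recipes)

# Hypothetical toy data (not the poster's dfff): a numeric outcome
# and a factor in which all 12 levels occur.
toy <- data.frame(
  y     = seq_len(120),
  month = factor(rep(1:12, times = 10))
)

# No formula and no roles: all_nominal_predictors() matches nothing,
# so step_dummy() leaves the factor column untouched.
no_roles <- recipe(toy) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL)

# The formula marks y as the outcome and every other column as a predictor,
# so the factor is now selected and one-hot encoded.
with_roles <- recipe(y ~ ., data = toy) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep() |>
  bake(new_data = NULL)

names(no_roles)
names(with_roles)
```

Comparing the two sets of column names should show whether the factor was expanded. Applied to the original example, the analogous change would be recipe(y ~ ., data = dfff), assuming y is the intended outcome.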
