Error in future_map: argument ".f" is missing, with no default

landRower · March 16, 2022, 11:50pm

Hi,
Requesting your help or expert opinion on a parallelization issue I am facing.

I regularly run an XGBoost classifier on a rather large dataset (dim(train_data) = 357,401 x 281; after recipe prep() the dimensions are 147,304 x 1159) to predict student retention risk. In base R the model runs in just over 4 hours using registerDoParallel() with all 24 cores of my server. I am now trying to run it in the tidymodels environment, but I have yet to find a robust parallelization option for tuning the grid.

I attempted the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g. 20% of the data), but options 1-4 fail when I run the entire dataset, mostly due to memory allocation issues.

1. makePSOCKcluster(), library(doParallel)
2. registerDoFuture(), library(doFuture)
3. doMC::registerDoMC()
4. plan(cluster, workers), doFuture, parallel
5. registerDoParallel(), library(doParallel)
6. future::plan(multisession), library(furrr)

Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid.
I would like to draw your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. However, this method worked only once (the successful code is included below; note that I have incorporated a racing method and a stopping grid into the tuning).

doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = T) 

tic()
race_rs <-  future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples     = cv_folds,
    metrics       = xgb_metrics,
    grid          = stopping_grid,
    control       = control_race(
      verbose       = TRUE,
      verbose_elim  = TRUE,
      allow_par     = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = T,
  .options = furrr_options(packages = "parsnip"),
)
toc()

Interestingly, after that one success, all subsequent attempts have failed, always with the same error (below). Each time, the tuning progresses through all CV folds (n=5) and runs until the racing method has eliminated all but 1 parameter, but then it fails with the error below.

Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default

The OS & Version details I use are as follows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

I am intrigued by how the furrr/future option worked once but has failed in every attempt since.
I have tried using the development version of tune.

Any help or advice on parallelization options will be greatly appreciated.

Thanks
Rj

mattwarkentin · March 17, 2022, 3:31pm

Hi @landRower,

When using any of the {purrr} or {furrr} map functions, you have three choices for supplying the function (i.e. .f).

Let's assume we have one list/vector we are iterating over called x and we are using the function foo on each item:

1. If x is being passed to the function in the first argument position, you can simply use map(x, foo). You can pass non-varying arguments into the ... of map, such as map(x, foo, other_arg = 'value').
2. If x is being passed to any other argument position, you can use the formula notation with the .x placeholder, like map(x, ~ foo(first = something, second = .x)).
3. Lastly, you can use an anonymous function which takes x as its first argument: map(x, function(x) foo(x)).
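To make the three styles concrete, here is a small sketch using purrr. The function foo and its arguments are hypothetical and exist only for illustration; the same patterns should apply to the furrr equivalents.

```r
library(purrr)

# Hypothetical function used only to illustrate the three styles.
foo <- function(first, second = 1) first * second

x <- c(1, 2, 3)

# 1. x goes to the first argument; extra arguments travel through `...`.
by_name <- map_dbl(x, foo, second = 10)

# 2. Formula notation: x goes to another argument via the .x placeholder.
by_formula <- map_dbl(x, ~ foo(first = 2, second = .x))

# 3. Anonymous function taking x as its first argument.
by_anon <- map_dbl(x, function(x) foo(x))
```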

With that out of the way: future_map_dfr requires two arguments, .x and .f. Your code, future_map_dfr(tune_race_anova(...)), passes only .x (the result of tune_race_anova()), so .f is missing.
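A minimal sketch of that failure mode, using purrr::map as a stand-in for future_map_dfr (an assumption made to keep the example light; both require .f). The helper expensive_call is hypothetical and stands in for tune_race_anova(...).

```r
library(purrr)

# Hypothetical stand-in for tune_race_anova(...): returns a small result list.
expensive_call <- function() list(a = 1, b = 2)

# Mirrors the original call shape: the single positional argument is matched
# to .x, and .f is never supplied. The error is caught so it can be inspected.
outcome <- tryCatch(
  map(expensive_call()),
  error = function(e) conditionMessage(e)
)

# Supplying .f makes the call valid.
fixed <- map(expensive_call(), function(el) el * 2)
```

In the original post the tuning appears to have run in full before the error surfaced. That would be consistent with the first argument being evaluated before the missing .f is noticed, though the exact order may depend on package versions.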

Assuming you are iterating over xgb_earlystop_wf, something like this is syntactically correct code (though I don't think it will run, see below):

race_rs <-  future_map_dfr(
  .x = xgb_earlystop_wf,
  .f = ~ tune_race_anova(
    object = .x,
    resamples     = cv_folds,
    metrics       = xgb_metrics,
    grid          = stopping_grid,
    control       = control_race(
      verbose       = TRUE,
      verbose_elim  = TRUE,
      allow_par     = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = T,
  .options = furrr_options(packages = "parsnip"),
)

As someone who uses the tidymodels ecosystem a lot, I am not sure what exactly you are trying to achieve with the above code. I think you only need to use this code on its own:

tune_race_anova(
  xgb_earlystop_wf,
  resamples     = cv_folds,
  metrics       = xgb_metrics,
  grid          = stopping_grid,
  control       = control_race(
    verbose       = TRUE,
    verbose_elim  = TRUE,
    allow_par     = TRUE,
    parallel_over = 'everything'
  )
)

The parallelization happens internally, so you do not need to use furrr/future to parallelize manually.
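As a rough sketch of what "internally" means here, assuming a version of tune that picks up a registered foreach backend (as was the case around the time of this thread): you register the backend once, and any foreach-based code that runs afterwards uses it. The two-worker cluster and the toy loop below are illustrative assumptions, not the poster's setup.

```r
library(doParallel)  # also attaches foreach and parallel

# Start a small PSOCK cluster and register it as the foreach backend.
cl <- parallel::makePSOCKcluster(2)
registerDoParallel(cl)

# Number of workers that foreach-based code will now use.
n_workers <- foreach::getDoParWorkers()

# Toy parallel loop standing in for the work the tuning function dispatches.
# In the real workflow, the tune_race_anova(...) call would go here instead,
# with no future_map_dfr() wrapper around it.
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

# Release the workers and fall back to sequential execution.
parallel::stopCluster(cl)
foreach::registerDoSEQ()
```

Using fewer workers than available cores is one way to trade speed for memory, since each PSOCK worker holds its own copy of the data it needs.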
