Hi,
I would appreciate your help or expert opinion on a parallelization issue I am facing.
I regularly run an XGBoost classifier on a rather large dataset (dim(train_data) = 357,401 x 281; dimensions after recipe prep() are 147,304 x 1159) to predict student retention risk. In base R the model runs in just over 4 hours using registerDoParallel() with all 24 cores of my server. I am now trying to run it in the tidymodels framework, but I have yet to find a robust parallelization option for tuning the grid.
I attempted the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g. 20% of the data). Numbering them in the order listed below, options 1-4 fail when I run the entire dataset, mostly due to memory allocation issues.
- makePSOCKcluster(), library(doParallel)
- registerDoFuture(), library(doFuture)
- doMC::registerDoMC()
- plan(cluster, workers), doFuture, parallel
- registerDoParallel(), library(doParallel)
- future::plan(multisession), library(furrr)
Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid.
I would like to draw your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. However, it worked only once (the successful code is included below; please note that I have incorporated a racing method and a stopping grid into the tuning).
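For context on the memory failures: socket-based backends (makePSOCKcluster(), plan(multisession), plan(cluster)) start separate R processes, and each worker holds its own copy of the data it needs. The sketch below is only a rough way to think about how many workers a given amount of RAM might support. The helper name workers_for_budget, the 3x overhead factor, and the 64 GiB budget are all hypothetical assumptions, not measured values.

```r
# Hypothetical helper: cap the number of workers so that one copy of the data
# per worker (times an assumed overhead factor) fits inside a memory budget.
workers_for_budget <- function(data_bytes, budget_bytes, max_workers, overhead = 3) {
  per_worker <- data_bytes * overhead            # assumed footprint of one worker
  fit <- floor(budget_bytes / per_worker)        # workers that fit in the budget
  as.integer(max(1, min(max_workers, fit)))      # at least 1, at most max_workers
}

# Size of the prepped training data if stored as a dense double matrix
# (147,304 rows x 1159 columns, 8 bytes per value).
data_bytes <- 147304 * 1159 * 8

# Assumed budget of 64 GiB on a 24-core server (hypothetical figure).
n_workers <- workers_for_budget(data_bytes,
                                budget_bytes = 64 * 1024^3,
                                max_workers = 24)
```

The resulting n_workers could then be passed to parallel::makePSOCKcluster() and registered with doParallel::registerDoParallel() in place of all 24 cores. Fewer workers means slower tuning, so this trades speed for a lower peak memory footprint.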
doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = T)
tic()
race_rs <- future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples = cv_folds,
    metrics = xgb_metrics,
    grid = stopping_grid,
    control = control_race(
      verbose = TRUE,
      verbose_elim = TRUE,
      allow_par = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = T,
  .options = furrr_options(packages = "parsnip"),
)
toc()
Interestingly, after that one success, all subsequent attempts have failed, always with the same error (below). Each time, the tuning progresses through all CV folds (n = 5) and runs until the racing method has eliminated all but one parameter, but it then fails with the following error:
Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default
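The error message itself can be reproduced in base R, which may help narrow things down. In the call above, the whole tune_race_anova(...) expression is passed as the first argument (.x) of future_map_dfr(), and no function is supplied as .f. The sketch below uses a hypothetical stand-in, map_like, with the same (.x, .f) signature; it is not furrr's implementation, and the order in which it touches .x and .f is an assumption chosen to mirror the behaviour described (the tuning finishes, then the error appears).

```r
# Hypothetical stand-in for a map-style function with a (.x, .f) signature.
# Assumption: .x is used before .f, mirroring the observed order of events.
map_like <- function(.x, .f, ...) {
  n <- length(.x)        # forces .x, so the expensive call runs to completion here
  lapply(.x, .f, ...)    # .f is first needed here, and it was never supplied
}

calls <- 0
slow_tuning <- function() {   # hypothetical placeholder for tune_race_anova(...)
  calls <<- calls + 1         # record that the expensive work actually ran
  list(a = 1, b = 2)
}

# Same call shape as in the post: an expression as .x, and no .f at all.
msg <- tryCatch(
  {
    map_like(slow_tuning())
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(calls)  # the placeholder ran before the error was raised
print(msg)    # the error complains about the missing .f argument
```

If this reading is right, the tuning itself is being parallelized by the registered doParallel backend rather than by furrr, and the error comes only from the outer wrapper. Calling tune_race_anova() directly and assigning its result to race_rs would be the thing to try, though that is a suggestion rather than something confirmed here.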
My OS and R version details are as follows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
I am intrigued by how the furrr/future option worked once but has failed in every attempt since.
I have also tried using the development version of tune.
Any help or advice on parallelization options will be greatly appreciated.
Thanks
Rj