Parallelization with `future`'s `plan(multisession)` and `tune`

I'm trying to parallelize my evaluation pipeline. According to `tune`'s documentation, the `future` framework is supported, and parallelization over splits/parameters can be enabled by calling `plan(multisession)` before calling `tune_grid()` or `fit_resamples()`.
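For reference, here is a minimal sketch of the setup I mean (the data, model, and worker count are placeholders, not my actual pipeline):

```r
library(tidymodels)
library(future)

plan(multisession, workers = 4)  # register parallel workers before tuning

folds <- vfold_cv(mtcars, v = 5)
spec  <- linear_reg() %>% set_engine("lm")

res <- fit_resamples(spec, mpg ~ ., resamples = folds)
```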

However, when I try to use it, it takes forever, and at some point R complains about exceeding the 500 MB memory limit for each worker. The message explicitly names the `tune` function as one of the main culprits, which by itself already takes up half a gigabyte. I assume `future` is loading all of the tidyverse dependencies into each worker? How can I avoid that? It isn't intended to work like that, is it?

Thanks for reading, looking forward to your advice!

The memory requirements have not changed. Since we moved to `future`, it reports that warning unless you adjust the threshold using `options()`.
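For example, something like this raises the cap (the value is in bytes; 1 GB here is just an example, pick whatever fits your setup):

```r
# future.globals.maxSize caps the size of globals exported to each worker;
# the default is roughly 500 MB. Set it before kicking off tuning.
options(future.globals.maxSize = 1024^3)  # ~1 GB
```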

It is difficult to tell why things might be moving slowly, though. Can you tell us about your hardware and the data, and show the code that you are using? Otherwise, we'll have no idea.

I/we have been using it for some time with various analyses and data sets and have not experienced what you have reported, so we are interested in knowing what more we should be testing.

Hi Max,
thanks for your reply! I must confess that I'm also kind of misusing GitHub as a forum, because you are actually the first person who has replied to me on here. I created a GitHub issue about this problem in tune, where I also provide more information:

I figured out a workaround: for some reason, the problem doesn't occur when I copy my data into a new matrix object first. So I guess the problem is not related to tidymodels, but to how future detects global dependencies? Please let me know if you need more information. It's sometimes difficult for me to provide a reprex, since the project where I'm encountering these issues is quite specific, and it's challenging to boil it down to a minimal version.
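In case it helps, the workaround looks roughly like this (object names and data are stand-ins for my actual project):

```r
# Copying the data into a fresh matrix (and back) seems to drop whatever
# extra attributes or environments future's globals detection was
# otherwise capturing along with the object.
x_raw <- mtcars                            # stand-in for the real data frame
x_new <- as.data.frame(as.matrix(x_raw))   # fresh copy without extra baggage
```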

I appreciate the issue and your looking into the problem. Some of the values that you are finding are highly inflated, since the measurement includes anything in your global environment at the time that you ran the code.

The limit that we have been hitting is related to system files, including things in packages attached by future. It doesn't include the data or anything that we pass to the functions being parallelized. This is variable for tidymodels: if you fit a model using package `foo`, and `foo` requires a lot of memory, that is part of the cost of using `foo`.

I've made a pull request to change the limit when tune is loaded. You can still move it down or up but, based on my usage over the last year, the new limit should cover the dependencies that we use.