Parallelization with `future`'s `plan(multisession)` and `tune`

I'm trying to parallelize my evaluation pipeline. According to `tune`'s documentation, the `future` framework is supported, and parallelization over splits/parameters can be enabled by calling `plan(multisession)` before calling `tune_grid()` or `fit_resamples()`.
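For reference, here is a minimal sketch of the setup I mean (the data, model, and worker count are placeholders, not my actual pipeline):

```r
library(tidymodels)
library(future)

plan(multisession, workers = 4)  # register parallel workers before tuning

folds <- vfold_cv(mtcars, v = 5)
spec  <- linear_reg() %>% set_engine("lm")

res <- fit_resamples(spec, mpg ~ ., resamples = folds)
```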

However, when I try to use it, it takes forever, and at some point R complains about exceeding the 500 MB memory limit for each worker. The message explicitly names the `tune` function as one of the main culprits, which by itself already takes up half a gigabyte. I assume `future` is loading all of the tidyverse dependencies into each worker? How can I avoid that? It isn't intended to work like that, is it?

Thanks for reading, looking forward to your advice!

The memory requirements have not changed. Since we moved to `future`, it reports that warning unless you adjust the threshold using `options()`.
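For example, something like this raises the cap (the value is in bytes; 1 GB here is just an example, pick whatever fits your setup):

```r
# future.globals.maxSize caps the size of globals exported to each worker;
# the default is roughly 500 MB. Set it before kicking off tuning.
options(future.globals.maxSize = 1024^3)  # ~1 GB
```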

It is difficult to tell why things might be moving slowly, though. Can you tell us about your hardware and the data, and show the code that you are using? Otherwise, we'll have no idea.

I/we have been using it for some time with various analyses and data sets and have not experienced what you have reported, so we are interested in knowing what more we should be testing.

Hi Max,
thanks for your reply! I must confess that I'm also kind of misusing GitHub as a forum, because you are actually the first person who has replied to me on here. I created a GitHub issue about this problem in tune, where I also provide more information:

I figured out a workaround: for some reason, the problem doesn't occur when I copy my data into a new matrix object first. So I guess the problem is not related to tidymodels, but to how future detects global dependencies? Please let me know if you need more information. It's sometimes difficult for me to provide a reprex, since the project where I'm encountering these issues is quite specific, and it's challenging to boil it down to a minimal version.
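In case it helps, the workaround looks roughly like this (object names and data are stand-ins for my actual project):

```r
# Copying the data into a fresh matrix (and back) seems to drop whatever
# extra attributes or environments future's globals detection was
# otherwise capturing along with the object.
x_raw <- mtcars                            # stand-in for the real data frame
x_new <- as.data.frame(as.matrix(x_raw))   # fresh copy without extra baggage
```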

I appreciate the issue and your looking into the problem. Some of the values that you are finding are highly inflated, since the measurement includes anything in your global environment at the time that you ran the code.

The limit that we have been hitting is related to system files, including things in packages attached by future. It doesn't include the data or anything that we pass to the functions being parallelized. This is variable for tidymodels: if you fit a model using package `foo`, and `foo` requires a lot of memory, that is part of the cost of using `foo`.

I've made a pull request to change the limit when tune is loaded. You can still move it down or up but, based on my usage over the last year, the new limit should cover the dependencies that we use.