Trying to use furrr for parallel processing but can only see one processor working

dougfir · December 12, 2019, 3:30am

I'm working within hosted RStudio on a Linux EC2 server with 4 processors. In the past when I've used packages such as XGBoost or foreach, which use parallel processing, I have been able to watch all 4 processors light up by running top in the Linux terminal and pressing 1.

When I do that now here's what I see:

There's only one processor working. I expected to see multiple processors at work like I have in the past.

Here's my code block. Hoping someone can recognize what I'm not doing right here? This is my first time using furrr.

library(rsample)
library(dplyr) # mutate() and %>%
library(purrr) # map()
library(pscl) # hurdle and zero inflated models
library(furrr) # parallel processing
plan(multicore) # also tried multiprocess but no change, only one processor seems to run

# raise the size limit for exported globals (set below); otherwise error:
# Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression,  : 
#   The total size of the 9 globals that need to be exported for the future expression (‘{; ...future.f.env <- environment(...future.f); if (!is.null(...future.f.env$`~`)) {; if (is_bad_rlang_tilde(...future.f.env$`~`)) {; ...future.f.env$`~` <- base::`~`; }; ...; .out; }); }’) is 1.76 GiB. This exceeds the maximum allowed size of 1.46 GiB (option 'future.globals.maxSize'). The three largest globals are ‘...future.x_ii’ (1.76 GiB of class ‘list’), ‘is_bad_rlang_tilde’ (15.05 KiB of class ‘function’) and ‘...future.map’ (6.61 KiB of class ‘function’).
options(future.globals.maxSize = 2000 * 1024^2)

# create train test split
set.seed(42)
pdata_split <- initial_split(pdata, 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)


# cross validation folds
pdata_cv <- vfold_cv(training_data, 5, strata = spend_30d) %>% 
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         test = map(splits, ~testing(.x))) %>% 
  
  # hurdle model for each fold
  mutate(hurdle_model = furrr::future_map(train, 
                            ~hurdle(formula = spend_30d ~., 
                                    data = .x,
                                    dist = "negbin")))

dougfir · December 12, 2019, 4:01am

I got it working by moving plan(multiprocess) to directly above the first line of the dplyr chain of commands, the one starting pdata_cv <- vfold_cv(training_data, 5, strata = spend_30d) %>%

Would be nice to understand why this is though.

cderv · December 12, 2019, 7:47am

About multicore, you can check whether it is supported in your setup:

future::supportsMulticore()

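It isn't supported when R runs inside RStudio, for example; in that case plan(multicore) falls back to sequential processing, so only one processor is used.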

Also, you did not specify the number of workers, so the default is future::availableCores(constraints = "multicore"), which depends on your setup.
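The two points above can be sketched roughly as follows. This is only an illustration, not a tested fix for the original setup: the names can_fork and n_workers are made up for the example, and it assumes you want to use every core that future reports as available.

```r
library(future)

# TRUE only where forked processing is supported;
# FALSE on Windows and when R runs inside RStudio.
can_fork <- supportsMulticore()

# Number of cores future considers available to this session.
n_workers <- availableCores()

# Prefer forking where it is supported; otherwise use background R sessions.
# The number of workers is passed explicitly instead of relying on the default.
if (can_fork) {
  plan(multicore, workers = n_workers)
} else {
  plan(multisession, workers = n_workers)
}

# How many workers the current plan will actually use.
nbrOfWorkers()
```

If nbrOfWorkers() reports 1, the plan is effectively sequential and only one processor will be busy.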

If multicore isn't supported, multisession will use separate background R processes instead. You can check the running processes while your code runs to see whether it is actually working.

I don't know why moving the plan call closer to the pipe would help...
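Besides watching top, one way to check this from inside R is to have each task report the ID of the process that ran it. A minimal sketch, assuming a multisession plan with 2 workers (the worker count and the name worker_pids are arbitrary choices for the example):

```r
library(future)
library(furrr)

# Two background R sessions; chosen only to keep the example small.
plan(multisession, workers = 2)

# Each element returns the process ID of the R process that handled it.
worker_pids <- future_map_int(1:4, ~ Sys.getpid())

# PIDs that differ from the main session's PID indicate the work
# was done in separate R processes.
unique(worker_pids)
Sys.getpid()

# Shut the background workers down again.
plan(sequential)
```

If every element of worker_pids equals Sys.getpid(), the map ran in the main session, which matches the single busy processor described in the question.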

Hope it helps.
