Errors when running brulee in parallel via parsnip model specification

davidli · February 29, 2024, 8:07pm

I am trying to tune a neural network model using brulee. When I run it in parallel, I get the following error messages.

libgomp: Thread creation failed: Resource temporarily unavailable
Error in unserialize(node$con) :
MultisessionFuture (doFuture-3) failed to receive message results from cluster RichSOCKnode #3 (PID 262238 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 39 globals exported is 1.41 MiB. The three largest globals are ‘fn_tune_grid_loop_iter’ (356.12 KiB of class ‘function’), ‘predict_model’ (261.09 KiB of class ‘function’) and ‘metrics’ (167.09 KiB of class ‘function’)
libgomp: Thread creation failed: Resource temporarily unavailable

Here is the code I am trying to run.

# Model specification
brulee_spec <-
  mlp(hidden_units = tune(),
      penalty = tune(),
      epochs = tune()) |>
  set_engine("brulee") |> 
  set_mode("classification") 

# Parallel backend
library(doFuture)
registerDoFuture()
parallelly::availableCores()
plan(multisession, workers = 12)

# Tuning
tune_res <- tune_grid(
  brulee_wf, 
  grid = param_grid, 
  resamples = ten_fold_cv, 
  metrics = metric_set(accuracy, bal_accuracy, j_index, mn_log_loss, brier_class, roc_auc, kap, mcc)
)

With the same code, if I replace the engine with "nnet", everything runs without problems.

If I remove the parallel backend, i.e., run it sequentially, the following warning is displayed:
→ A | warning: Current loss in NaN. Training wil be stopped.
There were issues with some computations A: x5

Max · February 29, 2024, 8:25pm

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you!

If you've never heard of a reprex before, start by reading "What is a reprex", and follow the advice further down that page.

davidli · March 3, 2024, 3:54pm

I tried hard to reproduce this error with reprex but could not, partly because the error is tricky to trigger. Still, I think the following code is a minimal example that reproduces it.

Here, we create a recipe to normalize the numeric variables, a parsnip model, and a workflow to fit the model. The example uses the iris dataset, with Species as the target variable and the remaining variables as predictors.

library(brulee)
library(tidymodels)
library(doFuture)
library(tictoc)

# Create a recipe
iris_recipe <- 
  recipe(Species ~ ., data = iris) %>%
  # Add normalization step for numeric variables
  step_normalize(all_numeric_predictors())

# Create a parsnip model using brulee
iris_spec <-
  mlp(hidden_units = tune(),
      penalty = tune(),
      epochs = tune()) |>
  set_engine("brulee") |> 
  set_mode("classification") 

iris_spec

# Create a workflow
iris_wf <- workflow() |> 
  add_recipe(iris_recipe) |> 
  add_model(iris_spec)

iris_wf

# Create a grid for hyperparameter tuning
param_grid <- grid_regular(
  hidden_units(range = c(10, 100)),
  epochs(range = c(100, 1200)),
  penalty(range = c(-5, 0)),
  levels = 10
)

# Create a cross-validation object
rset <- vfold_cv(iris, strata = Species, repeats = 2)

# Create a parallel backend
registerDoFuture()
parallelly::availableCores()
plan(multisession, workers = 12)
plan()

Now, we will fit the model using the iris_wf and the rset cross-validation object.

# Fit the model
tictoc::tic()
iris_res <- tune_grid(iris_wf, resamples = rset, grid = param_grid)
tictoc::toc()

This throws the following error:
sh: fork: retry: Resource temporarily unavailable
sh: fork: retry: Resource temporarily unavailable
sh: fork: Resource temporarily unavailable
terminate called without an active exception
Error in unserialize(node$con) :
MultisessionFuture (doFuture-11) failed to receive message results from cluster RichSOCKnode #11 (PID 1406395 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 39 globals exported is 1.64 MiB. The three largest globals are ‘fn_tune_grid_loop_iter’ (356.12 KiB of class ‘function’), ‘grid_info’ (264.48 KiB of class ‘list’) and ‘predict_model’ (261.09 KiB of class ‘function’)
terminate called without an active exception

Here is the platform information.
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.7 (Green Obsidian)

If param_grid is reduced as below, tuning completes without any issue.

param_grid <- grid_regular(
  hidden_units(range = c(10, 20)),
  epochs(range = c(100, 200)),
  penalty(range = c(-5, 0)),
  levels = 3
)
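To put the two grids in perspective, here is a rough count of the model fits each one asks for. This is only a back-of-the-envelope sketch: it assumes one fit per grid candidate per resample, and that vfold_cv() uses its default of 10 folds (so repeats = 2 gives 20 resamples). The variable names are just for illustration.

```r
# Rough count of model fits requested by each grid (illustrative only).
n_params    <- 3        # hidden_units, epochs, penalty
n_resamples <- 10 * 2   # assumed: default v = 10 folds, repeats = 2

candidates_full  <- 10^n_params   # levels = 10 -> 1000 candidates
candidates_small <- 3^n_params    # levels = 3  -> 27 candidates

fits_full  <- candidates_full * n_resamples    # 20000 fits
fits_small <- candidates_small * n_resamples   # 540 fits

cat("full grid: ", fits_full, "fits\n")
cat("small grid:", fits_small, "fits\n")
cat("ratio:     ", round(fits_full / fits_small, 1), "\n")
```

So the full grid requests roughly 37 times as many fits as the small one, and many of them train for up to 1200 epochs instead of at most 200.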

But once the error appears, even a bare request for parallel workers, plan(multisession, workers = 3), throws the error "sh: fork: retry: Resource temporarily unavailable".

After the error appears, I can recover by killing all the R-related processes from the terminal with the following command:

pgrep -u XXX  -f '/usr/local/apps/R/4.3/4.3.2/lib64/R/bin/exec/R --no-echo --no-restore -e' | xargs kill

After that, the code runs again with the lighter version of param_grid.
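For comparison, when a session still controls its workers, the future package can usually shut them down from within R by switching back to a sequential plan. The sketch below only illustrates that pattern with two workers; it is an assumption that it would help in this case, since workers orphaned by a crashed run may be out of the session's reach and still need to be killed from the terminal.

```r
library(future)

# Start a small multisession pool (2 workers, purely for illustration).
plan(multisession, workers = 2)

# Run something on a worker to confirm it is a separate R process.
f <- future(Sys.getpid())
worker_pid <- value(f)

# Switching back to sequential asks future to shut down the pool it manages.
plan(sequential)
n_after <- nbrOfWorkers()
```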

michaelmayer · March 15, 2024, 5:41pm

I am curious why you call parallelly::availableCores() but then fix the number of workers in your plan() at 12?

This is only a side observation, though, and very likely unrelated to your sh: fork: retry: Resource temporarily unavailable error.
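As a sketch of what I mean, the worker count could be derived from the reported core count instead of being hard-coded. choose_workers() is a hypothetical helper written for this post, not part of any package, and the defaults (at most 12 workers, one core left free) are arbitrary assumptions.

```r
# Hypothetical helper: never request more workers than cores are available,
# keep `reserve` cores free, and always return at least 1.
choose_workers <- function(available, requested = 12L, reserve = 1L) {
  available <- as.integer(available)
  if (is.na(available)) available <- 1L
  max(1L, min(as.integer(requested), available - as.integer(reserve)))
}

# Prefer parallelly::availableCores(), which respects scheduler and cgroup
# limits; fall back to base R's parallel::detectCores() if it is not installed.
available <- if (requireNamespace("parallelly", quietly = TRUE)) {
  parallelly::availableCores()
} else {
  parallel::detectCores()
}

n_workers <- choose_workers(available)
print(n_workers)

# Then, instead of a fixed number:
# future::plan(future::multisession, workers = n_workers)
```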

For the fork problem I suspect one of the following:

a setup issue with your R installation (multithreaded BLAS/LAPACK, pthreads issue, ...)
a bug in one of the R packages you are using
ulimit settings that are too restrictive
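Two of these can be probed from R itself. The sketch below is not a confirmed fix: it assumes that the threading libraries involved honor the usual environment variables, and that the sh used by system() supports ulimit -u (true for bash, not necessarily for other shells). Run it in a fresh session before plan() is called, since worker processes started afterwards inherit the environment.

```r
# Cap per-process threading so that N workers do not each spawn many threads.
# Assumption: the OpenMP/BLAS libraries in use read these variables.
Sys.setenv(
  OMP_NUM_THREADS      = "1",
  OPENBLAS_NUM_THREADS = "1",
  MKL_NUM_THREADS      = "1"
)

# Report the per-user process/thread limit as seen by the shell.
# Returns NA if the shell does not support `ulimit -u`.
user_proc_limit <- tryCatch(
  system("ulimit -u", intern = TRUE),
  warning = function(w) NA_character_,
  error   = function(e) NA_character_
)
print(user_proc_limit)
```

If the reported limit is low relative to (number of workers) x (threads per worker), that would be consistent with the "Thread creation failed" and "fork: Resource temporarily unavailable" messages.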

To rule out a bug in one of the R packages, I have created a slightly more complete reprex in this GitHub repository: michaelmayer2/brulee (reprex for Posit Community)

If you clone this repo and then set the working directory of your RStudio session to the main folder of the repo, you should be able to run renv::restore(), and renv will install all R packages precisely as in my test environment. If you then run the code, you should not get any error messages.

davidli · March 24, 2024, 1:00am

Thank you for wrapping the example with reprex. I cloned the GitHub repo. I hit the same problem when running the file test.R in RStudio. However, when I ran it from the terminal via Rscript brulee_parallel/test.R, no errors occurred.

The problem is that it runs very slowly. Even with the small iris dataset, it has been running for over 10 hours and has not returned any results yet.

The weird thing is that I requested 15 workers via plan(multisession, workers = 15), and it seems 15 processes have been launched by the future package (see the screenshot from the htop command). Why are only two processes running (status column marked "R") while the others are sleeping (status "S")?

I guess this is somehow related to the rset object created via rset <- vfold_cv(iris, strata = Species, repeats = 2), but I have no idea how.
