Parallelization of several model workflows contained in a tibble (tidymodels)

DataMirai · January 2, 2025, 8:05pm

I am working on the specification and calibration of several models. To keep track of every model's process, I have stacked all of them in a single tibble like this:

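library(tidymodels)  # tibble, dplyr, purrr, recipes, parsnip, rsample, dials, tune, workflows
library(future)      # plan()
library(furrr)       # future_pmap()

# Main label of the model and data for each situation -----

Tibble_Data_models <- tibble(
  Species = unique(iris$Species),
  data    = iris %>% group_split(Species)
)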

Tibble_Data_models

For each data set, I have created its respective recipe and corresponding model.

# Recipes ----

Recipe_creation <- function(data) {
  recipe(Sepal.Length ~ Sepal.Width, data = data) %>%
    # Only filter predictors, so the outcome is never a removal candidate
    step_corr(all_numeric_predictors(), threshold = 0.99)
}

Tibble_Data_models <- Tibble_Data_models %>% 
  mutate('Recipe' = map(.x = data, .f = Recipe_creation))

# model creation -----

Model_RF <- rand_forest(
  mode = 'regression',  # Sepal.Length is numeric, so 'classification' would fail
  engine = 'ranger',
  mtry = 50,
  trees = tune(),
  min_n = tune()
)

Tibble_Data_models <- Tibble_Data_models %>%
  mutate(Model_SPEC_full = rep(list(Model_RF), n()))

I also added the resamples and the corresponding hyperparameter grid for each model configuration:

Tibble_Data_models <- Tibble_Data_models %>%
  # Each subset holds a single species, so there is nothing to stratify on
  mutate(Resamples = map(data, ~ vfold_cv(.x, v = 10)))

Tibble_Data_models <- Tibble_Data_models %>%
  mutate(Grid_HP = map(
    Model_SPEC_full,
    ~ grid_max_entropy(extract_parameter_set_dials(.x), size = 100)
  ))

Tibble_Data_models

To fit the models, I wrapped the tuning function in purrr::safely() so that I do not get stuck if a model calibration fails.

F_safely_tune_grid <- purrr::safely(
 function(recipe, Model_SPEC_full, grid, resamples, seed = 1234567890, parallel_over = 'everything' ) {
   set.seed(seed)
   workflow(preprocessor = recipe, spec = Model_SPEC_full) %>% 
     tune_grid(
       grid = grid,
       resamples = resamples,
       control = control_grid(save_pred = FALSE, parallel_over = parallel_over))
})
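
For reference, a minimal sketch of what purrr::safely() returns, using log() as a stand-in for the tuning function (safe_log, ok and bad are illustrative names only). Each call gives back a list with a result and an error element, and one of the two is NULL:

```r
library(purrr)

# Wrap a function so that errors are captured instead of stopping the run
safe_log <- safely(log)

ok  <- safe_log(10)   # succeeds: value in $result, $error is NULL
bad <- safe_log("a")  # fails: $result is NULL, condition object in $error

is.null(ok$error)
is.null(bad$result)
conditionMessage(bad$error)
```

With the tibble above, this means every element of the HP_search column would be such a list, so failures have to be looked up in the error element rather than showing up as a stopped run.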

So here is the catch. When I try to parallelize this, I do it like this:

plan(strategy = multisession, workers = 10)

Tibble_Data_models <- Tibble_Data_models %>%
  mutate(HP_search = furrr::future_pmap(
    list(Recipe, Model_SPEC_full, Grid_HP, Resamples),
    F_safely_tune_grid
    #,seed = 1234567890,
    #parallel_over = 'everything'
  ))

But where do these workers go: to future_pmap() or to F_safely_tune_grid()? I would like the workers to be used inside tune_grid(), to allow a wide hyperparameter search, rather than across the list passed to future_pmap(). Am I parallelizing this whole workflow correctly?
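
One way to probe this, as a sketch rather than a definitive answer: the future documentation states that nested futures are resolved sequentially unless a nested plan is declared. The snippet below asks a worker how many workers it would have available for futures created inside it. The worker count of 2 and the variable names are arbitrary illustrations, and whether tune_grid() dispatches through future at all depends on the installed tune version, which I am assuming here.

```r
library(future)

plan(multisession, workers = 2)

# Workers available to the main session (the level future_pmap() runs at)
workers_outer <- nbrOfWorkers()

# Workers available to code running *inside* one of those workers
workers_inner <- value(future(nbrOfWorkers()))

plan(sequential)  # release the background R sessions

c(outer = workers_outer, inner = workers_inner)
```

If the inner count turns out smaller than expected, future also accepts a list of strategies in plan(), one per nesting level, which is the documented route for splitting workers between an outer loop and the code inside it.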

Also, is there any way to check whether any parallelization is actually happening, aside from microbenchmarking the functions?
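
A simple check that does not rely on timing, sketched under the assumption of a multisession plan: have each task return the ID of the R process that ran it. The task count, the sleep and the name pids are placeholders for illustration.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# Each task reports the PID of the R process it ran in
pids <- future_map_int(1:4, function(i) {
  Sys.sleep(0.5)  # stand-in for real work
  Sys.getpid()
})

plan(sequential)  # release the background R sessions

unique(pids)            # several distinct PIDs suggest several worker processes
Sys.getpid() %in% pids  # whether any task ran in the main session
```

The same idea could be applied to the real call by returning Sys.getpid() alongside the tune_grid() result from the wrapped function.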

I consulted several posts, especially this post by SimonPCouch, but the doMC package (library(doMC)) no longer seems to exist on CRAN.
