I am working on the calibration and specification of several models. To keep track of every model's process, I have stacked all of them in a single tibble like this:
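# Packages used throughout ----
library(tidymodels) # attaches dplyr, purrr, tibble, recipes, parsnip, rsample, tune, dials, workflows
library(furrr)      # also attaches future, which provides plan()
# Main label of the model and data for each situation -----
Tibble_Data_models <- tibble(
  Species = unique(iris$Species),
  data = iris %>% group_split(Species)
)
Tibble_Data_models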
For each data set, I have created its respective recipe and the corresponding model.
# Recipes ----
Recipe_creation <- function(data) {
  recipe(formula = Sepal.Length ~ Sepal.Width, data = data) %>%
    # drop highly correlated predictors (the outcome is left untouched)
    step_corr(all_numeric_predictors(), threshold = 0.99)
}
Tibble_Data_models <- Tibble_Data_models %>%
mutate('Recipe' = map(.x = data, .f = Recipe_creation))
# model creation -----
Model_RF <- rand_forest(
  mode = 'regression', # the outcome Sepal.Length is numeric
  engine = 'ranger',
  mtry = 50,
  trees = tune(),
  min_n = tune()
)
Tibble_Data_models <- Tibble_Data_models %>%
mutate('Model_SPEC_full' = rep( list(Model_RF), dim(.)[1] ))
I also added the resamples and the corresponding hyperparameter grid for each model configuration:
Tibble_Data_models <- Tibble_Data_models %>%
mutate('Resamples'= map(data, ~ vfold_cv(..1, v = 10,strata = 'Species')))
Tibble_Data_models <- Tibble_Data_models %>%
mutate('Grid_HP'= map(Model_SPEC_full , ~ grid_max_entropy( x = extract_parameter_set_dials(..1), size = 100 ) ))
Tibble_Data_models
To fit the models, I created a function wrapped in safely, so that I do not get stuck if a model calibration fails.
F_safely_tune_grid <- purrr::safely(
function(recipe, Model_SPEC_full, grid, resamples, seed = 1234567890, parallel_over = 'everything' ) {
set.seed(seed)
workflow(preprocessor = recipe, spec = Model_SPEC_full) %>%
tune_grid(
grid = grid,
resamples = resamples,
control = control_grid(save_pred = FALSE, parallel_over = parallel_over))
})
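For reference, here is a minimal sketch of what a function wrapped in purrr::safely returns, using log as a stand-in for the tuning function. The names safe_log, ok and bad are only illustrative and are not part of the pipeline above.

```r
library(purrr)

# Wrap a function so that errors are captured instead of stopping the run
safe_log <- safely(log)

ok  <- safe_log(10)   # succeeds: value in $result, $error is NULL
bad <- safe_log("a")  # fails: $result is NULL, condition in $error

is.null(ok$error)
is.null(bad$result)
conditionMessage(bad$error)
```

So each element of the HP_search column would be a list with a result and an error slot, and the failed configurations are the ones whose error slot is not NULL.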
So here is the catch. When I try to parallelize this, I do it like this:
plan( strategy = multisession, workers = 10)
Tibble_Data_models <- Tibble_Data_models %>%
  mutate('HP_search' = furrr::future_pmap(
    # unnamed list: elements are matched to the function arguments by position
    list(Recipe, Model_SPEC_full, Grid_HP, Resamples),
    F_safely_tune_grid
    #,seed = 1234567890,
    #parallel_over = 'everything'
  ))
But where do these workers go: to future_pmap or to F_safely_tune_grid? The workers should go inside the function, to allow a wide hyperparameter search, instead of being spread over the list in future_pmap. Am I parallelizing this whole workflow correctly?
Also, is there any way to check whether any parallelization is actually happening, aside from microbenchmarking the functions?
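One cheap check I am aware of, sketched under the assumption that the future and furrr packages are installed, is to ask each task for its process ID and compare it with the main session. The worker count of 2 and the names main_pid, worker_pids and inner_workers are only for illustration.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# Number of workers the main session sees under the current plan
nbrOfWorkers()

# Process ID of the main R session
main_pid <- Sys.getpid()

# Process ID reported by each task; multisession tasks run in separate R processes
worker_pids <- future_map_int(1:4, ~ Sys.getpid())
unique(worker_pids)

# Number of workers that each task itself sees for any nested parallel call
inner_workers <- future_map_int(1:2, ~ as.integer(future::nbrOfWorkers()))
inner_workers

# Shut the background workers down again
plan(sequential)
```

If worker_pids differ from main_pid, the tasks ran in separate R processes. inner_workers hints at whether a nested call, such as tune_grid inside the mapped function, has any workers left to use.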
I consulted several posts, especially this post by SimonPCouch, but the doMC library (library(doMC)) no longer seems to exist on CRAN.