Parallel Processing issue with tune - variables not getting passed to workers

Hello,

I have run into this issue before when imputing with a recipe under parallel processing. I don't think I need to show the whole notebook to explain it.

I have the following recipe:


vars_median <- c("WarehouseToHome", "HourSpendOnApp",
                 "OrderAmountHikeFromlastYear", "CouponUsed")

vars_linear <- c("Tenure", "OrderCount", "DaySinceLastOrder")

varstoimputewith <- names(Echurn_train[5:20])


churn_recipe <- Echurn_train %>%
  recipe(Churn ~ .) %>%
  step_rm(CustomerID) %>%
  step_impute_median(all_of(vars_median)) %>%
  step_impute_linear(all_of(vars_linear),
                     impute_with = varstoimputewith) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_mutate_at(all_logical_predictors(), fn = as.factor) %>%
  step_dummy(all_factor_predictors())

But when tuning a model (like LASSO or KNN) using parallel processing, I get an error because the variables `vars_median` and `vars_linear` are not passed to the workers.

I have previously worked around this (based, most likely, on some Google searches over a year ago) using:


clusterEvalQ(cl, {
  vars_median <- c("WarehouseToHome", "HourSpendOnApp",
                   "OrderAmountHikeFromlastYear", "CouponUsed")
  vars_linear <- c("Tenure", "OrderCount", "DaySinceLastOrder")
})

But now I am getting a warning that I should be using the future package instead, and I am not aware of a way to directly pass variables to the workers with future.

I can still work around this by hard-coding the variable names directly in the recipe (it's just messier), like this:


churn_recipe <- Echurn_train %>%
  recipe(Churn ~ .) %>%
  step_rm(CustomerID) %>%
  step_impute_median(all_of(c("WarehouseToHome", "HourSpendOnApp",
                              "OrderAmountHikeFromlastYear", "CouponUsed"))) %>%
  step_impute_linear(all_of(c("Tenure", "OrderCount", "DaySinceLastOrder")),
                     impute_with = varstoimputewith) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_mutate_at(all_logical_predictors(), fn = as.factor) %>%
  step_dummy(all_factor_predictors())


I am wondering if there is an approach here I am not considering, or if a fix is on the way in the tune package?

Note, by the way, that for whatever reason `varstoimputewith` is passed along as expected without an issue.

Thanks in advance.

Generally, it is best to avoid referencing global data in functions that will be run in parallel.

The best solution is to use quasi-quotation to splice in the data values (and not their references). This is described in more detail in ?selections, along with a few other examples.
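For example, splicing with `!!` in the recipe from the question would look roughly like this (the `!!` forces `vars_median` and `vars_linear` to be evaluated when the recipe is defined, so the recipe stores the values themselves rather than references to globals that the workers can't see):

```r
# Splice the *values* of vars_median / vars_linear into the recipe at
# definition time with `!!`; the recipe then carries no references to
# the global variables, so parallel workers don't need them exported.
churn_recipe <- Echurn_train %>%
  recipe(Churn ~ .) %>%
  step_rm(CustomerID) %>%
  step_impute_median(all_of(!!vars_median)) %>%
  step_impute_linear(all_of(!!vars_linear),
                     impute_with = varstoimputewith) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_mutate_at(all_logical_predictors(), fn = as.factor) %>%
  step_dummy(all_factor_predictors())
```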

You might be seeing the issue now since we are moving over to a full future implementation (blog posts 1 and 2). It also doesn't help that different parallel processing methods treat global data differently: forking methods automatically inherit it, but PSOCK is very stringent.
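With the future framework, the backend is declared up front with `plan()` rather than by building a cluster yourself; a minimal sketch (the worker count is illustrative):

```r
library(future)

# Declare the parallel backend before tuning; recent tune versions
# pick up the registered plan automatically.
plan(multisession, workers = 4)  # separate R sessions (PSOCK-like):
                                 # globals must be discoverable or spliced in
# plan(multicore, workers = 4)   # forked workers (Unix/macOS only):
                                 # they inherit globals automatically

# ...then build the recipe/workflow and call tune_grid() as usual.
```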


Thanks, just adding in a couple of !! in the recipe worked like a charm.
