This is a bit of a general question, so I apologize in advance for that; it's more of a discussion than a specific problem. I'm also not sure whether my approach really counts as tidymodels, since I'm copying my code block from an online course ('Machine Learning in the Tidyverse'), so apologies too if I've mis-tagged this post.
I'm working my way through a course on using tidymodels and have learned how to split a training data frame into folds and then iterate over each fold using map(). This includes fitting a model to each fold, e.g.
library(rsample)
library(dplyr)   # mutate, %>%
library(purrr)   # map
library(pscl)    # hurdle and zero-inflated models

set.seed(42)

# initial train/test split
pdata_split   <- initial_split(pdata, prop = 0.9)
training_data <- training(pdata_split)
testing_data  <- testing(pdata_split)

# cross-validation folds
train_cv <- vfold_cv(training_data, v = 5, strata = spend_30d) %>%
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         test  = map(splits, ~testing(.x))) %>%
  # fit a hurdle model to the training data of each fold
  mutate(hurdle_negbin = map(train,
    ~hurdle(formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook + spend_7d |
              d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook + spend_7d,
            data = .x,
            dist = "negbin")))
This runs sequentially, but it takes a bit of time. Since I have 4 processors on my EC2 server, I tried running it in parallel using the furrr package. The code changes needed are tiny: load library(furrr), add plan(multiprocess) above the code block I want to run in parallel, and then change map(train, ~hurdle(...)) to future_map(train, ~hurdle(...)).
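Concretely, the parallel version looks roughly like this (same folds and formula as above; the only real change is future_map() in place of map(), and note that newer versions of the future package treat plan(multiprocess) as deprecated in favor of plan(multisession)):

library(furrr)

# plan(multiprocess) is what I used; on newer versions of the future package
# it is deprecated and plan(multisession) is the equivalent
plan(multiprocess)

train_cv <- vfold_cv(training_data, v = 5, strata = spend_30d) %>%
  mutate(train = map(splits, ~training(.x)),
         test  = map(splits, ~testing(.x))) %>%
  # the only change: future_map() instead of map()
  mutate(hurdle_negbin = future_map(train,
    ~hurdle(formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook + spend_7d |
              d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook + spend_7d,
            data = .x,
            dist = "negbin")))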
When I do this the code runs for a bit, and I can see all my processors light up in the terminal (watching top, usage climbs above 1 on each).
But then I get errors back, and it looks like R is running out of memory. The training data frame alone is about 484 MB (size check below).
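This is how I checked the size (the format() call is just there for readable units):

object.size(training_data)
#> 483929064 bytes

format(object.size(training_data), units = "Mb")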
After some searching I found references to the bigmemory package, but it works with matrices. I guess I could convert my data to a matrix, but even if I did, would that be a feasible approach? Is there a conventional approach?
If I could fit multiple models in parallel, in this case for each fold, without running out of memory, it's the kind of golden code block I'd keep saved and use again and again.
In the past, when working with parallel processing, I've used the foreach package, but (also because of memory issues) I had to save each iteration to an .rds file and then end each loop block with return(1) so that R would drop the large object it had just fitted and saved to .rds.
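That old pattern looked something like this (a rough sketch from memory; fit_one_fold() and data_list are placeholders here, not the hurdle code above):

library(foreach)
library(doParallel)

registerDoParallel(cores = 4)

out_dir <- "model_fits"   # placeholder output directory
dir.create(out_dir, showWarnings = FALSE)

results <- foreach(i = seq_along(data_list)) %dopar% {
  # fit_one_fold() stands in for whatever model call I was running at the time
  fit <- fit_one_fold(data_list[[i]])
  saveRDS(fit, file.path(out_dir, paste0("fit_", i, ".rds")))
  # return something tiny so the large fitted object isn't kept in the result list
  return(1)
}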
In the code block above, is there anything I can do to limit memory usage on each iteration within future_map()? Is bigmemory worth a shot? If so, would it just be a case of changing training_data to big.matrix(training_data)?
What are some other paths worth exploring, if any, to fit models in parallel while minimizing the risk of running out of memory?