Parallelise in the tidyverse

joel · October 6, 2017, 10:02am

Hello,

I loved using multidplyr with do() to evaluate models on, e.g., subsamples of a dataset. I know there isn't yet an established way to do this in the map framework, but are there good practices I'm not aware of for parallelising model evaluation?

eric_bickel · October 6, 2017, 11:48am

You can map a modeling function against splits of a dataframe in parallel using foreach and %dopar% - it is extremely efficient in my experience!
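A minimal sketch of that foreach approach, in case it helps. The doParallel backend, the two local workers, the built-in mtcars data and the lm() model are all assumptions chosen for illustration, not something taken from a real project.

```r
library(foreach)
library(doParallel)

# Assumption: two local worker processes are enough for a demo
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Split a data frame into one piece per group (mtcars stands in for real data)
splits <- split(mtcars, mtcars$cyl)

# Fit one model per split; iterations are handed out to the workers
fits <- foreach(d = splits) %dopar% {
  lm(mpg ~ wt, data = d)
}
names(fits) <- names(splits)

parallel::stopCluster(cl)

# One coefficient vector per group
lapply(fits, coef)
```

For models this small the cost of starting workers will likely outweigh the gain; the pattern pays off when each fit is slow.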

davis · October 6, 2017, 11:49am

I think this is a great question. I've also used multidplyr to work with multiple models, and have even dug into the code to parallelize a function or two of my own. There are rumors of purrr eventually supporting parallelization natively. See the GitHub issue here, and a tweet here.

Recently I came across the combination of the future package with purrr, and I thought it was pretty neat. See that here.

davis · October 6, 2017, 1:30pm

For anyone who is interested in another example of purrr parallelization with the future package besides the one in the tweet, here is a silly random forest example using the weather data set from nycflights13. It's just meant to show the time difference between the two approaches, and that the parallelization actually works. The elapsed time is the important one.

Note that there is some overhead in parallelization, so spreading the 3 models over 3 cores is not exactly 3 times as fast!

library(future)
library(tidyverse)
library(nycflights13)

weather_nest <- weather %>%
  group_by(origin) %>%
  nest()

# Silly random forest model
weather_model <- function(data) {
  randomForest::randomForest(temp ~ dewp + humid + precip, data = data, na.action = na.omit)
}

# Test 1
  t1 <- proc.time()  
  
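  # multiprocess picks multicore where forking is supported (Mac, Linux)
  # and multisession otherwise (Windows). Newer releases of future deprecate
  # multiprocess in favour of naming multisession or multicore directly.
  plan(multiprocess)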
  
  # This returns instantly and begins running the models.
  # If you ran just this you would still be able to control your R
  # session and run other code. It is "non-blocking" because the computation
  # is being done somewhere else. On my Mac, I can open Activity Monitor 
  # and see that rsession is listed 4 times. Once for this session and 3 other
  # times for the 3 other cores (one per model)
  weather_nest_future <- mutate(weather_nest, 
                                weather_future = map(data, ~ future(weather_model(.x))))
  
  # Once we run this, we "block" the R session that we are in, because we are
  # waiting for values() to return the results of the random forest.
  # Note that values() comes from the future package, not randomForest.
  mutate(weather_nest_future, weather_value = values(weather_future))
  
  proc.time() - t1
  
  #  user  system elapsed 
  # 10.769   0.987   4.145 

# Test 2
  t2 <- proc.time()
  
  # This runs them normally, in sequence
  mutate(weather_nest, 
         weather_model_sequential = map(data, ~weather_model(.x)))  
  
  proc.time() - t2
  
  #   user  system elapsed 
  #  8.261   0.399   8.667
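The blocking vs. non-blocking behaviour described in the comments above can be seen in isolation with a toy function. This is only a sketch: slow_square() is a hypothetical stand-in for a slow model fit, and plan(multisession) with two workers is an assumption made so the example behaves the same on any platform.

```r
library(future)

# Assumption: two background R sessions
plan(multisession, workers = 2)

# Hypothetical slow task standing in for a model fit
slow_square <- function(x) {
  Sys.sleep(1)
  x^2
}

# Creating the futures returns right away; the work runs in the workers
fs <- lapply(1:2, function(i) future(slow_square(i)))

# resolved() asks whether a future has finished, without waiting for it
resolved(fs[[1]])

# value() blocks until the corresponding result is ready
results <- lapply(fs, value)

# Return to ordinary sequential evaluation and release the workers
plan(sequential)
```

Note that creating more futures than there are free workers will itself block until a worker becomes available.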

jepusto · October 6, 2017, 2:49pm

I think another possibility is to wrap map functions in do. See this GitHub issue for an example of using purrrlyr::invoke_rows on a partitioned data frame.

eric_bickel · October 6, 2017, 3:14pm

purrrlyr?? I'm about to dive into a rabbit hole.

thanks for this!

mara · October 6, 2017, 4:48pm

The tweet links back to a post by Henrik Bengtsson, The Many-Faced Future, which covers a few implementations pretty well.

joel · October 7, 2017, 12:59am

Thanks, Mara, for the reference. I was aware of future and even used it once or twice, but was unaware of the furrr combo, just awesome!
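For reference, a minimal sketch of what the future + purrr combination can look like through the furrr package on CRAN, whose future_map() mirrors purrr::map(). The mtcars data, the lm() model and the two multisession workers are assumptions for illustration only.

```r
library(future)
library(furrr)

# Assumption: two background R sessions
plan(multisession, workers = 2)

# One data frame per group (mtcars stands in for real data)
splits <- split(mtcars, mtcars$cyl)

# Same call shape as purrr::map(), but each element may run on a worker
fits <- future_map(splits, ~ lm(mpg ~ wt, data = .x))

# Return to ordinary sequential evaluation
plan(sequential)

# One coefficient vector per group
lapply(fits, coef)
```

Switching plan(multisession) to plan(sequential) before the future_map() call runs the same code in a single session, which is handy for debugging.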