I was curious how I can use multiprocessing with purrr without using furrr.
Indeed, since the latest updates it seems the furrr package does not work as well as it did in the past (in some cases, worse). I think this has nothing to do with the package creator (who is an amazing, god-level coder) but more with the limitations of multiprocessing on some platforms.
Can I use future and purrr in a manual way? Am I completely mistaken here?
Actually, no matter what I try, the parallel part is much slower. I have simplified it as much as possible, but it should give an idea of how future can be used:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(future)
library(purrr)
library(stringr)
library(tictoc)
plan(multicore)
mytib <- tibble(text = rep('hello this is very interesting', times = 1000))
#sequential
tic()
res_sequential <- map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.03 sec elapsed
#multiprocess
tic()
res_parallel <- map(mytib$text, ~future(str_detect(.x, regex('very')))) %>% values()
toc()
#> 4.512 sec elapsed
The reason the manual future() call is so slow is that you are calling it 1000 times. You are giving each element of mytib$text its own future to run in, but your computer probably only has ~8 cores to run them on. So it sends out 8 requests, then has to wait until one finishes before sending out the next one... 1000 times.
future_map() is much smarter. It chunks your input into 8 groups of roughly equal size, and sends out just those 8 chunks.
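To make that concrete, here is a minimal sketch of doing the chunking by hand with plain {future} and purrr, which is roughly what future_map() automates for you; the object names and the use of availableCores() to pick the number of chunks are illustrative, not code from the thread:
library(future)
library(purrr)
library(stringr)
plan(multisession)
texts <- rep('hello this is very interesting', times = 1000)
# one chunk per worker instead of one future per element
n_workers <- availableCores()
chunks <- split(texts, cut(seq_along(texts), n_workers, labels = FALSE))
# send out one future per chunk, so each worker gets a big piece of work
futs <- map(chunks, ~future(str_detect(.x, regex('very'))))
# collect the per-chunk results and flatten back into one logical vector
res_chunked <- unlist(map(futs, value))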
plan(multisession) is safer, and it is what plan(multiprocess) resolves to in {future} when you are on a Mac using RStudio. But it is slower to start up, because it has to copy resources over to the separate R sessions.
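If it helps, here is a small sketch of picking the forked backend only where it is actually available; supportsMulticore() is from {future}, and multicore is not supported on Windows or inside RStudio, in which case this falls back to separate background sessions:
library(future)
# forked (multicore) workers are fast to start but unavailable on Windows / in RStudio
if (supportsMulticore()) {
  plan(multicore)
} else {
  plan(multisession)
}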
That said, always look to vectorization over parallelization where possible. Your original example can be vectorized because str_detect() is vectorized over its input. This run uses the full 1,000,000 repetitions (not the smaller example above with 1,000):
library(stringr)
library(tibble)
library(tictoc)
mytib <- tibble(text = rep('hello this is very interesting', times = 1000000))
tic()
xx <- str_detect(mytib$text, regex("very"))
toc()
#> 0.36 sec elapsed
Ha!! Thank you @davis! I always wondered how future_map chunks the data. Is there any way to control the chunking, then? What if I nest my compute variable into different groups and then call future_map? Something like df %>% group_by(mygroups) %>% nest() %>% mutate(parallel = future_map(data, ~myfunc(.x)))?
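A hedged sketch of one way the chunking can be controlled, via furrr's .options argument; furrr_options() with chunk_size/scheduling is the current furrr API, while df, mygroups, and myfunc are the placeholders from the question above and chunk_size = 100 is just an example value:
library(dplyr)
library(tidyr)
library(furrr)
plan(multisession)
# scheduling = 1 gives one chunk per worker; chunk_size fixes an explicit chunk size instead
opts <- furrr_options(chunk_size = 100)
res <- df %>%
  group_by(mygroups) %>%
  nest() %>%
  mutate(parallel = future_map(data, ~myfunc(.x), .options = opts))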
This is potentially the key point. So what you're saying is that, in order to make it work, I need to run the R script from the terminal directly (that is, without opening RStudio)?
@davis you were right, running with Rscript did some magic. However, could you please tell me how to customize furrr a bit more? For instance, choosing the number of chunks seems important (also, is furrr using all the processors by default? see the sketch below)
Thank you again for all the amazing work you do (slider is a gem)
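For reference, a minimal sketch of the worker-count side of this: the number of processes is set through plan() rather than through furrr itself, and the default is everything that availableCores() reports (workers = 4 is just an example):
library(future)
# how many cores the default plan would use
availableCores()
# cap the number of background R sessions explicitly
plan(multisession, workers = 4)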