I was curious how I can use multiprocessing with purrr without using furrr.
Since the latest updates, the furrr package does not seem to work as well as it did in the past (in some cases it is worse). I think this has nothing to do with the package's creator (who is an amazing, god-level coder) but rather with the limitations of multiprocessing on some platforms.
Can I use future and purrr in a manual way? Am I completely mistaken here?
Actually, no matter what I try, the parallel part is much slower. I tried to simplify the example as much as possible, but it should give an idea of how future can be used:
library(dplyr)
library(purrr)
library(stringr)
library(future)
library(tictoc)

mytib <- tibble(text = rep('hello this is very interesting', times = 1000))

tic()
res_sequential <- map(mytib$text, ~str_detect(.x, regex("very")))
toc()
#> 0.03 sec elapsed

tic()
res_parallel <- map(mytib$text, ~future(str_detect(.x, regex('very')))) %>% values()
toc()
#> 4.512 sec elapsed
The manual future() call is so slow because you are calling it 1000 times. You are giving each element of mytib$text its own session to run in, but your computer probably only has ~8 cores to run on. So it sends out 8 requests, then has to wait until one finishes before sending out the next one, and it keeps doing that until all 1000 are done.
future_map() is much smarter. It chunks your input into 8 groups of roughly equal size, and sends out just those 8 chunks.
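To make the chunking idea concrete, here is a minimal sketch of doing it by hand with future and purrr: one future per chunk instead of one per element. This is not how furrr is implemented internally. The helper name detect_very_chunked, the one-chunk-per-worker default, and plan(multisession, workers = 2) are all assumptions made for illustration.

```r
library(future)
library(purrr)
library(stringr)

# Assumption: two background R sessions; adjust `workers` to your machine.
plan(multisession, workers = 2)

text <- rep("hello this is very interesting", times = 1000)

# Hypothetical helper: split the input into chunks and launch one future per chunk.
detect_very_chunked <- function(x, n_chunks = nbrOfWorkers()) {
  # Assign each element to one of `n_chunks` contiguous groups (1, 1, ..., 2, 2, ...).
  chunk_id <- ceiling(seq_along(x) * n_chunks / length(x))
  chunks <- split(x, chunk_id)

  # One future per chunk; each worker runs the vectorized str_detect() on its chunk.
  futures <- map(chunks, function(chunk) {
    future(stringr::str_detect(chunk, stringr::regex("very")))
  })

  # Collect the results in chunk order and flatten to a single logical vector.
  unlist(map(futures, value), use.names = FALSE)
}

res_chunked <- detect_very_chunked(text)

# Return to sequential processing and shut down the background sessions.
plan(sequential)
```

With 2 workers this creates 2 futures rather than 1000, so the per-future overhead is paid only twice.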
plan(multisession) is safer than forked processing (multicore), and it is what {future} falls back to when you use plan(multiprocess) on a Mac inside RStudio. But it is slower to start up because it has to copy resources over to the different R sessions.
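If you would rather not depend on what plan(multiprocess) resolves to, you can check the environment and pick the backend yourself. A small sketch using availableCores() and supportsMulticore() from {future}; the if/else fallback is only an illustration, not a recommendation for every setup:

```r
library(future)

# Number of cores R is allowed to use on this machine.
n_cores <- availableCores()

# Whether forked processing is supported here. This is FALSE on Windows,
# and by default also when running inside RStudio.
can_fork <- supportsMulticore()

# Choose the backend explicitly instead of relying on plan(multiprocess).
if (can_fork) {
  plan(multicore)     # forked processes: quick to start, not available everywhere
} else {
  plan(multisession)  # separate R sessions: slower to start, works everywhere
}

# Reset to the default when finished.
plan(sequential)
```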
That said, always look to vectorization over parallelization where possible. Your original example can be vectorized because str_detect() is vectorized over its input. The timing below uses the full 1,000,000 elements (not the smaller 1000-element example above):
mytib <- tibble(text = rep('hello this is very interesting', times = 1000000))
xx <- str_detect(mytib$text, regex("very"))
#> 0.36 sec elapsed
Ha!! Thank you @davis! I always wondered how future_map chunks the data. Is there any way to control the chunking then? What if I nest my compute variable into different groups and then call future_map? Something like df %>% group_by(mygroups) %>% nest() %>% mutate(parallel = future_map(data, ~myfunc(.x)))
This is potentially the key point. So what you are saying is that, in order to make it work, I need to run the R script directly from the terminal (that is, without opening RStudio)?
@davis you were right. Running with Rscript did some magic. However, could you please tell me how to customize furrr a bit more? For instance, choosing the number of chunks seems important (also, is furrr using all the processors by default?)
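On the chunking and worker questions, a hedged sketch of the knobs I am aware of, assuming furrr >= 0.2.0 (where furrr_options() is available). The worker count of 4 and the chunk size of 100 are arbitrary example values. As far as I know, plan(multisession) without a workers argument uses availableCores(), so by default all available cores are used.

```r
library(future)
library(furrr)

# Limit the number of workers instead of using all available cores.
# Assumption: 4 is just an example value.
plan(multisession, workers = 4)

text <- rep("hello this is very interesting", times = 1000)
detect_very <- function(x) stringr::str_detect(x, stringr::regex("very"))

# Default scheduling (scheduling = 1): one chunk per worker.
res_default <- future_map_lgl(text, detect_very)

# scheduling = 2: two chunks per worker, which can help when some
# elements take much longer than others.
res_sched <- future_map_lgl(
  text, detect_very,
  .options = furrr_options(scheduling = 2)
)

# chunk_size = 100: fix the number of elements per chunk directly
# (10 chunks for 1000 elements).
res_chunk <- future_map_lgl(
  text, detect_very,
  .options = furrr_options(chunk_size = 100)
)

# Return to sequential processing and shut down the workers.
plan(sequential)
```

For the nested data frame idea, each row of the nested `data` column is one element from furrr's point of view, so the same options would apply to the number of groups rather than the number of rows inside them.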
