Learning to write code in parallel

I am very new to the concept of parallel computing.

Here is my current understanding:

  • Suppose I have a function. I want to run this function 1000 times.
  • Let's say that each time I run this function, it is independent of all the other times I run it.
  • I imagine it this way: if I run this function 1000 times the normal (sequential) way, it's like a bakery with 1000 customers and 10 employees, where all 10 employees work on the same customer and then collectively move on to the second customer.
  • But if I run the code in parallel, the 10 employees will take on the first 10 customers and work independently, potentially saving time.

I have the following R code that performs some random simulations (I can explain if required):

library(tidyverse)

# function
simulate_markov_chain <- function(simulation_num) {
  # Transition matrices
  transition_matrix_A <- matrix(c(1/3, 1/3, 1/3,  # probabilities from state 1
                                  1/3, 1/3, 1/3,  # probabilities from state 2
                                  0,   0,   1),   # probabilities from state 3
                                nrow = 3, byrow = TRUE)

  transition_matrix_B <- matrix(c(1/4, 1/4, 1/4, 1/4,  # probabilities from state 1
                                  1/4, 1/4, 1/4, 1/4,  # probabilities from state 2
                                  0,   0,   1,   0,    # probabilities from state 3
                                  1/4, 1/4, 1/4, 1/4), # probabilities from state 4
                                nrow = 4, byrow = TRUE)

  
  state <- 1
  chain <- "A"

  
  path_df <- data.frame(iteration = 1, chain = chain, state = state)

  
  iteration <- 1
  while (state != 3) {
    # Flip a coin
    coin_flip <- sample(c("heads", "tails"), size = 1, prob = c(0.5, 0.5))
    
   
    if (coin_flip == "heads" || chain == "B") {
      chain <- "B"
      state <- sample(1:4, size = 1, prob = transition_matrix_B[state, ])
    } else {
      state <- sample(1:3, size = 1, prob = transition_matrix_A[state, ])
    }
    
   
    iteration <- iteration + 1
    path_df <- rbind(path_df, data.frame(iteration = iteration, chain = chain, state = state))
  }


  path_df$simulation_num <- simulation_num

  return(path_df)
}

I then ran this function 1000 times - everything works perfectly:

results <- map_dfr(1:1000, simulate_markov_chain)

I am not sure how to run this code in parallel to speed it up.

Here is what I tried - it worked, but I am not sure if this is correct:

library(parallel)

num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, "simulate_markov_chain")
results <- do.call(rbind, parLapply(cl, 1:100000, simulate_markov_chain))
stopCluster(cl)

What is the correct way to run this code in parallel using R?

Hello @swaheera - there is a plethora of options you can choose from to use parallel computing in R. It is very good that you already have a well-written function that only takes one argument (simulation_num).

Before parallelizing, you should also make sure that the function you are parallelizing is optimized enough for speed. You can use tools like profvis to analyze and optimize performance.
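For example, a minimal profiling sketch (assuming profvis is installed; the 100 runs are an arbitrary number, just enough to see where the time goes):

library(profvis)

# Profile a modest number of simulations to see where time is spent
# (the repeated rbind() inside the while loop is a likely hotspot).
profvis({
  purrr::map(1:100, simulate_markov_chain)
})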

Below I have modified your map_dfr call to make it more compatible with the latest purrr standards and then parallelized your code using doMC as well as future_map. It will auto-detect the available cores on your local server and use them all. On my little 4-core server the (modified) map_dfr call takes 28 seconds, doMC takes 10.9 seconds and future_map takes 9.6 seconds. As you can see, using 4 cores will not give you 4 times the speed, but it is still more than a 3x speedup. If you wanted to go beyond a single server, you could use infrastructure such as an HPC cluster, where you would then need to deal with distributed computing, which involves other packages (doMC ==> doMPI and multisession ==> cluster futures).

Please also note the use of various options for random number generation to ensure reproducibility of your results.
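As an aside, if you prefer to stay with the base parallel approach from your question, a minimal sketch for reproducible results there (assuming the parallel package; clusterSetRNGStream gives each worker its own L'Ecuyer RNG stream) could look like this:

library(parallel)

cl <- makeCluster(parallelly::availableCores())
clusterSetRNGStream(cl, iseed = 12345)   # independent, reproducible RNG streams per worker
clusterExport(cl, "simulate_markov_chain")
results <- do.call(rbind, parLapply(cl, 1:1000, simulate_markov_chain))
stopCluster(cl)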

Lastly - the methods chosen here are only a small subset of what is possible for parallelizing code in R. There are additional packages such as clustermq, batchtools and crew that can also be very helpful when parallelizing.

library(purrr)   # map() and list_rbind(); also attached via library(tidyverse) above

sims <- 10000
seed <- 12345

# single-core approach
set.seed(seed)
system.time(
  result1 <- 1:sims |> map(simulate_markov_chain) |> list_rbind()
)


# doMC
library(doMC)
library(doRNG)   # %dorng% for reproducible parallel random numbers
registerDoMC(parallelly::availableCores())
set.seed(seed)
system.time(
  result2 <- foreach(i = 1:sims, .combine = rbind) %dorng% {
    simulate_markov_chain(i)
  }
)

# parallel futures
library(furrr)

plan(multisession, workers = parallelly::availableCores())
set.seed(seed)
system.time(
  result3 <- 1:sims |>
    future_map(simulate_markov_chain, .options = furrr_options(seed = TRUE)) |>
    list_rbind()
)

@michaelmayer : thank you so much for your wonderful answer!

@michaelmayer : in general, what are your favorite ways to parallelize code? You mentioned packages such as clustermq, batchtools and crew... do these offer any advantages compared to the approach you outlined?

Thank you so much!

This is a rather difficult question to answer and it really depends on your personal preference and style of coding.

If you are heavily utilising the tidyverse ecosystem, then furrr/purrr IMHO are a rather natural way to parallelize code, given that the code changes are fairly minimal, as you can see. Since furrr uses the future package, the utility of the parallelisation heavily depends on the desired level of parallelisation. As long as you only want to use the cores on a given server, furrr is perfectly fine using the multisession backend and will yield fairly efficient parallelisation. When it comes to multi-node parallelisation (i.e. the 1000 simulations being distributed across a number of servers, see the sketch below) I sometimes find the cluster backend of the future package a bit slow, especially when it comes to launching the parallel processes. There is a future.batchtools interface for distributed computing, but this has the disadvantage of relying heavily on local storage and hence may lead to non-ideal performance.
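For illustration, a minimal multi-node sketch with the cluster backend (the host names node1/node2 are hypothetical and assume passwordless SSH access to those machines):

library(furrr)

# One remote R session is launched per listed host name
plan(cluster, workers = c("node1", "node1", "node2", "node2"))

results <- 1:1000 |>
  future_map(simulate_markov_chain, .options = furrr_options(seed = TRUE)) |>
  purrr::list_rbind()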

The same argument applies to batchtools as such - it is a great tool for resilient parallel computing, as it allows you to rerun specific indices of your main simulation if they fail for some reason and you don't want to rerun the whole simulation. The downside is that batchtools relies heavily on disk storage to create a registry, which can make things rather slow. clustermq and crew/crew.cluster on the other hand do everything in-memory or in-transit on the network and do not need any local storage. They have much less overhead but are also a bit more volatile when it comes to rerunning failed tasks. (cf. Evaluate Function Calls on HPC Schedulers (LSF, SGE, SLURM, PBS/Torque) • clustermq, which compares clustermq with batchtools with respect to overhead.)
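To give you an idea, a minimal local clustermq sketch (assuming clustermq is installed; the "multiprocess" scheduler runs the workers on your own machine):

library(clustermq)

options(clustermq.scheduler = "multiprocess")   # local workers instead of an HPC scheduler

results <- Q(simulate_markov_chain,
             simulation_num = 1:1000,   # iterated argument, one value per call
             n_jobs = 4) |>
  dplyr::bind_rows()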

A highly desirable feature of the packages crew/crew.cluster, batchtools and clustermq is that they use so-called templates that keep the backend and its details separate from the actual R code. This makes it fairly straightforward to develop code on a local laptop and then more or less easily migrate it to an HPC cluster, where you can run your code distributed across hundreds or more servers. This migration typically only involves swapping out the templates and does not require any changes to the R code.
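In clustermq, for example, that swap is just a change of options (a sketch; slurm.tmpl is a hypothetical template file you would write for your cluster), while the Q() call itself stays untouched:

# local development
options(clustermq.scheduler = "multiprocess")

# on the HPC cluster: same R code, different scheduler and template
options(clustermq.scheduler = "slurm",
        clustermq.template  = "slurm.tmpl")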

Like with any technology, it is important to be aware of it and understand it, so that once the need arises you can use it to your advantage in the best possible way.

As you probably can see, the topic of parallelization is a very wide field and we can only scratch the surface of it in this exchange, I am afraid. I certainly encourage you to explore the mentioned packages as you see fit and use whatever fits your style of code development, the infrastructure available and so on...

Lastly, and specifically to your actual question - I am a bit of an old-fashioned guy and sometimes still struggle with the tidyverse - I started my journey with R parallelisation many years ago with BatchJobs, then batchtools, but soon converted to clustermq. As of late I am very much intrigued by the crew.cluster package that uses the latest generation of highly efficient communication frameworks (i.e. NanoNext).

But the future ecosystem is definitely a great one, too, and it integrates very nicely into the tidyverse world. And there are ways to have nested parallelism with future that can unblock you in many ways.
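A minimal sketch of such a nested topology (assuming the future package; here two outer multisession workers, each allowed two inner workers - the worker counts are arbitrary examples):

library(future)

# Outer level: 2 parallel R sessions; inner level: 2 workers within each of them.
# Without an explicit topology, nested futures fall back to sequential execution.
plan(list(tweak(multisession, workers = 2),
          tweak(multisession, workers = 2)))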

PS: If you want to learn more about performance tuning with R, including parallelisation, I can recommend browsing the mentioned packages' homepages or alternatively taking a look at the material that a colleague and I produced for a couple of workshops ==> Go fastR: High Performance Computing with R

