One of the easiest ways to parallelise code like this is with the {furrr} package, which combines the parallelisation implemented by the {future} package with the tidy-ness of the {purrr} package.
Here's how to do it:
Re-write that for loop as a single function, let's call it run_file(), which takes a single argument, file, which is the file name. You should be able to use most of the same code apart from the line which assigns fre3 to listofDataFrames; instead, you should return fre3 as the outcome of your function. It's also tricky to output the message, since parallelisation means these operations are happening at the same time and messages can get garbled. You'd also need to return something even when an error is hit.
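The "return something even when an error is hit" part can be sketched in base R as follows. The names risky_step() and safe_row() are hypothetical stand-ins for the real per-file work, not anything from your code. The point is that tryCatch() returns the value of whichever branch ran, so the function always hands back a one-row data frame.

```r
# Hypothetical stand-in for the real per-file work: fails on negative input.
risky_step <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# Always returns a one-row data frame, with NA when the work fails.
safe_row <- function(x) {
  tryCatch(
    data.frame(input = x, result = risky_step(x)),
    # The handler's return value becomes the value of tryCatch()
    error = function(e) data.frame(input = x, result = NA_real_)
  )
}

ok  <- safe_row(4)    # normal case
bad <- safe_row(-1)   # error case: a row with NA rather than a stopped loop
```

Because every call yields a row with the same columns, the results can be stacked later without special-casing the failures.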
We can write this to use the map() function from {purrr} as simply:
map(files, run_file)
This will apply the function run_file() to every element in files. It would still take a while, as it runs in sequence and we haven't done any parallelisation yet. However, once we've got our code into the {purrr} style, to convert it to {furrr} we just add future_ to the start of the function name (this works for most of the map() family):
future_map(files, run_file)
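To see that the two calls really are interchangeable, here is a toy sketch. slow_square() is a hypothetical stand-in for run_file(), and the Sys.sleep() call just pretends the work is expensive. No plan is set, so both versions run one element at a time.

```r
library(purrr)
library(furrr)

# Hypothetical stand-in for run_file(): any function of one argument works.
slow_square <- function(x) {
  Sys.sleep(0.1)  # pretend this is expensive
  x^2
}

inputs <- 1:4

seq_result <- map(inputs, slow_square)         # {purrr}
fut_result <- future_map(inputs, slow_square)  # {furrr}, same call shape
```

Both calls return a list with one element per input, so the rest of your code shouldn't need to change when you switch between them.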
How the work actually gets distributed is controlled by the plan() function, and there are a few options depending on your operating system. The default plan is sequential, so if we run future_map() on its own, it'll still take a while. We can add plan(multisession), which runs the work in background R sessions and is available on every OS (older versions of {future} also offered plan(multiprocess), which chose a strategy depending on your OS, but it has since been deprecated):
plan(multisession)
listOfDataFrames <- future_map(files, run_file)
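As a rough sketch of what setting a plan does, the following uses plan(multisession), one of the strategies provided by {future}, and records which process handled each element. The worker count of 2 is an arbitrary choice for illustration. By default {future} picks a number based on the cores it detects.

```r
library(furrr)  # also attaches {future}, which provides plan()

# Start background R sessions; workers = 2 is an arbitrary choice here.
plan(multisession, workers = 2)

# Each element may be handled by a different background R session.
pids <- future_map_int(1:4, function(i) Sys.getpid())

n_workers <- nbrOfWorkers()

# Go back to the default so later code runs sequentially again.
plan(sequential)
```

Resetting to plan(sequential) at the end also shuts down the background sessions, which is worth doing in a long-running R session.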
Another advantage of the {purrr} package is that your outcome can be simplified before it finally gets returned. At the end of your code, you've combined your listofDataFrames into dt by using rbind. This can be done in {purrr} with map_dfr() and therefore can also be done in {furrr} with future_map_dfr().
You can also add a progress bar to print to the console with the argument .progress = TRUE.
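A small sketch of the row-binding behaviour, using a hypothetical summarise_one() in place of run_file(). Note that map_dfr() and future_map_dfr() rely on {dplyr} being installed to do the binding.

```r
library(furrr)

# Hypothetical stand-in for run_file(): returns a one-row data frame.
summarise_one <- function(n) {
  data.frame(n = n, total = sum(seq_len(n)))
}

# The rows from every call are bound into a single data frame.
combined <- future_map_dfr(1:3, summarise_one, .progress = TRUE)
```

Here combined has one row per input, so there is no separate rbind step afterwards.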
Here's the final code. I also neatened up the function from your for loop to make it a bit smoother (you don't need to do quite as much assigning of variables):
library(furrr)
library(trend)  # assumed source of mk.test() and sens.slope()

files <- list.files(path = dir, pattern = ".*\\.nc$",
                    ignore.case = TRUE, full.names = TRUE)

run_file <- function(file) {
  river <- ncdf_timeseries(filename = file)
  sample_filename <- basename(file)
  dur <- ts(river, start = 1979)
  # tryCatch() returns the value of whichever branch ran, so assign its
  # result; assigning res inside the handler would not be visible out here
  res <- tryCatch({
    data.frame(sample_filename = sample_filename,
               mk = mk.test(dur),
               slop = sens.slope(dur))
  },
  error = function(e) {
    # Placeholder row, so a failed file still shows up in the output
    data.frame(sample_filename = sample_filename,
               mk = NA,
               slop = NA)
  })
  res
}

plan(multisession)
dt <- future_map_dfr(files, run_file, .progress = TRUE)
write.csv(dt, "all.csv")
Edit: fixed link and added .progress