As I'm personally not used to job arrays, my first suggestion would be a different approach: you can get this done with several CPUs (= cores = threads) within a single task.
#!/bin/bash
#SBATCH --job-name=parallel
#SBATCH --cpus-per-task=10
#SBATCH --time=10:00:00
#SBATCH --mem=16G
module load R/4.1.3
Rscript test.R $SLURM_CPUS_PER_TASK
Then your R script can start by retrieving that parameter:
args <- commandArgs(TRUE)
nb_cpus <- as.integer(args[[1]])
And use that number of CPUs with {foreach} or {furrr}.
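For example, a minimal sketch with {furrr} (the list of inputs and the summary step here are just placeholders, not from your script):
library(future)
library(furrr)
# Number of CPUs passed on by the sbatch script
args <- commandArgs(TRUE)
nb_cpus <- as.integer(args[[1]])
# Start one R worker per CPU allocated by Slurm
plan(multisession, workers = nb_cpus)
# Placeholder workload: summarise each data frame of a list, in parallel
inputs <- list(data.frame(x = 1:3), data.frame(x = 4:6))
results <- future_map(inputs, ~ colMeans(.x))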
With this method, there is a single job, so a single instance of R. That can be an advantage (you need to run the boilerplate code only once, and it's usually simpler to reason about), or an inconvenience (since the loop is executed in parallel, R will try to load all these data frames at the same time and can run out of memory).
The job array (or similarly, a multi-node (= multi-process) approach) is different: you write an R script that processes a single file, and you ask Slurm to start many jobs that each run that script independently. This is conceptually equivalent to calling sbatch many times. So, in that case, there is no for loop in the R code.
[I have not really used job arrays in the past, so I might be missing a better solution.] My impression is that the job array will only provide you with $SLURM_ARRAY_TASK_ID, so you need to keep track yourself of which file a given R job has to open. Something like this:
batch file:
#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --time=10:00:00
#SBATCH --mem=16G
#SBATCH --array=1-2
module load R/4.1.3
Rscript test.R $SLURM_ARRAY_TASK_ID
R file:
# Boilerplate ----
library(tidyverse)
df1 <- tibble(col1=c(1,2,3),col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9),col2=c(10,11,12))
files <- list(df1,df2)
# Find out where we are in the array ----
args <- commandArgs(TRUE)
current_task <- as.integer(args[[1]]) # commandArgs() returns character strings
# Run it ----
df3 <- as.data.frame(files[[current_task]]) %>%
summarise(across(everything(), list(mean=mean,sd=sd)))
write.table(df3, paste0("df", current_task))
If the boilerplate code is big, you might want to pre-compute it, write df1, df2, etc. to a directory somewhere, then launch a job array like:
all_files <- list.files("/path/to/dir", full.names = TRUE)
df <- read.table(all_files[[current_task]]) %>%
  as.data.frame() %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))
write.table(df, paste0("df", current_task))
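And if the number of files changes between runs, a common pattern (just a sketch, assuming the batch file above is saved as my_array_job.sh, a hypothetical name) is to compute the array size at submission time instead of hard-coding --array=1-2:
sbatch --array=1-$(ls /path/to/dir | wc -l) my_array_job.sh
The --array option given on the command line takes precedence over the #SBATCH directive in the script.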