Run an R script, which requires AWS credentials, on each core of a server as a docker image

alex628 · October 12, 2018, 12:45am

I have a script which a docker image runs. This script calculates some values then pushes those values to an S3 bucket. Everything has been tested and is running well.

The problem is that this script is slow. I would like to run this script on each core of a server with 32 cores. I have never done this, so maybe it is naive of me to think this is possible.

There is a similar thread on this topic on stackoverflow.

One solution given there is to run Rscript test_learn_script.R together with nohup (a POSIX command that makes a process ignore the HUP (hangup) signal) and & (a shell operator that runs a command in the background). With these, a bash loop can be written as follows:

#!/bin/bash
# ---------------------------------------------------------------------------------
# Name:        rscript_loop.sh
# Description: Runs an R script in the background on each iteration of the loop.
#              The goal is to parallelize the script. Scripts in R default to one
#              core. This loop should be able to extend to the number of cores on a
#              server.
#
# A solution provided here:
#  https://stackoverflow.com/questions/31137842/run-multiple-r-scripts-simultaneously
# ---------------------------------------------------------------------------------
for i in `seq 1 3`;
  do
  Rscript test_learn_script.R $i &
  done
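One caveat with a loop like this: once every Rscript has been sent to the background, the script itself reaches its end, and when the script is the container's CMD the container stops as soon as that main process exits, even if workers are still running. Below is a minimal sketch of a variant that waits for every worker. The helper name run_workers is hypothetical and not part of the original script.

```shell
# Launch N copies of a command in the background, passing each copy its
# index (1..N) as the last argument, then block until all have finished.
# Returns non-zero if any worker failed.
run_workers() {
  local n="$1" pids="" failed=0 i=1 pid
  shift
  while [ "$i" -le "$n" ]; do
    "$@" "$i" &              # start worker i in the background
    pids="$pids $!"          # remember its process ID
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || failed=1  # collect each worker's exit status
  done
  return "$failed"
}
```

Inside rscript_loop.sh the call could then be run_workers "$(nproc)" Rscript test_learn_script.R, where nproc (GNU coreutils) reports the number of available cores. Because the function only returns after the last worker exits, the container would stay alive for the whole run.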

I made a simple test_learn_script.R file for this question which looks like the following.

library(aws.s3)
test_env <- Sys.getenv("R_HOME")
AWS_ACCESS_KEY_ID <- Sys.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY <- Sys.getenv("AWS_SECRET_ACCESS_KEY")
AWS_DEFAULT_REGION <- Sys.getenv("AWS_DEFAULT_REGION")

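# Worker index passed by rscript_loop.sh as the first command-line argument
args <- commandArgs(trailingOnly = TRUE)
# Assumed naming scheme: one object per worker so parallel runs do not overwrite each other
unique_name <- paste0("classification_", args[1], ".csv")

classification_df <- data.frame(replicate(10, sample(0:1, 1000, replace = TRUE)))
s3write_using(classification_df, FUN = write.csv,
              bucket = "www.tsdata",
              object = unique_name)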

My test_learn_script.R file runs fine with the following command when it is not being iterated by the bash script.

docker run -e AWS_ACCESS_KEY_ID='***' -e AWS_SECRET_ACCESS_KEY='***' -e AWS_DEFAULT_REGION='***' my_docker_project
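Variables passed with docker run -e are set in the environment of the container's main process, and commands started from that process with & inherit them, so each background Rscript should see the same credentials. A quick way to confirm this before launching any workers is a check at the top of the loop script. This is a minimal sketch, and require_env is a hypothetical helper name, not something from the original script.

```shell
# Fail fast if any of the named environment variables is unset or empty.
# Arguments must be valid shell variable names.
require_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || exit 1 before the loop would stop the container with a clear message instead of letting 32 workers fail on upload.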

How can I parallelize my R code, which requires AWS credentials, to run on all 32 cores of a server as a docker image?

Also my Dockerfile is below:

FROM rocker/tidyverse:3.5.0
#
## install packages Ubuntu goodies
RUN apt-get update
#
##install R packages
RUN Rscript -e 'install.packages("forecast")'
RUN Rscript -e 'install.packages("devtools")'
RUN Rscript -e 'install.packages("furrr")'
RUN Rscript -e 'install.packages("lubridate")'
RUN Rscript -e 'install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))'
RUN Rscript -e 'devtools::install_github("tidyverse/ggplot2")'
RUN Rscript -e 'devtools::install_github("robjhyndman/tsfeatures")'
RUN Rscript -e 'devtools::install_github("ykang/tsgeneration")'
RUN Rscript -e 'devtools::install_github("alexhallam/tsMetaLearnWrap")'

# Add files in local machine directory
ADD . /usr/local/src/
WORKDIR /usr/local/src/
CMD ["./rscript_loop.sh"]

cderv · October 12, 2018, 6:13am

I see you are already using furrr: why don't you use future and furrr to parallelize on the 32 cores instead of going with docker for this? They are a very good way to parallelize your computation.

MarkeD · October 12, 2018, 8:21am

I do this on Google Kubernetes Engine, but the same principle applies to AWS. You have it in Docker already, so as long as your script is stateless you can launch a Kubernetes cluster, scale up the pods, and then decide how many VMs you would like to autoscale to.

Tamas has a great tutorial on this here

alex628 · October 12, 2018, 10:03am

Are you suggesting that I wrap test_learn_script.R in furrr? If so, that makes a lot of sense. I would still probably keep the code in a docker image, but since I would be using furrr to parallelize rather than a loop in bash, I should be able to pass the AWS credentials just fine.

alex628 · October 12, 2018, 10:03am

Thanks, I will check that out.

cderv · October 12, 2018, 10:18am

As you seem to use furrr already, it means you already parallelise in some way. Using future and one of its companion packages, I think you can add a level of parallelisation above that when launching your script.

Using Kubernetes could also be pretty simple, as you have already dockerized! Nice idea @MarkeD!