I'm trying to migrate from Jupyter Notebooks to RStudio Server due to the somewhat poor support for R in Jupyter (+ conda). With Jupyter Notebooks, I often have long-running jobs and multiple jobs running simultaneously (e.g., a long-running genomic comparison analysis in one notebook and a set of quicker statistical analyses in another notebook). As far as I can tell, RStudio Server (the basic version) only allows one R process at a time, and RStudio Server Pro still only allows one process at a time per project. Is that correct? Is there any (good) way to have multiple processes running per project? If not, then I'm probably going to stick with Jupyter Notebooks, given that they provide multiple notebooks running in parallel for the same "project". Otherwise, with RStudio Server I think I'd be stuck waiting for long-running jobs, or I'd have to write (somewhat) complicated asynchronous job code (e.g., write an R script and run it with system(wait = FALSE)).
For running multiple processes in R, you could use parallelisation and distribution. The {future} package is a great unified tool for that.
{future} works with several backends, like the parallel framework or HPC job schedulers, and it also makes it easy to use multiple R processes through {future.callr}, which can launch up to 125 R processes in parallel.
RStudio won't do that for you by default, but you can use R and some packages to help you parallelise.
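A minimal sketch of what that can look like, assuming {future} and {future.callr} are installed (the long computation here is just a placeholder):

```r
library(future)
library(future.callr)

# Run each future in its own background R process
plan(callr, workers = 4)

# The main session is not blocked while this runs
long_job <- future({
  Sys.sleep(60)            # stand-in for a long-running computation
  summary(rnorm(1e6))
})

# Do other work here, then collect the result when it is ready
result <- value(long_job)
```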
Thanks for the suggestions! future, doParallel, batchtools, etc. could definitely help make long-running jobs run quicker, but I am worried about dealing with external bioinformatics software that can take a while to run. With Jupyter Notebooks, I can document all bash calls to software for genome assembly, BLAST, metagenome analyses, etc., and these bash jobs can take a long time to run (e.g., a long BLAST job). If I try to do this with R Markdown + knitr, can I call a bash job in a bash code chunk and then switch to a different project while that bash job takes hours or days to run, or am I stuck either waiting for the bash job to complete, or keeping the bash job outside of R Markdown + knitr because it takes too long?
In other words, does R Markdown + knitr really work for documenting a bioinformatics pipeline? I know that I could use snakemake or other pipelining software for such cases (and I sometimes do), but that can require a lot of setup, which isn't needed for relatively simple pipelines.
I am not sure I follow. If you want to work on another project while an R process runs code from an R Markdown document, why not just open another new R session?
RStudio Server Pro has this feature too: you can open multiple sessions.
Otherwise, you could use {callr} or {processx} to run some code in a background process, for example (this can probably be launched from a code chunk, I think).
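A rough sketch of both approaches, assuming {callr} and {processx} are installed (the blastn command and its arguments are only illustrative):

```r
library(callr)
library(processx)

# Run R code in a background R process; the current session stays free
bg <- callr::r_bg(function() {
  Sys.sleep(3600)          # stand-in for a long analysis
  "done"
})
bg$is_alive()              # check on it later
# bg$get_result()          # fetch the result once it has finished

# Launch an external command-line tool (e.g. a long BLAST run) without blocking
p <- processx::process$new(
  "blastn",
  c("-query", "genes.fasta", "-db", "nt", "-out", "hits.tsv"),
  stdout = "blast.log", stderr = "blast.err"
)
p$is_alive()
```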
For this kind of use case, I usually "deploy" my Rmd document (or R script) to an execution environment different from my development RStudio Server, and launch the execution there (rmarkdown::render, for example). That way I can still work on other projects (dev or analysis) in my working environment in RStudio Server.
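For example, something like this sketch, where the render is kicked off in a separate, non-interactive R process (the file name is hypothetical):

```r
# Equivalent from a shell:
#   nohup Rscript -e 'rmarkdown::render("pipeline_report.Rmd")' &
job <- callr::r_bg(function() {
  rmarkdown::render("pipeline_report.Rmd")
})
```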
On this, there is the recent but very useful {drake} package: https://ropensci.github.io/drake/index.html
It has built-in support for parallel computing in the workflow. You may find it interesting for bioinformatics pipelines. It is very well documented.
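As a small sketch of the idea (preprocess(), fit_model(), and the file names are placeholders for your own steps):

```r
library(drake)

plan <- drake_plan(
  raw    = read.csv(file_in("counts.csv")),   # hypothetical input file
  clean  = preprocess(raw),                   # placeholder function
  model  = fit_model(clean),                  # placeholder function
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html")
  )
)

# Only outdated targets are rebuilt; jobs > 1 runs independent targets in parallel
make(plan, jobs = 2)
```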
That's all for my ideas and shared experience...
Thanks @cderv for all of the really useful suggestions! My research group currently doesn't have RStudio Server Pro (we just have the basic version), but we are looking into purchasing it. I didn't know about callr, processx, or drake. All seem very useful for what I'm looking for.
Since you've provided so much help, maybe you can provide one more bit of advice. I'd like each Rmd document to have a fully reproducible R environment. It seems that with packrat, the user can snapshot their environment, but I don't see a way to create a snapshot with an ID mapped to a particular Rmd document. In other words, if I were to re-run the code in an Rmd document, I want to be able to load the exact R environment that I used the last time I ran that code. It seems like checkpoint can be used for this, but checkpoint appears to be less flexible with the R packages that it can maintain (e.g., just MRAN). Any advice in this regard?
Packrat can help you generate a packrat.lock file that snapshots the state of your current session (the R version, and which packages in which versions). That way, packrat::restore will parse the lock file and recreate that state in a project (packrat) library. However, it will reinstall every package each time you want to run the Rmd, unless you have a common execution environment for all your Rmd files and use the package cache (there is a cache feature in packrat) so that each package is installed just once, and that installation is pointed to each time you recreate a specific environment.
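The basic workflow looks roughly like this (a sketch, run inside an RStudio project):

```r
packrat::init()       # set up a private, project-local library
# ... install and use packages as usual ...
packrat::snapshot()   # write packrat/packrat.lock recording package versions

# Later, on another machine or after a fresh checkout of the project:
packrat::restore()    # reinstall the exact versions recorded in the lock file
```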
This is the kind of behaviour that is implemented in another RStudio product, called RStudio Connect, which is an execution environment for documents (Rmd), APIs (plumber) and Shiny apps. With this product, the state is snapshotted in your dev environment and recreated on the server for each document (thanks to packrat, but you don't have to be concerned with it, RStudio has done it for you). Awesome product!
I am not sure I understand... checkpoint allows you to point to a specific dated snapshot of CRAN. This way, a script (or Rmd document) that has a checkpoint call (e.g., checkpoint("2018-02-15")) will install packages from that date to recreate the environment before running. You can use a temporary folder to install the library into, e.g., checkpoint("2018-03-16", checkpointLocation = tempdir()); this way it will be installed just for that run. It is another solution for reproducible scripts (see "Using Checkpoint for Reproducible Research").
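In a script or an Rmd setup chunk, that would look something like this sketch (the date and the dplyr call are just examples):

```r
library(checkpoint)

# Install and load packages as they existed on MRAN on that date;
# tempdir() keeps the snapshot library only for this run
checkpoint("2018-03-16", checkpointLocation = tempdir())

library(dplyr)   # now loaded from the dated snapshot library
```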
Is this how you understood checkpoint?
To clarify further, RStudio Server Pro supports multiple sessions per project, and across projects. Sorry if there was any confusion.
Thank you for clarifying how packrat can be used to create reproducible R environments for each Rmd file. RStudio Connect seems like a good option, but I'm wondering how well it works with installing and running non-R bioinformatics software for fully reproducible research. I'll need to look into it more.
In regards to checkpoint, I thought that it doesn't work with R packages hosted on GitHub or local R packages (unlike packrat), but maybe I'm wrong about this.
Thanks @slopp for clarifying. I had missed that in the RStudio Server Pro documentation.
I think you are right on this point, unless it has changed since the last time I used it...