I'm curious to see if others have thought about using docker or vagrant as a means of setting up reproducible environments with R. Installing a package like "prophet" or "rJava" is very difficult given heavy dependencies such as rstan and the other packages needed.
I haven't seen many resources on "best practices" for creating dockerfiles with R specifically, or for using CI services with them. Are there best practices that should be adopted, such as using Rscript vs install2.r?
My main motivation for posting this is simply centered around: What are best practices you use in your dockerfile with R, and why?
Have others used docker with R? If so, how, and why or why not? Does it even make sense to use services like these with R?
There is a lot going on with R and Docker. It seems to me that it is widely used, especially in development workflows for testing. It is also used for reproducible research, for R in production...
A trick I sometimes use to find use cases and examples: search GitHub for a keyword, filtering on the R language.
You get 27 pages of results if you do that for Docker and the R language.
Great question and lots of good info shared so far. I don't think the actual question has been answered, though.
I use docker to help make my projects more reproducible, but also more portable and thus easier to deploy. In addition, Docker forces me to be explicit about my application dependencies, which makes debugging and maintaining R apps easier. I've invested serious time learning how to use R and docker together, but I still don't know what best practices are.
There are definitely generic best practices for docker available, but R is a little different than what those articles typically address. Most articles on using docker assume you can pare your app down to the bare essentials. Many R apps do the opposite -- tie other apps together. The low-level language dependencies of R apps also make using docker more challenging. I have dockerfiles with 30 lines of installation steps, which can take 30 minutes to build the first time.
Lastly, while the rocker images are great, they aren't perfect. They often default to pulling the latest version of everything. This is nice when testing, but not when using them for production or reproduction. Rocker has versioned images now, but only for R, not for the underlying OS -- you get R 3.4.1 on debian:testing. I'd like to see more alternative options that use a minimal OS like alpine or ubuntu core as the base. I would also like to have a standard set of scientific library dependencies installed, the way the Anaconda distribution does. There is R support from Anaconda, but the docker images seem geared towards exploratory data analysis and not production R.
For anyone using our Pro products and deploying in a containerized environment, there are licensing considerations that you may be interested in knowing about. We have developed license structures that make deployment in such environments straightforward. Please email info@rstudio.com if you would like to discuss at any point.
@raybuhr Thanks, yeah the biggest pain point I've encountered is simply verifying from the logs when installation of a package into the container has started and when it has completed:
Maybe something along these lines:
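pkgs <- c("RCurl", "jsonlite")
for (pkg in pkgs) {
  # Only install packages that are not already present
  if (!(pkg %in% rownames(installed.packages()))) {
    message("Starting install: ", pkg)
    install.packages(pkg)
    message("Finished install: ", pkg)
  }
}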
I don’t think this is entirely fair to rocker. They have versioned images that are stable and I’ve been using those in production since they became available.
I use rocker and created my own container, which I call baser, that has all of my dependencies set up. For me, this is relatively tight and static, so while builds do take a long time, all my org's code can build off of the base and their own containers are fast to build. They just pull baser, add their code, and essentially run R CMD INSTALL.
That said, it is a huge pain to do all that compiling every time if you aren’t able to get around that. I wish more binaries were available.
With those you get unlimited private repos to make all your custom Docker images, and they rebuild every time you push to GitHub linked to a specific branch, so you can have production/dev/fork etc. to pull from.
I usually build on top of a specific Rocker image, adding all the dependencies as needed, and/or relying on the aforementioned containerit to build the Dockerfile. Then I push it to GitHub to have the Docker image available; the images are launched in their own VM.
The VMs run on their own version and pull data jobs etc., but if the code/environment updates I can run another VM with the new Docker to test it first before moving it over to the production main branch.
I think Docker is really useful for R in particular, being able to pin down the installed libraries etc.
Let's say I have some packages I want installed in my container. Do we really have to explicitly treat each line as its own bash command in our Dockerfile? I think this is somewhat hard to read, and I'd like to be able to concatenate the packages into a vector, but bash does not like it when the vector is split across multiple lines.
I'm assuming others have this problem as well? I'm not using install2.r because I can't tell when packages start and finish installing, which makes the log files very hard to decipher.
Is there a way to make something like this work without install2.r?
Do we really have to explicitly treat each line as its own bash command in our Dockerfile?
It is a bash command. If the concern is readability, I suggest two things:
Make your own base Dockerfile that has all the most common libraries installed, then call that in the FROM for your other docker files
Use \ to break out the lines so the dependencies are one per line
For example, I use the tidyverse and a lot of my Google API packages, so I have a dedicated Dockerfile for those and their dependencies:
FROM rocker/tidyverse
MAINTAINER Mark Edmondson (r@sunholo.com)
RUN apt-get -qqy update && apt-get install -qqy \
    openssh-client \
    qpdf
## Install packages from CRAN
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    zip \
    ## install Github packages
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
    MarkEdmondson1234/googleID \
    cloudyr/googleCloudStorageR \
    cloudyr/googleComputeEngineR \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
COPY Rprofile.site /usr/local/lib/R/etc/Rprofile.site
Then, for specific situations where something is needed beyond that, it's a case of calling it in FROM:
FROM gcr.io/gcer-public/persistent-rstudio
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    newPackage
...which is as readable as it gets for a Dockerfile I think.
I have some public images here with various setups, as well as private ones with private packages installed from GitHub.
You don't have to use install2.r or installGithub.r (from littler), but they are more convenient IMO and are available in the base rocker images. You can do as you describe using devtools::install_github/cran if you want, although devtools is very heavy just for installing packages; perhaps try remotes instead.
"Heaviness" is a factor for Docker images as the idea is you only have exactly what you need in them, so you can build on top of them without having too many unexpected clashes.
The problem with this approach is that install2.r does not tell you when a package install has started or completed. It does give you quite an elegant Dockerfile, but I'm more interested in when installation starts and completes for an array of packages, which devtools::install_cran does. This could help others decipher their logs for debugging purposes and see which part of a log involved compilation for a given package, as those tend to blur together at some points. I'm sure I'm not alone in this, but being able to follow the breadcrumb trail of install package -> install dependency -> install failure -> build error is vital for newcomers so they can see the error in their build.
If I want to make this readable on the Dockerfile, I'd like something like this:
So you can see each line groups similar packages together, but they are in one long vector, which should create fewer layers with one long command rather than six separate Rscript commands.
But bash does not like this approach due to splitting a string on multiple lines. Is there an easy way to make Rscript split a command across multiple lines? I'm assuming you need to escape the comma, and the white space to indicate where to find the end of the string?
What I'm trying to avoid is simply doing:
Rscript installscript.R
While it may look cleaner in a Dockerfile, it does not make the Dockerfile readable at first glance, and it forces the user to dig to figure out what is being installed in the container.
I recognise your use case and have come across similar issues myself. I tend to comment out the libraries if there are problems, but it does take a bit of detective work.
Perhaps the creators of rocker may be able to help. Myself, I can only see it as a compromise between having multiple RUN commands, so you can see the trail, and creating many layers:
FROM rocker/tidyverse
MAINTAINER Mark Edmondson (r@sunholo.com)
RUN apt-get -qqy update && apt-get install -qqy \
    openssh-client \
    qpdf
## Install packages from CRAN
RUN install2.r --error -r 'http://cran.rstudio.com' googleAuthR
RUN install2.r --error -r 'http://cran.rstudio.com' googleComputeEngineR
RUN install2.r --error -r 'http://cran.rstudio.com' googleAnalyticsR
etc..
...or making your own bash script that is used for installation and spits out more debugging info. I'll be interested in the solution if you get one that works for you.
Personally, I like this approach below, as I can comment out 5 packages and shorten the vector to see what the culprit is, but it is not as good as the install2.r approach.
I think this is the most readable solution if you wanted to use devtools to install a bunch of packages:
The outer string needs double quotes; single-quoting each package and escaping the end of each line makes it work.
It still might be nice to have a simple wrapper that wraps devtools::install_cran to output some ##### so that you can explicitly see what goes on within each install process.
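A minimal sketch of what such a wrapper could look like, using only base R. The names install_with_banner and installer are hypothetical, not part of devtools or littler. The default installer is install.packages, and devtools::install_cran could be passed in instead. It assumes a CRAN mirror is already configured (for example via options(repos = ...)), since install.packages needs one in a non-interactive session.

```r
# Hypothetical helper (not from devtools or littler): install packages one
# at a time and print a banner before and after each, so the build log shows
# where each install begins and ends.
# `installer` is any function that accepts a single package name.
install_with_banner <- function(pkgs, installer = utils::install.packages) {
  for (pkg in pkgs) {
    message("##### START install: ", pkg, " #####")
    installer(pkg)
    message("##### END install: ", pkg, " #####")
  }
  # Return the package names invisibly so the call can be chained or ignored
  invisible(pkgs)
}
```

Something like install_with_banner(c("RCurl", "jsonlite")) would then bracket each package's output in the log. Installing one package at a time is slower than passing the whole vector at once, which is the price paid for the clearer trail.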
As one of the maintainers at the Rocker Project I've really enjoyed this thread. We're very much a community-driven project and have learned a lot about best practices ourselves along the way, and we are still learning. The various tutorials online often reflect Rocker at different stages of that evolution, so I understand it can be confusing.
To help address this, we recently put together a little paper on the Rocker Project which speaks both to best practices and use cases: https://arxiv.org/abs/1710.03675. We're also building out our website to be a more accessible and up-to-date source of information: https://www.rocker-project.org/. Feedback on either would be great!
A few quick comments from my own perspective on the above issues: re installing, in Rocker project images I prefer to use install2.r with an alphabetized list of packages on new lines in the Dockerfile. This is easy to version control and keeps the Dockerfile clean; but more importantly, it includes the optional --error (or -e) flag so that a failing package install causes the build to fail. install.packages and devtools::install_cran and friends only throw a warning when installation fails (and not all warnings are due to that failure), and this makes things hard to debug. It's not perfect, but I find the default output pretty helpful for debugging. With our latest and devel images building nightly, I see some pretty interesting errors, like when MRAN servers have a hiccup over a 19 hour period, or when race conditions in CRAN package updates make for an unsatisfiable version request for a brief period, so debugging this is possible.
We're always interested in ways that make this easier. We've recently been testing an integration with binder (https://mybinder.org), where you can just drop a button onto your GitHub Readme that will launch a Rocker container on a server and install any additional packages you need by adding the install commands to a script called install.r in your repo. See: https://github.com/rocker-org/binder.
@jclemens1 Thanks for raising this issue. As you know, but the community isn't always aware, we at the Rocker project did just this back at the start of the project -- even though we're only packaging the community-edition / open source version of RStudio and Shiny, since there are still trademark issues to be aware of. We've tried to make this clear in our footer notes (e.g. on https://www.rocker-project.org/), but welcome any other advice on highlighting these important issues!
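To illustrate the point about warnings, here is a rough sketch of how a plain install.packages call could be made to stop a build: after installing, check that every requested package actually shows up in installed.packages() and raise an error if not. install_or_fail is a hypothetical name, and this is only an illustration of the idea, not how install2.r --error is implemented.

```r
# Hypothetical helper: turn a silently failed install into a hard error.
# `installer` is any function that accepts a character vector of package names.
install_or_fail <- function(pkgs, installer = utils::install.packages) {
  installer(pkgs)
  # A failed install typically only emits a warning, so verify the result
  missing <- setdiff(pkgs, rownames(utils::installed.packages()))
  if (length(missing) > 0) {
    stop("Failed to install: ", paste(missing, collapse = ", "), call. = FALSE)
  }
  invisible(pkgs)
}
```

When run through Rscript, the stop() call ends the script with a non-zero exit status, which is what makes a Docker RUN step fail.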
I did a little post (well, collection of tweets) re R & docker last week, and would definitely love any advice you have on what rocker resources might be worth adding.
I'm totally new to Docker, so it can be tough to discern what's relevant, etc.