(Context: This is about using Travis CI/CD to enhance reproducibility of random R projects, not package dev.)
Just this morning, I wasted another hour or so debugging a newly failed Travis build. (Turns out someone else had already wasted a day figuring out that I needed another system dependency.)
This kind of thing keeps happening and Travis CI has become a major productivity sink.
I understand that this is, in some way, inevitable to ensure reproducibility: the failed Travis build indicated that I was relying on some undocumented state on my desktop (said system dependency).
Travis just doesn't make this easy:
- Even with all the caching bells and whistles, build times on Travis can be pretty slow (>>20 mins), especially when LaTeX is involved. This can really drag out the debugging.
- Travis (now?) has a debug mode that you can SSH into, but it's kinda insecure (I hear), and it can still be hard to debug an R problem just from the shell of the headless VM you're being dropped into.
- Travis can sometimes be unreliable (connection timeouts, backlogs or other service disruptions).
- ...
- Most unnerving of all, debugging Travis is duplicated work, because I always need to manage the same dependencies on my desktop as well. If you add some (yet subtly different) production environment on top, say shinyapps.io, you're multiplying the sources of error.
All this led me to consider Docker again, where, supposedly, I could:
- Define all my (system) dependencies in a `Dockerfile` (using a versioned Rocker image; a minimal sketch follows this list).
- Build the image and spin it up on my desktop, doing all of my analysis inside it.
- Just to be extra safe, have each commit trigger some (hopefully faster?) CI/CD (say, Google Cloud Build) to rebuild the image and compile my `*.Rmd` or whatever.
- Profit, because I'd now only ever have to solve each dependency management problem once.
- (As a bonus, I'd have an image which might be more easily deployable/scalable, but that's a different ballgame).
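For concreteness, here's roughly what such a `Dockerfile` could look like. This is only a minimal sketch: the Rocker tag, the system library, and the packages are placeholders for whatever a given project actually needs.

```dockerfile
# Pin a versioned Rocker image so the R version (and its base packages) can't drift
FROM rocker/verse:3.6.1

# The kind of system dependency that otherwise lives undocumented on my desktop
# (libgdal-dev is just a placeholder example)
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgdal-dev \
    && rm -rf /var/lib/apt/lists/*

# Project-level R packages; install2.r ships with the Rocker images via littler
RUN install2.r --error dplyr rmarkdown
```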
I've raised a related (but broader) topic before, with great suggestions from @cole and @tareef. There's also already a ton of fantastic resources and packages, many of them listed in this thread. It all reads pretty encouragingly.
On the other hand, this RStudio document sounds pretty cautious:
For data scientists, the time between starting a project and writing the first line of code is an important cost. Often dedicated analytic servers outperform containerized deployments by allowing users to create projects with little overhead.
and:
Dockerfiles do not ensure reproducibility. A Dockerfile contains enough information to create an environment, but not enough information to reproduce an environment. Consider a Dockerfile that contains the command “install.packages(‘dplyr’)”. Following this instruction in August 2017 and again in December 2017 [will produce two different environments].
(I think you could get around this by using `install2.r` and MRAN in your Dockerfile, something like the sketch below? Also, absent good ol' packrat, the same problem exists on Travis.)
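To spell out what I mean there: a RUN line along these lines (reusing the image sketched above; the snapshot date is arbitrary and purely illustrative) should pin the installed package versions to a fixed date.

```dockerfile
# Install from a dated MRAN snapshot so the same Dockerfile yields the same
# package versions whether it's built in August or in December
RUN install2.r --error \
    --repos https://mran.microsoft.com/snapshot/2019-08-01 \
    dplyr
```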
Proper tooling for an analytics workflow centered on Docker will take an order of magnitude more work than supporting a traditional, dedicated, and multi- tenant analytics server.
yikes.
So, I'm a bit confused, and worried this might be one of those situations:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.
So I'm curious what other people's recommendations and experiences are with this:
Will dockerizing each R project save me time, at equal or greater reproducibility than the usual Travis workflow?
(As mentioned at the outset, the concern here is with reproducibility and iteration speed – not pkg dev, deployment or scalability).