I'm dreaming of a (hosted) RStudio CI/CD

I love R and the RStudio/tidyverse ecosystem, but despite the fact that I am a full-time R dev now, I spend way too much time managing (R and external) dependencies and maintaining R environments, including:

  • my local machine
  • shinyapps.io (for apps, both for projects and for packages)
  • travis-ci.com (for R CMD build but also general project CI/CD)
  • (soon) our own RStudio Connect server
  • at some point (god forbid) a docker container with all that stuff deployed to clients

I feel like I am solving the same problems (R and system dependencies) over, and over, and over again – or at least more than just once. This is now at a point where it's really hurting the viability of my work in the ecosystem (perhaps my fault).

The recent Ubuntu / R 3.5.0 shenanigan is a good example of this problem: my deploys failed because the travis-ci docker container (unaccountably?) already had R 3.5.0, but shinyapps.io did not (not RStudio's fault). So the two environments were out of whack, and I had to figure out why. It's a pretty minor thing that was quickly solved, but these things multiply with each R environment you have to worry about.

I know some people might get by just with deploying from within RStudio (rsconnect::...) but I would guess that with some collaborators, this kind of process gets weird fast (without any CI/CD).

I'm not here to complain or anything, I just wanted to find out whether other people were facing the same challenges, or whether I was somehow doing it wrong.
Perhaps, if I don't have this completely wrong, it might also be interesting for RStudio to gauge interest in related features or products.

I haven't thought this through, and I don't understand much of the scaffolding behind it, so this may be completely ignorant, but I've been dreaming for some time of a (hosted) RStudio CI/CD product (with deployment to shinyapps.io and perhaps https://rstudio.cloud at some point).

I understand that replicating something like Travis or Jenkins just for R might be insane scope creep, but I'd just really love to only have to worry about one canonical R build and deploy environment.

Several ideas, in increasing order of complexity, come to mind:

  • make shinyapps.io, rstudio.cloud, and self-hosted Connect travis-ci deployment providers.
  • let shinyapps.io and Connect optionally take their dependencies from DESCRIPTION (not packrat, which can be hard to reason about). DESCRIPTION, at least, is already used on the travis-ci R docker container and, of course, in R packages, so that would make things a little easier. (I understand that packrat is way more powerful than DESCRIPTION deps, and that DESCRIPTION does not ensure availability of old binaries (though Microsoft's CRAN time machine does). Packrat is just so complex, it often seems to cause more problems than it solves.)
  • let rstudio.cloud (and Connect) commit to GitHub from inside the respective browser UIs, instead of just saving. That would really dramatically lower the cost of entry.
  • create (or just document?) an easy way for me to download and run the shinyapps.io docker image, so I can debug it interactively on my machine (or just switch all my development into that container, so I don't have to worry about my local environment).
  • I would also be quite happy to cut travis out of the loop entirely, if only RStudio (Connect?) had a hosted variant and (better) CI/CD integration. (I previously mentioned this in ticket http://support.rstudio.com/hc/requests/19752 and Jeff Allen had some encouraging words).

I just want to be able to have a project (or package) locally, to write up the dependencies once and then to push it to some well-defined cloud environment at RStudio (or on-prem), and have it be tested and deployed there, all from the same environment, be it hosting a shinyapp, an RStudio Connect product, some bookdown deployment, or even an R CMD check for a CRAN-bound package.

I can't stress enough how happy I would be to pay for this on a monthly basis; even just the CI/CD service would easily be worth $50-100 a month to me, as long as I can cut down on the dependency hell.

Hope this is meaningful/helpful and I'm not being too unclear/incompetent.


Thanks @maxheld83, we have been hearing from more and more people that they would like help with getting pre-built binaries for Linux for both R and the packages. It sounds like what you are looking for here is the ability to define your bundle, get it tested in the cloud, and then be able to use it in a variety of places. What I am less clear on is whether all your use cases would be in the cloud, or whether you would want to be able to download the docker container that serves as a foundation for your work.

It may make sense to schedule a 30 minute call with a few people here to see if we can flesh out the requirements a bit more and make sure we understand the ask. If there are others who are interested, please let us know. I would love to see how we can help.


oh wow, thanks for taking the time to respond to this @tareef.

Let me try and explain a bit more (shorter, though :slight_smile:).

Like many people, I imagine, I currently have a pipeline that involves rendering (*.Rmd) or compiling (packages) R projects on:

  • local machine
  • travis (for CI/CD from GitHub)
  • shinyapps or rstudio connect or shiny server pro

each of which is an at least subtly different R environment. This increases complexity and causes hiccups such as last week's mismatch between (apparently) the shinyapps and travis R docker images, one of which had 3.5.0 and the other didn't, which broke the app deploy (the delay on shinyapps isn't the issue; the mismatch is).

This is further complicated by the way in which dependencies (both R and beyond) are specified:

  • local: homebrew or apt-get, and just random install.packages(), or of course packrat.
  • travis: DESCRIPTION for R packages, and .travis.yml for system dependencies (mostly apt-get)
  • shinyapps: shinyapps-package-dependencies for system deps, and dark magic packrat for R packages.
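For comparison, the travis side of this split is at least explicit: something like the following minimal DESCRIPTION (package and repository names are illustrative) covers the R dependencies, while system dependencies go under apt_packages in .travis.yml:

```
Package: myproject
Imports:
    shiny,
    dplyr
Remotes:
    r-lib/usethis
```

Travis's R support installs Imports and Remotes before the build, so this one file is the whole R-dependency story there — which is exactly what makes the shinyapps/packrat route feel redundant next to it.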

This just opens up a lot of sources for confusion, and I sometimes feel like I am solving problems several times.

I may not understand enough about the tech underneath to make a good suggestion for how to improve this.

But: using the same docker image on my machine and on shinyapps (and hosted Connect, and http://rstudio.cloud) would be fantastic indeed. Perhaps even integrated into RStudio, with a short pointer in the .Rproj file (or DESCRIPTION, or somewhere else) to whichever version of the canonical RStudio docker image I currently want.

Then, to get some CI/CD, it'd be really nice to be able to check and deploy directly from GitHub, and have some RStudio service launch and run, check, render and deploy my R project (with reference to the container), with some minimal (travis/jenkins-like) UI (and gh integration) which tells me the build status for every commit.

People who want to go off into the docker deep end can of course already do this. But I imagine I'm not the only person who doesn't want to provision my own docker image (even off of the popular R docker images). I'd actually prefer a canonical docker image maintained by RStudio, with limited dependencies (much like shinyapps-package-dependencies, actually).
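In the meantime, the closest approximation I know of is running one of the rocker images locally; a rough sketch (the image name, port and PASSWORD requirement are as I remember the rocker/rstudio defaults, so double-check against the rocker docs):

```sh
# Run RStudio Server in a container and develop inside it, so the
# working environment matches a well-defined image rather than my machine.
docker run -d -p 8787:8787 -e PASSWORD=changeme --name rstudio rocker/rstudio
# Then open http://localhost:8787 and log in as user "rstudio".
```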

Am I making any sense?

Yes, I think I understand the use case. Essentially, it would be nice if you could configure your working environment in one place, and then essentially make it usable in a variety of other places, but have someone else worry about how to make that docker image available in these different places. Do I have it mostly right?


yes, exactly.

I also wouldn't mind using a working environment (~ a docker image?), curated by RStudio, with reasonable (but limited) default system dependencies and a carefully tested migration path to new R releases and such.

I like the way the shinyapps.io images are limited, tested and maintained by RStudio. (The shinyapps-package-dependencies repo also seems to be an elegant and rigorous way to add edge case sys deps).

Perhaps this image might even default to a set of (nonconflicting) versions of some core R packages (say, tidyverse + r-lib).

Basically, I'd be happy if there was an easy way for me to use the shinyapps image locally in RStudio (via docker), as well as on some (RStudio?) hosted CI/CD service.

Lastly, I'd love it if rsconnect::deploy... would also allow me to use DESCRIPTION or Microsoft's checkpoint package to specify R dependencies. Anything but packrat under the hood, which I apparently am too stupid to use. This is related to the above concern, because (to me) the packrat-black-magic in rsconnect::deploy...ments greatly complicates things by adding another orthogonal dependency management.


So on your last comment about R package dependencies. Is your qualm with packrat that using the package to maintain your own environment is painful? Or is there some pain you are having with the deployment process itself?

I can resonate with the fact that the packrat usage in the rsconnect::deploy step does nothing to help with reproducing the environment locally that you had when deploying / committing / at a previous time / etc. Also, as much as I love packrat, I know there are some pain points in the user flow that make it tricky to use. Just want to make sure I have a good picture of what features you are looking for in package dependency management.


thanks for taking the time to circle back to this @cole and apologies if my comments are a little all over the map.
Also, disclaimer: I gave up on packrat 18 months ago or so, so I am not up to speed with what the package can do.

The deployment process via rsconnect::deploy... works fine, no qualms here (though I would like to have every git commit automatically deployed).
My qualms are indeed with the fact that, via the dark magic that is packrat, rsconnect::deploy... adds another orthogonal layer for specifying dependencies (on top of DESCRIPTION).

Another recent example to illustrate (hope I got this right): I had a commit with a package_foo::bar() call, but I had never added package_foo to Imports in DESCRIPTION.
Building on Travis worked, because said code was never called in my build script.
But then, oddly, rsconnect::deploy() (from Travis to shinyapps.io) failed, because it couldn't find package_foo to add to the manifest.
At first I was confused; then I realised that this was because, under the hood, packrat was apparently parsing the entire source for dependencies, and got mad because it couldn't find the package in the Travis session.

This is not a bug and I was able to fix this quickly, but these things add up, and sometimes I take a long time to figure them out. Seems to me like DESCRIPTION + packrat + x is too much.

I'd really love to standardise on one way to express dependencies, even if that's at a cost.

To me, DESCRIPTION is the most attractive way, because it's already used in two prominent use cases (pkg dev and travis CI/CD), and it covers a decent amount of edge cases (via GitHub Remotes).

So, my (probably ill-informed?) wish list right now would be:

  1. let me disable packrat in rsconnect::deploy(), and just let me bring my own DESCRIPTION for shiny apps, even if the app isn't a package (just as on travis ci/cd).
  2. merge/standardize the Travis CI/CD docker image with the docker images behind shinyapps.io, and let me use those everywhere and easily (locally within RStudio, on Travis CI/CD, on shinyapps.io).
  3. standardize on https://github.com/rstudio/shinyapps-package-dependencies as a way to specify system dependencies.

Finally, I'd love to pay for services around this. RStudio Connect meets Travis CI meets Heroku kind of thing.

Still happy to talk about this if you, @tareef and others are still interested.


Looked at from a distance, I now think I'm really starting from two quite separate pain points:

  1. packrat (inside rsconnect::deploy()) + DESCRIPTION is too much. I'd like to just use DESCRIPTION. DESCRIPTION is also human-readable, and just overall simpler than packrat.
  2. I'm having a hard time reconciling a Github-based CI/CD workflow with shinyapps.io and especially RStudio Connect. So I go through Travis in between, which is a subtly different build environment.

As a related aside: I would love to have Connect commit my code to git on every deploy!

These are great thoughts! Thanks so much for the thorough post. I can definitely resonate with DESCRIPTION and packrat feeling redundant, and then difficult to debug because the same information is essentially being stored in two places.

One thing that would help me clarify a bit - are you building a package? How are you using the DESCRIPTION file if not when building a package? Or are you building the package and then deploying something to do with the package?

Usually DESCRIPTION files are used to manage minimum viable dependencies of an R package (not necessarily an arbitrary R script). Further, the reproducibility of a DESCRIPTION file is limited because it only points at the latest version of a package. For instance, your Travis CI build using DESCRIPTION could fail if your deploy picks up a different version of a package than you used in dev (because CRAN or GitHub got updated). Further, DESCRIPTION is usually used for the minimum set of packages required to use a package's exported functionality (i.e. dependencies) in contrast to the minimum set of packages required to develop a package (i.e. build dependencies or developer dependencies or something). When using packrat and DESCRIPTION together, this distinction is usually how I keep myself sane (packrat for build dependencies, DESCRIPTION for use-able dependencies).

RStudio Connect / shinyapps.io use packrat, which locks down a specific package version and repository to be sure that the correct version of the package is installed (even if a newer version is released). Further, it flexibly allows installing a specific GitHub commit hash and newer/older versions of packages (checkpoint requires that you pick a date in CRAN history - hopefully that has all of the versions you want). That said, I know that the UX can be pretty clunky. My hope is that packrat or some replacement will ultimately mesh the vision with the user experience and allow for the type of single-source-of-version-truth that you are looking for.

I definitely hear you loud and clear on the CI/CD workflow being tricky to marshal with shinyapps.io and RStudio Connect. We are hopeful of improving that process in the near future, so I will make a note to let you know when we can improve that story!

A few follow-up questions:

  • Do you like just grabbing the latest version of all packages? Would you expect breakages/errors from picking up new versions on deploy? Would you be amenable to committing a file in your GitHub repository that specifies which package versions you are using locally for development?
  • Are you familiar with GitHub hooks? Would you be comfortable with using a GitHub hook to drive deployment and bypass the Travis CI build server?

I definitely love packrat for its vision and have become accustomed to some of the clunky UX. One piece you might look at using, if you are interested in tracking exact package versions over time is packrat::.snapshotImpl(".", snapshot.sources = FALSE) which just creates a packrat.lock file by parsing your code without any other "magic." Obviously, restoring an environment from that .lock file will require using more of packrat's shenanigans, but this can be a useful way to see the actual dependencies of your parsed code in a concise fashion!
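To make that concrete, the call looks roughly like this inside a project directory (note that .snapshotImpl is an unexported internal, so it needs the triple colon and its signature may change between packrat versions):

```r
# Write packrat/packrat.lock by statically parsing the project's code
# for package dependencies, without downloading package sources.
packrat:::.snapshotImpl(".", snapshot.sources = FALSE)
```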


interesting, never thought about it with that direction in mind.
I always thought about committing to github triggering a deploy to RStudio connect.

Perhaps committing from an RStudio Connect deploy would only work reliably if you're working alone; if you're collaborating, there could be merge conflicts.

I might be wrong, but I think the usual way for deployment platforms (such as RStudio Connect, or Heroku or what have you) is to deploy from git commits, not the other way around.

Yes, I agree. I'd be curious to hear more about why, @iain. Does deployment feel easier to do than committing? Are you looking to tie a specific version of deployed content to a GitHub commit (which the other approach would also allow)? Are you using git only locally and not with a server counterpart?

Sure - the main reason is that deploying is my current workflow. Committing and then deploying is an extra step and has the potential to get out of sync if the deployment fails. If Connect could commit on successful deploy, it would mean that everyone using our Connect server follows good code practice with no extra work on their part, which is a huge win!

Agreed, it might not be best practice for large teams, but most of our shiny work is done by individual developers.

The issue with GitHub triggering a deploy is that you have to have the infrastructure in place to allow it (i.e. own the CI system) and also have testing set up for the shiny app, which is nontrivial.


@cole some more details:


I'm kinda used to always having a DESCRIPTION around in any given project, because I always use Travis CI/CD.
I also use it in pkg dev, but mostly it's to talk to Travis CI.

re: DESCRIPTION vs. packrat.lock

packrat for build dependencies, DESCRIPTION for use-able dependencies

That's a great way to distinguish the logic and purpose of the two tools, thanks @cole!

I understand that packrat is vastly more powerful than DESCRIPTION -- really, it's a tool for a different purpose, as you described.

I guess my fundamental hangup with packrat was always that it (programmatically) wrote out a (human-readable) packrat.lock file (if I remember correctly?), and that this needed to be committed, even though it was kind of derivative.
Not derivative in the way of man/ from roxygen2, but derivative from past install_x() calls, which packrat would magically track.
I'm a bit of a clean commit freak, and it freaks me out when there's stuff in my commit, which I haven't really written myself.

I guess the bigger point (aside from my commit gripe) is that in terms of UX, I prefer a dependency tool where I explicitly, and manually have to state dependencies.
To me, this makes it a lot easier to version-control, and, more importantly, reason about unexpected behavior.

(Committing the sources, as packrat requires, is also a concern).

I better understand now that while DESCRIPTION gives me that control-freak UX I like, it's not, per se, the right tool for the job.


  • Do you like just grabbing the latest version of all packages? Would you expect breakages/errors from picking up new versions on deploy? Would you be amenable to committing a file in your GitHub repository that specifies which package versions you are using locally for development?

For now, I get around this problem by using the Remotes: field in DESCRIPTION quite a lot, referring to individual releases or even commit hashes.

The problem in that scenario is my local machine, where I sometimes, lacking the graces of packrat, have to run a lot of devtools::install_github(ref = ...) to get my library in the proper state.
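For illustration, that pinning-by-hand workflow looks something like this (the repository and ref are made-up examples):

```r
# In DESCRIPTION, pin an exact tag or commit:
#   Remotes:
#       r-lib/rlang@v0.2.0
# Locally, lacking the graces of packrat, reproduce the same state by hand:
devtools::install_github("r-lib/rlang", ref = "v0.2.0")
```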

Are you familiar with GitHub hooks? Would you be comfortable with using a GitHub hook to drive deployment and bypass the Travis CI build server?

Great idea, thanks @cole. I'm familiar with the idea, but never worked with gh hooks so far.

Admittedly, I'm also a bit partial to the ecosystem around Travis CI and I actually use it in all my projects, even outside of pkg dev.
For example, I'll run shinytest on Travis against some app.
Or I'll deploy some static website from some *.Rmd from Travis.
So, unless RStudio would offer all these things (which might just be crazy scope creep), I might have a hard time weaning myself off of Travis.

I definitely love packrat for its vision and have become accustomed to some of the clunky UX. One piece you might look at using, if you are interested in tracking exact package versions over time is packrat::.snapshotImpl(".", snapshot.sources = FALSE) which just creates a packrat.lock file by parsing your code without any other "magic."

This is a great idea, I don't think I was aware of this when I last used packrat.


One more question @cole: Travis CI supports a bunch of deployment providers, which I've found quite useful for GH Pages, Google Firebase, etc.
They don't, afaik, do anything that you couldn't do with a custom bash script, and wouldn't address any of the above issues, but they just remove a little bit of friction and give users a clear path to deployment.

Is this something that RStudio might be interested in supporting? shinyapps.io, rstudio.cloud, or even self-hosted products as deployment providers from travis?

I've considered just trying to hack this for myself, though I'm not familiar with the tooling and I guess the idea is that the service provider (RStudio) writes the deploy script, not some wannabe hacker :slight_smile:.

Very interesting thoughts. Regarding your well-articulated concerns for packrat:

I think this is a fantastic point. You are rightfully bothered by packrat's auto-snapshot feature and the other magic that happens behind the scenes. I, for one, never commit the packrat/src folder. I'm with you - committing sources makes no sense. There is a packrat option to gitignore packrat source that I always enable. The only things I commit are packrat.lock and packrat.opts. Setting auto.snapshot = FALSE will ensure that only snapshots you trigger will happen (i.e. no dirty commits).

Further, using .snapshotImpl(snapshot.sources = FALSE) will enable you to use the classic R library (.libPaths()) without the libpath magic of packrat. Again, the only updates to packrat.lock would happen when you .snapshotImpl. Granted, you will have to opt into the libpath magic if you ever want to restore the environment with packrat::restore(), but this could make for a decent dev workflow, and then again it is possible to edit the packrat.lock manually if you wanted to go that route too. (I explore questions about packrat ad nauseam, so apologies for the long response and feel free to read more).
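For readers wanting to try this, the options above can be set per project with packrat::set_opts(); a sketch (option names as I understand them from the packrat docs, so verify before relying on them):

```r
# Only snapshot when explicitly asked (no surprise edits to packrat.lock),
# and keep downloaded package sources out of version control.
packrat::set_opts(
  auto.snapshot  = FALSE,
  vcs.ignore.src = TRUE
)
```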

Regarding deployment, the only piece it seems you are really accomplishing on Travis is deciding exact package versions (again you just pull the latest versions). I guess my point is that for deployment, what if you were able to commit a single file (i.e. something like packrat.lock) and then use that to deploy to Connect / shinyapps.io. In such a case, the Travis CI build would be redundant (unless you are running a suite of tests on a pre-deployed something or other) - the Connect server / shinyapps.io could take care of the CI environment reproduction.

In related fashion, I think this would mean that a deployment provider from Travis would likewise be a little redundant. I.e. it would be important to determine what we are getting from a Travis build that we could not get from the repo directly. If a sufficient packrat.lock-y file was provided in the repo, we would not need anything that the Travis build would provide. Thanks for sharing about those, though, as I was unaware - I am definitely interested to check into this stuff a bit deeper!

On the other hand, running shinytest or other things on Travis CI sounds like a really interesting blog post! I'm not sure if you have a blog of your own, or if you would be interested in submitting to RViews, but I for one would expect some interest there!


Yes, I can see where you're coming from. Many R users are not familiar with git. I think there may be some ways to the solutions you envision without having Connect do the committing. I'll have to think on this a bit more, but I am hopeful that there may be a way to address making the commit process an easy part of a developer's workflow, as well as deploying without needing a formal CI system, shiny app testing, etc.


I would be interested to learn more why it wouldn't be a good idea to have an extra flag in the deploy command that will tell Connect to do the commit after a successful deploy based on the git configuration in the project.

I think there are several reasons. The short of the matter is that the .git folder is not published to the Connect server (and there is really no reason to include it). It would potentially make deployments much bigger, it would add a dependency to Connect (requiring git to be installed), and then even if Connect did have the ability to make a commit, there is no notion of a commit message, no way to get Connect to push the commit out to a server, no way for the user to control the commit behavior (git users typically have conventions around their commits), and, as was mentioned earlier, there would be version conflicts across users.

I think that brain dump is only scratching the surface of the challenges associated with such a feature. Basically, it would very much complicate the client / server interaction. It makes way more sense to do the commit on the client before / after deployment. All that you need then is a way of triggering a commit locally (there are R packages for that) and a tie into either the deployment process or RStudio to prompt the user for a commit message. This sounds like an RStudio Add-In to be honest. I think there is a way to accomplish your desire much more elegantly than the feature that you propose.
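A minimal client-side sketch of that idea, using the git2r package (deploy_and_commit is a hypothetical helper; error handling, pushing, and commit-message prompting are omitted):

```r
library(git2r)
library(rsconnect)

# Hypothetical helper: deploy, then commit only if the deploy succeeds
# (rsconnect::deployApp() throws an error on failure, aborting the commit).
deploy_and_commit <- function(app_dir = ".", msg = "Deploy to Connect") {
  rsconnect::deployApp(appDir = app_dir)
  repo <- git2r::repository(app_dir)
  git2r::add(repo, "*")
  git2r::commit(repo, message = msg)
}
```

An RStudio Add-In could wrap exactly this, prompting the user for the commit message before running it.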