How do you isolate package environments for R data products in production?

slopp · August 22, 2018, 1:20pm

Another common question when we discuss RStudio Package Manager is how to manage multiple users installing different versions of the same packages. Consider, for example, the case where you’ve developed a Shiny application that requires ggplot2 2.2.1, but for a new project your colleague wants to use the tidyverse, which requires ggplot2 3.0.0.

To understand the answer, it is important to understand the difference between a repository and a library. A repository contains uninstalled (though sometimes compiled) R packages. A repository can hold one or more versions of a package.

A library contains installed packages, tied to a specific version of R. A library can only contain one version of each package.

install.packages is responsible for taking a package from a repository and installing it into a library. The library command, in turn, loads a package into the R session.
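
To make the distinction concrete, here is a minimal sketch in base R that inspects both sides: the repositories that install.packages would download from, and the libraries the session installs into and loads from. It uses the stats package only because it ships with every R installation; the variable names are illustrative, and nothing here is specific to RStudio Package Manager.

```r
# Repositories: where uninstalled packages are downloaded from.
repos <- getOption("repos")
print(repos)

# Libraries: directories holding installed packages for this version of R.
libs <- .libPaths()
print(libs)

# library() loads an installed package from one of those libraries.
library(stats)

# Each library holds exactly one version of a given package.
stats_version <- packageVersion("stats")
stats_location <- find.package("stats")
print(stats_version)
print(stats_location)

# install.packages() moves a package from a repository into a library.
# Left commented out because it needs network access:
# install.packages("ggplot2", repos = repos, lib = libs[1])
```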

To handle multiple versions of packages across projects, then, you have to create an environment where different projects use different libraries. To determine which library to use, R uses the function .libPaths, but using this function manually can be a pain. Most languages have tools for managing “libraries” (virtualenv, npm, etc) and R has a number of options including a tool called packrat.
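
As a rough illustration of what tools like packrat automate, the sketch below points an R session at a project-specific library by hand using .libPaths. The directory name is hypothetical, and a temporary directory stands in for a real project folder.

```r
# Hypothetical project-specific library directory; a temporary directory
# stands in for something like <project>/lib.
project_lib <- file.path(tempdir(), "project-a-lib")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Remember the original search path so it can be restored afterwards.
original_libs <- .libPaths()

# Put the project library first. install.packages() installs into the first
# entry by default, and library() searches the entries in order.
.libPaths(c(project_lib, original_libs))
active_lib <- .libPaths()[1]
print(active_lib)

# Restore the original library paths when done with the project.
.libPaths(original_libs)
```

Doing this by hand for every project, and keeping track of which versions belong in which directory, is the part that becomes painful and that the dedicated tools take over.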

A much more detailed discussion is needed to cover these tools, including packrat. However, in many discussions I’ve had, it turns out that RStudio Connect meets the organization’s need to isolate projects. RStudio Connect uses packrat to automatically handle package dependencies and isolate project libraries. If you need to deploy-and-forget Shiny apps, R Markdown documents, or plumber APIs, start by looking into RStudio Connect. It may meet your needs even without RStudio Package Manager.

We’ve also worked hard to ensure RStudio Package Manager supports packrat and enables other options for handling change control.

cderv · August 22, 2018, 1:33pm

It is true that this is awesome. However, sometimes you can't deploy to an RStudio Connect server (the app is too heavy, not compatible with the current infrastructure, ...), so you need to recreate this on another server.

Using the packrat global cache mechanism is useful for that. We use it to deploy some apps on their own servers (that don't use Docker). It helps a lot to deploy quickly, especially when no dependencies have changed.

In fact, we mix RStudio Connect into the process. The app is deployed to RSC in the datalab for testing. When it is OK, we take the bundle, including the packrat.lock, and use it to deploy elsewhere without RStudio Connect. Very handy!

It is not always simple to prepare a project to use packrat when it has not been developed with it, especially when some packages come from internal sources. Packrat does not handle such a package well if it is not published somewhere accessible and was not installed from there. It would be awesome to be able to build the packrat.lock (or manifest) without all of packrat's automatic mechanisms. In fact, RStudio Connect helps us with this.

cole · August 22, 2018, 1:49pm

I'm curious to hear more about this - what in particular restricts publishing to RStudio Connect? Resource consumption (too heavy)? What does it mean to not be compatible with current infrastructure?

If all you want is the packrat.lock file, packrat::.snapshotImpl(".", snapshot.sources = FALSE) will generate the lock file for you. You still have a little bit of weirdness that can happen relative to package installations (as you mention), but it is helpful to circumvent some of the packrat "magic." To your other point, we are working to improve the lockfile / manifest generation process so that it is easier to do manually without deploying to Connect.
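
A rough sketch of how that call could fit into a two-step workflow: generate the lockfile where the app is developed, then restore from it on the target server. Both wrapper functions are hypothetical helpers written for illustration; they assume the packrat package is installed and that the lockfile ends up in the project's packrat/ directory. Only packrat::.snapshotImpl and packrat::restore are actual packrat functions.

```r
# Sketch only: both helpers are hypothetical wrappers around packrat calls.

# On the development machine: write the project's lockfile without
# downloading package sources.
write_lockfile <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("The packrat package is required to generate a lockfile.")
  }
  packrat::.snapshotImpl(project, snapshot.sources = FALSE)
}

# On the target server: install the package versions recorded in the
# lockfile into the project's private library.
restore_from_lockfile <- function(project = ".") {
  if (!requireNamespace("packrat", quietly = TRUE)) {
    stop("The packrat package is required to restore from a lockfile.")
  }
  packrat::restore(project = project)
}

# Example calls, commented out because they modify the project directory
# and need access to a package repository:
# write_lockfile("path/to/app")
# restore_from_lockfile("path/to/app")
```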

cderv · August 22, 2018, 3:31pm

Yes, the first thing is the computing resources needed. These comments are also based on our current RStudio Connect setup; that is what I mean by current infrastructure. The application to deploy is the result of the developer's work, and not everyone has the software engineering skills to design an application efficiently for a specific setup. For example, we currently have a small server that covers 90% of use cases, and it could host a resource-hungry app if the calculation can be sent elsewhere (using async and future). If not, I need a server big enough for the app. RStudio Connect is a centralized home for applications, and sometimes it is not the right home, either because an app would degrade the others or because the RStudio Connect setup can't evolve as rapidly as a dedicated environment. For example, an application using Spark would not currently be a good fit for our RSC instance, because we would have to set everything up on the shared server, and it is easier to do it elsewhere.

This is mainly why, in addition to RStudio Connect, we need a good process to easily deploy and recreate an environment on a server, or to dockerize an environment. I use packrat::.snapshotImpl(".", snapshot.sources = FALSE) a lot, and modify the result afterward if needed (mainly the CRAN repos URL). By "build the packrat lockfile", I meant being able to edit the lockfile more easily (it is not as easy as a YAML file), or to add dependencies without packrat needing to scan the code. jetpack is experimenting with such an approach.
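
For the repos URL edit in particular, something like the following base R sketch could work. The helper set_lockfile_repos is hypothetical, the demo lockfile and the example.com URL are made up, and the sketch assumes the lockfile header contains a single-line `Repos:` field.

```r
# Hypothetical helper: replace the 'Repos:' header line of a packrat lockfile.
# Assumes the field sits on exactly one line.
set_lockfile_repos <- function(lockfile, repos) {
  lines <- readLines(lockfile)
  is_repos <- grepl("^Repos:", lines)
  if (sum(is_repos) != 1L) {
    stop("Expected exactly one 'Repos:' line in the lockfile.")
  }
  lines[is_repos] <- paste0("Repos: ", repos)
  writeLines(lines, lockfile)
  invisible(lines)
}

# Usage on a small, made-up lockfile written to a temporary file.
demo_lockfile <- tempfile(fileext = ".lock")
writeLines(c(
  "PackratFormat: 1.4",
  "RVersion: 3.5.1",
  "Repos: CRAN=https://cran.rstudio.com/",
  "",
  "Package: R6",
  "Source: CRAN",
  "Version: 2.2.2"
), demo_lockfile)

set_lockfile_repos(demo_lockfile, "CRAN=https://packages.example.com/cran/latest")
updated <- readLines(demo_lockfile)
print(updated[3])
```

A line-based edit like this leaves every other record untouched, which is the main reason to prefer it over parsing and rewriting the whole file.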

Hope that is clearer!

And...

Awesome!! waiting for that!

cole · August 22, 2018, 6:27pm

I see. Thanks for the background! It sounds like some of that is just organization-based stuff and not necessarily limitations of Connect. The only feature that I can think of that would really make this easier for you in RStudio Connect is more configurable app resource isolation. You may still be limited by the size of the box, though.

Very helpful! Thanks for sharing!