I am looking for best practices on package dependency management, as I am about to deploy a shiny app. My goal is to have an infrastructure which allows me to easily develop new features locally, have them tested automatically (e.g. GitLab CI), and deployed to a server. Docker seems to be a good choice for this task. My question is mainly how I should take care of the packages my app depends on.
Options:
checkpoint: Only works with CRAN packages, not local packages, and there seems to be a bug regarding setting the repos option: Checkpoint Issue
Microsoft R Open: Similar to checkpoint, and a bit too locked to old package versions, with no chance to use newer package releases from CRAN or GitHub.
packrat: I have tried this several times now; it spends hours(!) installing packages and then fails when my code includes local packages. I also never really understood what benefit it offers over simply changing .libPaths via a project-specific .Rprofile to point at a library folder inside the project, so that all packages are installed there (this is the approach I chose so far). Then I would copy this folder into my docker image and have all dependencies in place?
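For reference, the .libPaths approach I mention is just a project-specific .Rprofile along these lines (the folder name "r-lib" is arbitrary):

```r
# .Rprofile at the project root -- sketch of the project-local
# library approach; "r-lib" is a placeholder folder name.
local({
  lib <- file.path(getwd(), "r-lib")
  if (!dir.exists(lib)) dir.create(lib, recursive = TRUE)
  # Put the project library first so install.packages() and
  # library() use it by default.
  .libPaths(c(lib, .libPaths()))
})
```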
Should I commit the library folder with all the package dependencies to my Git repo? Then I could have different versions of packages for different feature branches. Are there any drawbacks?
Would really appreciate some feedback, because the right choice for dealing with package dependencies will probably save me lots of time in the future during development and deployment.
I'm not sure exactly what context you are working in, but this is one of the problems that RStudio Connect was designed to solve. It addresses the package dependency / reproducibility problem using packrat, with the blue publishing button in the IDE handling all of the "dirty work" for you (much like shinyapps.io, if you are familiar). It does not use docker per se, but behaves in a very similar fashion, sandboxing content, etc.
If you do want to attack the problem directly in docker, this article may be useful to your quest as it discusses (the surface of) many of these options. packrat is probably my personal favorite, but it definitely has its drawbacks (one of which is building all of the packages from source). Some other gotchas to beware of:
Operating System differences between your docker image and your host, especially if you are "mounting" the packages into the docker image and not installing them into the image directly
local package installation with packrat requires a CRAN-like repo. Alternatively, you can host the packages on git (packrat knows how to handle git installations from remotes::install_github or devtools::install_github). Again, if you are in the enterprise, RStudio Package Manager may be a good fit. Otherwise, the miniCRAN and drat packages can get you going
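To sketch the drat route: you build your local package into a source tarball and insert it into a small CRAN-like repo that R (and packrat) can then resolve against. The paths and tarball name below are placeholders:

```r
# Host a local package in a drat repo so packrat can treat it
# like a CRAN package. Paths / file names are placeholders.
install.packages("drat")  # if not already installed

# After building the package (R CMD build mypackage), insert the
# resulting tarball into a local drat repository:
drat::insertPackage("mypackage_0.1.0.tar.gz",
                    repodir = "/path/to/my-drat-repo")

# Add the repo alongside CRAN so install.packages()/packrat can
# find the package there:
options(repos = c(getOption("repos"),
                  mydrat = "file:///path/to/my-drat-repo"))
```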
One of the main benefits packrat provides is the packrat/packrat.lock file, which allows you to "commit" / track your exact package dependencies at a particular moment in time (across local, GitHub, and CRAN packages). Some of the other semantics can get pretty messy to deal with (again, building packages from source can take a long time), but that packrat.lock file is gold
You're definitely in a good place for this discussion, though! I know there have been several related discussions of reproducibility and docker on Community lately, so it might be worth searching for some of those!
Thanks for your reply! I would probably go with one of the pro solutions like RStudio Connect (looks great), but it is a non-commercial project, so I have to figure out a different way.
packrat would detect and install all packages used anywhere in the project, right? Currently I have lots of R scripts which are not directly relevant to the current version of the shiny app (only the packages loaded in global.R are relevant), and not all of them need to be installed when I deploy the app. The build of the docker image should not take too long, otherwise it will be difficult to quickly fix bugs etc.
Do you have tips on handling private packages? Currently I depend on an R package in a private GitHub repository and I need a safe way to pass the credentials. If I use Git tags to mark specific package versions, will packrat be able to install the correct version from GitHub?
Sorry for the late reply here. I'm actually uncertain how packrat's installation of private packages works... I know it does record the SHA of the commit it installs, so it should be able to install the correct version from GitHub. I'm just not sure what hooks you have into the authentication process for the download. I think it uses a function like devtools::install_github(), so I suspect it should work as long as you can install the package from GitHub that way. Otherwise, I would say definitely open an issue in the packrat repo.
About having too many dependencies, etc. - one option that may benefit you in packrat, at least, is:
ignored.directories: Prevent packrat from looking for dependencies inside certain directories of your workspace. For example, if you have set your "local.repos" to be inside your local workspace so that you can track custom packages as git submodules. Each item should be the relative path to a directory in the workspace, e.g. "data", "lib/gitsubmodule". Note that packrat already ignores any "invisible" files and directories, such as those whose names start with a "." character. (character; empty by default)
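A minimal way to set this option for a project (the directory names are placeholders for wherever your data and submodules live):

```r
# Tell packrat to skip certain directories when scanning the
# project for dependencies; "data" and "packages" are examples.
packrat::set_opts(ignored.directories = c("data", "packages"))

# Confirm the setting took effect:
packrat::get_opts("ignored.directories")
```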
ignored.directories is exactly what I was looking for!
This is now my workflow:
Include mypackage as a Git Submodule
Locally:
Add mypackage to ignored.packages
Run packrat::init()
Create the packrat.lock file using packrat::snapshot()
Inside docker:
Copy the packrat.lock, .Rprofile and packrat.opts files to docker
Run packrat::on()
Run packrat::restore()
Run packrat::install("mypackage")
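As a single script, the docker side of the steps above looks roughly like this (assuming the files listed were copied in next to the app, and the mypackage submodule source sits in a "mypackage/" directory):

```r
# restore.R -- run during the docker build; assumes packrat.lock,
# .Rprofile and packrat.opts are already in place, and the
# mypackage git submodule was copied to "mypackage/".
packrat::on()                  # activate the project-private library
packrat::restore()             # install everything from packrat.lock
packrat::install("mypackage")  # then install the local submodule
```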
There are a few issues though.
The first one is that I had to make some minor changes to packrat::restore, as it fails due to mypackage even though it should be ignored. I will file an issue about this.
The second thing is that packrat::install("mypackage") fails if not all of its dependencies have been installed beforehand. I thought it would install the dependencies by itself; I am not sure what is happening there. But this is not a problem, because the dependencies will end up in the packrat.lock file if I do not ignore the folder containing the Git submodule.
I think what you're running into here is that R wants to install packages from a CRAN-like repository. I think this is the only time that the dependency resolution happens. Also, since mypackage is actually a dependency for your project, that is why the restore is failing (i.e. ignored.directories is just preventing packrat from reading that directory for packages). If mypackage is still in your lockfile, then packrat will yell at you if it is not installed.
If you want to ignore a package dependency / leave it out of your lockfile (i.e. mypackage), you are looking for "external packages" (again, from ?`packrat-options`):
external.packages : Packages which should be loaded from the user library. This can be useful for very large packages which you don't want duplicated across multiple projects, e.g. BioConductor annotation packages, or for package development scenarios wherein you want to use e.g. devtools and roxygen2 for package development, but do not want your package to depend on these packages. (character; defaults to Sys.getenv("R_PACKRAT_EXTERNAL_PACKAGES") )
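Setting it follows the same pattern as the other packrat options (the package names here are just examples):

```r
# Example: load devtools / roxygen2 from the user library instead
# of tracking them in the project's lockfile.
packrat::set_opts(external.packages = c("devtools", "roxygen2"))
```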
If you are really wanting to treat this package like a normal package, have the version tracked, and have dependencies resolved, you might be well served to create your own CRAN-like repository. This need only include your custom packages, as the rest can be from a CRAN mirror. The drat package can be helpful for maintaining this tiny little mini-repo, and you can always reference it with something like:
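(A sketch, with a hypothetical drat repo served from GitHub Pages under the account "myaccount"; substitute your own repo URL.)

```r
# Add the mini-repo next to a CRAN mirror, then install as usual.
options(repos = c(CRAN = "https://cran.rstudio.com",
                  myaccount = "https://myaccount.github.io/drat"))
install.packages("mypackage")  # now resolves from the drat repo
```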