Dplyr dependency on BH slows installs

chasec · February 16, 2018, 12:24am

Re: dplyr. This may be naive, but is there any way to reduce the size of the BH dependency, or to use only an excerpt of it? I'm having to remove dplyr from a Shiny app I've made because the BH package takes so long to install that I'm sure users will give up, thinking it isn't working.

This might not be a problem for regular R users, but for building standalone apps that are to be installed naively, it can be an issue.

Also see:

nutterb · February 16, 2018, 12:21pm

Is there a reason you need to install the package each time, or would it be sufficient to merely load the package?
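
A minimal sketch of the install-versus-load distinction, assuming the app can run a setup step at startup: install a package only when it is missing, so the slow install is paid once. `ensure_packages` and its `installer` argument are hypothetical names introduced for illustration; only `utils::installed.packages()` and `utils::install.packages()` are existing R functions.

```r
# Hypothetical helper (not from the thread): install only the packages that
# are not already present in the library.
ensure_packages <- function(pkgs, installer = utils::install.packages) {
  installed <- rownames(utils::installed.packages())
  missing_pkgs <- setdiff(pkgs, installed)
  if (length(missing_pkgs) > 0) {
    installer(missing_pkgs)  # the slow step, paid only on the first run
  }
  invisible(missing_pkgs)    # names of the packages that had to be installed
}

# Example call; "stats" ships with R, so nothing is installed here.
ensure_packages("stats")
```

This does not shorten the first install on a fresh machine; it only avoids repeating it.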

mara · February 16, 2018, 12:38pm

dplyr and BH are popular packages, and their time/value tradeoff is good enough that many other packages depend on them. All of which is to say that the packages themselves are unlikely to change quickly.

BH is in dplyr's LinkingTo field. As described in Hadley's R Packages book:

"packages listed here rely on C or C++ code in another package."

https://github.com/tidyverse/dplyr/blob/18e0d91b2469cf88b428aa05990ffabe846c5aff/DESCRIPTION#L51-L52

    LinkingTo:
        BH (>= 1.58.0-1),

Assuming your Shiny app timeframe is shorter than that of such a change, you might seek advice from the Shiny category, or look at some of the Shiny resources on modularizing, etc.:

https://www.rstudio.com/resources/videos/modularizing-shiny-app-code/

chasec · February 16, 2018, 3:22pm

Thanks Benjamin, you are right that only the initial installation takes so long; loading each time after that is fast. The problem arises mainly when the app is being installed by people who are not very computer-savvy.

chasec · February 16, 2018, 3:29pm

Thanks for taking the time to write up such an informative response, Mara; it's much appreciated! I'm making my way through Advanced R now, and R Packages is next on my list, but that's a great teaching point!

I guess my question really lies less in the Shiny aspect and more in whether dplyr relies heavily on many Boost C++ source libraries or on just one or two. If the latter, would it be possible (and worth it) to incorporate them directly into dplyr without requiring the entire BH library?

mara · February 16, 2018, 4:21pm

Happy to help!

As for this part, unfortunately I don't know C++, C, or Rcpp for that matter, so I can't be much help there.

There is a package Gábor Csárdi wrote (aptly named progress) that lets you add a progress bar to the R terminal. I'm not sure if it'd be possible to integrate beforehand, but thought I'd point it out: users (read: humans in general) tend to be much more amenable to waiting if they know how much of it they have to do!
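
A rough sketch of the idea, with one substitution stated plainly: it uses base R's `utils::txtProgressBar` rather than the progress package, so nothing extra has to be installed before the bar can appear. `install_with_progress` and the `installer` argument are hypothetical names. The bar advances once per package, so a single large package such as BH would still show up as one long step.

```r
# Hypothetical helper: install packages one at a time and advance a
# base R text progress bar after each one finishes.
install_with_progress <- function(pkgs, installer = utils::install.packages) {
  if (length(pkgs) == 0) {
    return(invisible(pkgs))  # nothing to do; txtProgressBar needs max > min
  }
  pb <- utils::txtProgressBar(min = 0, max = length(pkgs), style = 3)
  on.exit(close(pb))
  for (i in seq_along(pkgs)) {
    installer(pkgs[i])
    utils::setTxtProgressBar(pb, i)
  }
  invisible(pkgs)
}
```

A real call would look like `install_with_progress(c("shiny", "dplyr"))`; those package names are only examples.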

tjmahr · February 16, 2018, 5:40pm

Yep, BH takes forever to install. I had a project where I used packrat to save and restore package versions to reproduce an analysis, and the longest part of reproducing the analysis was installing BH. I've definitely heard of the BH bottleneck slowing down some programming workshops. It's big but only updates infrequently.

chasec · February 16, 2018, 7:42pm

Yep, indeed it was packrat I had issues with too. BH was taking hours to install.

cole · February 17, 2018, 3:12am

Not sure if you and @tjmahr are aware, but packrat has a global cache that you can use to get around this (on the same system, at least). The requirement is that you enable the global cache in your packrat.opts file and then start each packrat directory with a packrat.opts and packrat.lock file that point to the same version of BH that is already installed. Packrat will recognize that the global cache already has that version of BH installed and will use a symlink to tie that version to the new project.

This is basically the same as "re-using" an installed package, but gives you the benefit of working in a reproducible environment that treats every project as its own set of dependencies (i.e., you get fast package installs without sacrificing reproducible dependency management and per-project version tracking).
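
A minimal sketch of switching the cache on, assuming the packrat package is installed and the code is run inside an existing packrat project. `enable_packrat_cache` is a hypothetical wrapper; `use.cache` is the packrat project option, and platform support for it is worth checking in packrat's documentation.

```r
# Hypothetical wrapper; call it from inside an existing packrat project.
# Assumes the packrat package is installed.
enable_packrat_cache <- function(project = NULL) {
  # `use.cache` is packrat's project option for the global package cache;
  # set_opts() persists it to the project's packrat.opts file by default.
  packrat::set_opts(use.cache = TRUE, project = project)
  # Read the option back so the caller can confirm it took effect.
  packrat::get_opts("use.cache", project = project)
}
```

The function is only defined here, not called, because calling it changes the settings of whichever project is active.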

More detail here:

hughparsonage · February 17, 2018, 3:24am

Workarounds that I've used:

Install binary versions of packages
Cache between sessions
Avoid dplyr if compile time is critical
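
For the first workaround in the list above, a sketch under the assumption that CRAN publishes precompiled binaries for the user's platform (it does for Windows and macOS; on Linux `type = "binary"` is not available from CRAN). `preferred_type` and `install_prefer_binary` are hypothetical helper names.

```r
# Hypothetical helper: choose "binary" where CRAN ships precompiled
# packages, and fall back to "source" elsewhere.
preferred_type <- function(sysname = Sys.info()[["sysname"]]) {
  if (sysname %in% c("Windows", "Darwin")) "binary" else "source"
}

# Defined but not called here, since calling it needs network access.
install_prefer_binary <- function(pkgs) {
  utils::install.packages(pkgs, type = preferred_type())
}
```

A call such as `install_prefer_binary("dplyr")` would then skip compilation on platforms where a binary exists.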

On the last point (avoiding dplyr), there have been occasions where I've used a package for its runtime performance and then been bitten by its install time. For example, for a particular project I chose readr::read_lines as it's much faster (at least in R < 3.5) than base R's readLines. However, the install time for readr gobbled up any savings for my purposes, so I switched back to readLines. Similarly, for dplyr, you may be better off using base R functions: it's harder, but obviously possible, and in your case it may be that the extra programming time is worth it.
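
As an illustration of the base R route, here is a sketch of a typical filter / group / summarise step written without dplyr. The built-in `mtcars` data set and the `hp > 100` threshold are arbitrary choices for the example, not anything from the app discussed here.

```r
# Roughly what a dplyr pipeline like
#   mtcars %>% filter(hp > 100) %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))
# looks like in base R.

# Keep only the rows with more than 100 horsepower.
powerful <- mtcars[mtcars$hp > 100, ]

# Mean mpg per cylinder count; aggregate() returns a data frame.
by_cyl <- aggregate(mpg ~ cyl, data = powerful, FUN = mean)

# Rename the summary column to match the dplyr version.
names(by_cyl)[names(by_cyl) == "mpg"] <- "mean_mpg"

print(by_cyl)
```

The base version needs no extra packages at install time, at the cost of being a little less readable.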

chasec · February 18, 2018, 5:25pm

Thanks! I actually wrote the Shiny app mostly in base R, with heavy use of the apply family. I recently rewrote some of the program to use dplyr, thinking it would make it easier to onboard future grad students in our lab to maintain and improve the codebase. So it's not a huge deal, as my use case is likely small (creating easy-install .exe Shiny bioinformatics apps). The initial restore on a fresh computer is the slow step in packrat, though the global cache should definitely be considered for local use.

However, tjmahr pointed to a bigger use case: programming workshops. That's where more people will likely encounter issues, and organizers/instructors should remember this and plan accordingly.

cole · February 19, 2018, 2:19am

Yeah, workshops are tough. In that case, http://rstudio.cloud may be a good (though in its infancy) solution! People can just create their own copy of your project, which may or may not use packrat for portability to a local OS, etc. The spin-up time is much faster for copies/forks than re-installing all the packages, though! It's just not helpful if you are doing internal stuff for your company/institution.

As for dplyr, I think this discussion is useful if you end up needing other reasons to justify the rewrite to dplyr! Personally, I think the benefits are worth the additional dependencies.