Best practices for location of library() in .Rmd files

jtr13 · October 27, 2017, 11:11am

I struggle with where to put library calls: in one code chunk at the beginning of the file or as needed in individual chunks? Is there a recommended best practice? Thanks!

DavidB · October 27, 2017, 11:43am

I can't say whether this is best practice, but my practice is to put them in one code chunk at the beginning. At least, that's what I try to do. Inevitably, I will forget some packages that I need and then I will write a library call in the code chunk that I happen to be working on at the time I realise I need it. This often ends up causing problems later, for example, when I remove or stop evaluating the chunk with the library call, with the result that some code later in the file fails to run. So, FWIW, I would recommend putting library calls at the beginning, but I'd be interested to learn if there is some good reason not to do this.

ConnorKirk · October 27, 2017, 12:49pm

Placing your library() calls as needed in your markdown chunks certainly fits the flow when initially writing your markdown documents. Consciously moving the library() calls together to the top of the document will make it easier for future users of the document to see what libraries are used.

The Google R style guide recommends placing your library calls together:

General Layout and Ordering
If everyone uses the same general ordering, we'll be able to
read and understand each other's scripts faster and more easily.

        1. Copyright statement comment 
        2. Author comment
        3. File description comment, including purpose of
          program, inputs, and outputs
        4. source() and library() statements
        5. Function definitions
        6. Executed statements, if applicable (e.g.,
           print, plot)

PirateGrunt · October 27, 2017, 12:49pm

I'm another one for putting all the library calls at the beginning. This lets me know whether I've got namespace issues before I run any code that might be affected. I'm looking at you 'dplyr::lag'.

edgararuiz · October 27, 2017, 12:57pm

To me it depends on the intent of the RMarkdown. If it is to teach how to do something in R, then I opt for loading the library as close as possible to where I'm going to use it so it's easy for the reader to follow where the functions came from. But if the document is intended as a report then I load all of the libraries at the top of the RMarkdown, that way I'm free to change the order of the chunks without having to worry if a needed library has already been loaded.

nick · October 27, 2017, 1:09pm

One other issue with loading them later is that, if the late-loaded package creates any conflicts, your earlier code may not re-run unless you restart your R session each time (or unload the package). It's not something I run across as often these days, but when I was pulling in a bunch of packages for one function each (and didn't have the habit of using the package::function() notation when I did), my earlier code would fail. I think there was at least one time that it just produced a different result, because the conflicting function was still valid code but produced new output.
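To make the masking issue concrete, here is a minimal sketch built around the dplyr::lag example mentioned earlier in the thread. It assumes dplyr may or may not be installed, so that part is guarded, and the variable names are made up for illustration.

```r
# A small time series to shift.
x <- ts(1:5)

# stats::lag() shifts the time index of a ts object and leaves the values alone.
shifted <- stats::lag(x, k = 1)
shifted_values <- as.numeric(shifted)
shifted_start <- stats::tsp(shifted)[1]

# dplyr::lag() shifts the values themselves and pads with NA.
# Guarded because dplyr is an assumption here, not a requirement.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_values <- dplyr::lag(1:5)
}

# List the objects that are currently masked on the search path.
masked <- conflicts(detail = TRUE)
```

Writing the package prefix explicitly means the result does not depend on which package happened to be attached last.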

jtr13 · October 27, 2017, 1:40pm

A "best of both worlds" option here would be to have all the library calls at the beginning and then commented-out calls in the chunks so users know what's needed.

pssguy · October 27, 2017, 3:05pm

I tend to like that idea. But, pretty please, could we have more than one colour for comments, so that this sort of usage gets differentiated from other types of comment?

alexilliamson · October 27, 2017, 4:07pm

I put all the library() calls in a separate script. Then one of my first chunks in the .md file is something like:

source('code/00_dependencies.R')

Now I'm nervous there is some obvious drawback to this approach that I've missed.
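For readers wondering what such a dependencies script might contain, here is one possible sketch. The package names are placeholders (base packages, so the snippet runs anywhere), and the up-front check for missing packages is an assumption of this example, not something described in the post.

```r
# Hypothetical contents of code/00_dependencies.R.
# Replace the placeholder names with the packages the project really uses.
pkgs <- c("stats", "utils", "tools")

# Fail early with a clear message if anything is not installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  stop("Missing packages: ", paste(not_installed, collapse = ", "))
}

# Attach everything in one place.
invisible(lapply(pkgs, library, character.only = TRUE))
```

One trade-off of this layout is that a reader of the .Rmd no longer sees the package list without opening the sourced file.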

ConnorKirk · October 27, 2017, 4:10pm

An alternative method to demonstrate what libraries are needed in a chunk would be to use the package::function() notation in the chunk itself. Combined with having your library() calls at the top of a script, this could be the clearest for other readers, though maybe a little verbose for users familiar with the topic.

mara · October 27, 2017, 5:06pm

I do this as well, and, if for some reason (conflicting function names, etc.) I load the library earlier, I'll harken back to the fact that it was run, and explain the package right before. But, I've found that (especially when they're non-core tidyverse packages) it's a good way to show why X package is useful (e.g. stringr, lubridate, etc.).

mara · October 27, 2017, 5:08pm

I think this just makes for difficult reading, especially with packages like dplyr and tidyr, where the "verbs" (functions) will have meaning to someone new to R.

jtr13 · October 27, 2017, 5:27pm

I think the intention here is namespacing everything, such as dplyr::mutate, so the meaning is still there. I am trying to do this consistently with all functions except the really common tidyverse ones. I would, though, namespace tibble::add_column() since imho it's not obvious that it's tibble and not dplyr.

ConnorKirk · October 27, 2017, 5:33pm

That’s a good point. I had thought of it as a solution to the problem of “where is the function from?”, but in the grand scheme it may be too verbose and actually lose the value it was meant to add.

I suppose that generally it’s best to place the ‘library()’ calls together at the top, but for specific documents and audiences it may be better to include them as needed (for example in an educational context).

jennybryan · October 27, 2017, 7:11pm

I tend to always put them at the top for .R or .Rmd, except for the expository case described by @edgararuiz.

But sometimes I do add a comment mentioning why I load this package, especially if it's for one-off use of a specific function, e.g.

library(lubridate)             ## for guess_formats()

This way you also have a better chance of noticing that you're loading things you don't need to, e.g. if one day the script no longer calls lubridate::guess_formats().

ildi.czeller · October 27, 2017, 7:43pm

source("code/00_dependencies.R")

I do this as well, mostly because I typically use a few packages, but I use them a lot in different functions and chunks. If I use a package only once or twice, I won't load it at all, neither at the beginning of the script nor right before usage, but use packagename::functionname instead. I find this useful combined with packrat: I have a few packages used in a project, then I can source my_libraries.R from any Rmd or other script within that project.

jtr13 · October 27, 2017, 8:57pm

So... this all leads me to wonder: couldn't the library calls be automated? I.e. RStudio detects when a function is used from a package that is not loaded, and then adds it to a code chunk on top with all the library calls. Or adds it to the current chunk if you have the "add library() to current chunk" option clicked. If the function isn't recognized, you get a message. And of course if there's a conflict, you get a pop-up that asks "Do you want to select from the dplyr or MASS package?" Seriously, this is a task that a computer could do very well, and a human (at least this one...) cannot.
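As a rough illustration of the detection step only, the sketch below parses a snippet of code and reports function names that cannot be found on the current search path. It is not how RStudio or any existing add-in works; `not_a_real_function` and the variable names are hypothetical, and a real tool would also need to map each unresolved name to a package.

```r
# Code to inspect, held as text (a stand-in for the contents of a chunk).
code <- "y <- not_a_real_function(x); z <- mean(x)"
exprs <- parse(text = code, keep.source = FALSE)

# all.names() returns functions and variables; all.vars() returns variables only.
called <- setdiff(all.names(exprs), all.vars(exprs))

# Keep the names that do not resolve to a function anywhere on the search path.
found <- vapply(called, exists, logical(1), mode = "function")
unresolved <- called[!found]
```

Here `unresolved` would hold the one name that no attached package provides, which is the point at which a tool could suggest a library() call.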

Tazinho · October 28, 2017, 6:20am

Yes, there is at least base::autoload(). I think I have also seen another pkg with this functionality, but can't remember now.

I usually also call library at the top. However, when I do this very late, I pay a lot of attention to any conflict messages (contrasts), to not break any existing code. In case of any risks I use package::function notation.
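For anyone who has not used base::autoload(), here is a small sketch of what it does, using the splines package that ships with R as a stand-in. The function and package chosen are just for illustration; note that autoload() needs to be told in advance which name comes from which package, so it does not detect anything by itself.

```r
# Register interpSpline so that splines is attached on first use.
autoload("interpSpline", "splines")
attached_before <- "package:splines" %in% search()

# The first call triggers library(splines) behind the scenes.
x <- sort(stats::rnorm(12))
y <- x^2
fit <- interpSpline(x, y)
attached_after <- "package:splines" %in% search()
```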

mara · October 28, 2017, 2:07pm

I think it'd be a fairly easy extension of @milesmcbain's deplearning

jonocarroll · October 28, 2017, 11:23pm

Do you mean packup? https://github.com/MilesMcBain/packup

It does what @jtr13 is suggesting already.