Best practices...pacman and rio questions

davidr · November 19, 2017, 3:46am

I always install and load packages with 'pacman' and about 95% of the time I import/export data using 'rio'. But, I don't see many people using these packages in their R blog posts, or recommending these packages to users--especially new users--or in the example code posted here or on stackoverflow. I am curious about why this is the case. Are these packages not consistent with best practices for R coding? Do they reduce reproducibility of R scripts? Is there something else I'm missing?

I think these packages are great, particularly for new users. 'pacman' reduces the number of lines of code dedicated to installing and loading packages compared to base R. 'rio' allows my students to remember only the import() and export() commands to get almost any data file type into and out of R, compared to remembering, for example, read.csv(), read_csv(), or fread() just to import CSVs.
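To make the comparison concrete, here is a minimal sketch of the same CSV round trip in base R and in rio. The data frame and temporary file are made up for illustration, and the rio part only runs if rio happens to be installed.

```r
# Hypothetical example data and a throwaway file path
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
csv_path <- tempfile(fileext = ".csv")

# Base R: one reader/writer pair per file format
write.csv(df, csv_path, row.names = FALSE)
from_base <- read.csv(csv_path)

# rio: export()/import() choose the reader and writer from the file extension
if (requireNamespace("rio", quietly = TRUE)) {
  rio::export(df, csv_path)
  from_rio <- rio::import(csv_path)
}
```

The trade-off is that the format-specific function, and its defaults, are hidden behind the file extension.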

As I encourage my students to write more efficient R scripts and promote best practices, I find myself reflecting more on these issues with my own approach to writing R code.

Thanks!
David

dylanjm · November 19, 2017, 3:56am

In my personal experience, I've stopped relying on library(rio) and import(). I thought it was the most amazing function ever until I ran into an issue with a script that was trying to change a column to a DATETIME data type. I realized that import() uses fread() as its underlying function for .csv files, and it was messing everything up. Yes, I could've rewritten the code to accommodate rio, but it was frustrating.
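One way to avoid depending on any reader's type guessing is to read the timestamp column as text and convert it explicitly. This is only a base R sketch with a made-up file and column names, not a reconstruction of the script that failed.

```r
# Hypothetical CSV with a timestamp column
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,created_at",
             "1,2017-11-19 03:46:00",
             "2,2017-11-22 15:09:00"), csv_path)

# Read the timestamp as plain text so no reader guesses its type
events <- read.csv(csv_path, colClasses = c("integer", "character"))

# Convert explicitly, stating the format and time zone (UTC is an assumption)
events$created_at <- as.POSIXct(events$created_at,
                                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
str(events)
```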

Also, I think jumping straight to import() could be a bad idea when teaching students file I/O. You have to know which underlying functions import() is using and be familiar with how they handle objects and with their specific parameters. I've had students jump straight to import() and never figure out that they can use the skip parameter when reading a file in.
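For example, skip tells the reader to ignore a number of lines before the header row. A minimal base R sketch, using a made-up file with two lines of preamble:

```r
# Hypothetical export with two preamble lines before the real header
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Exported by survey tool",
             "Run date: 2017-11-19",
             "id,score",
             "1,2.5",
             "2,4.0"), csv_path)

# skip = 2 drops the preamble, so "id,score" is read as the header
scores <- read.csv(csv_path, skip = 2)
scores
```

data.table::fread() and readr::read_csv() have a skip argument as well. As far as I know, import() forwards extra arguments to whichever reader it picks, but that is worth checking against the rio documentation for the version you have installed.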

It would be nice to have an IO wrapper built around the tidyverse functions, something like tidy_io().

davidr · November 19, 2017, 4:44am

Agreed, a tidyverse IO wrapper would be ideal given the number of import/export packages it has anyway. Most of my colleagues use SPSS and my lab uses Qualtrics for all of our research. Most of our raw data gets imported (and exported, then shared with others) as an SPSS .sav file, so rio's use of data.table::fread() over readr::read_csv() doesn't impact us too much. I do prefer read_csv() over fread() for csv files.

nick · November 19, 2017, 2:13pm

In theory, I like the pacman style for sharing scripts that "just work" regardless of installed packages. However, there are a number of issues that end up discouraging me from using it in day-to-day work and code sharing:

1. If someone isn't familiar with pacman and is relatively new, it's yet another thing they need to understand in my code.
2. By using it, I'm "forcing" someone to install one or more packages, and adding an additional dependency to my script. This is relatively minor, but if someone doesn't want to use pacman generally themselves, I don't feel like I should push the issue by making my script require it. This could be overcome by using a modified version of the standard header where you only use p_load if pacman is installed, and library lines otherwise, but that could add to any confusion.
3. For newer coders, I think the default of suppressing package startup messages (IIRC) will make them miss potentially important information.
4. Listing a package that doesn't exist produces a warning, not an error.
5. If I've installed a package from Github or elsewhere and don't remember that, adding the package to p_load doesn't work. Combined with the previous point, the script won't actually fail until I get to a function that failed to load, instead of at the library statement.
6. The more compact nature makes it slightly less likely that people will remove packages that the script no longer needs (though this is obviously subjective).
7. It somewhat discourages people from considering the fact that every time you run install.packages you are potentially making a breaking change. If the script is being deployed, it's essential that you don't assume that packages can just be downloaded as needed -- instead, something like packrat is necessary.
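A rough sketch of the "modified header" idea from the second point, combined with failing early on missing packages (the fourth and fifth points). load_packages() is a hypothetical helper, not part of pacman, and the package names passed to it are placeholders chosen because they ship with R.

```r
# Hypothetical helper: stop early if anything is missing, never install
load_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    # An error here stops the script at the header, not at first use
    stop("Not installed: ", paste(missing_pkgs, collapse = ", "), call. = FALSE)
  }
  if (requireNamespace("pacman", quietly = TRUE)) {
    # install = FALSE keeps p_load from changing the environment
    pacman::p_load(char = pkgs, install = FALSE)
  } else {
    for (p in pkgs) library(p, character.only = TRUE)
  }
  invisible(pkgs)
}

load_packages(c("stats", "utils"))
```

Whether this is clearer than plain library() lines is debatable; it mostly shows that the fallback costs a dozen lines that a reader then has to understand.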

The last four issues just recently came up for me with a set of scripts from a consultant that my company is getting ready to deploy. They list 20+ packages, a couple that aren't on CRAN, at least one or two that don't exist, most of which aren't actually used in the given script.... While pacman isn't the most significant problem, I suspect that not using it would have led to at least marginally better code.

davidr · November 22, 2017, 3:09pm

@nick - lots of interesting points, some of which I haven't considered. I spend a lot of time working with my undergraduate and graduate RAs. I find the UGs are, for whatever reason, very keen on just doing a select-all, then running a script repeatedly... I don't get it at all. I end up sitting there with them for several minutes as they re-install the 'tidyverse' for the hundredth time if they use install.packages() rather than p_load(), which doesn't reinstall or update a package if it's already installed. No matter how many times I tell some students this, they just refuse to listen or learn. If you have any tips for teaching your 7th point to avoid this, I'm all ears.
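For comparison, the "only install what is missing" behaviour can be approximated in base R. install_if_missing() and its installer argument are hypothetical names for this sketch; the example call uses packages that ship with R, so nothing is actually installed.

```r
# Hypothetical helper: install only the packages that are not already present
install_if_missing <- function(pkgs, installer = utils::install.packages) {
  installed <- rownames(utils::installed.packages())
  to_install <- setdiff(pkgs, installed)
  if (length(to_install) > 0) installer(to_install)
  invisible(to_install)  # names of the packages that were passed to the installer
}

# Already installed, so a select-all-and-run does not trigger a re-install
install_if_missing(c("stats", "utils"))
```

This still changes the environment when something is missing, so it probably fits best in a one-time setup script.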

For suppressing package messages, I assume there's an option for that. But I always test new packages I add to a script using install.packages(); then, once I've evaluated the conflicts, I switch to p_load() in subsequent scripts for simplicity. I understand how this might not be an apparent approach to a new user, though. At the same time, package conflict errors tend to panic new users as well.

nwerth · November 22, 2017, 7:59pm

The only time I ever include environment-changing code, like install.packages(...), in a script is if that's the only purpose of the script. For example, a script to set up a standard environment for new users.

Environment changes shouldn't happen in scripts that do something else. If a user tries to load a package and finds it's not installed, they should choose whether to install it. Even if they're "newbie" users, the message Error in library(foo) : there is no package called ‘foo’ is not something to shield them from. They'll definitely see it again and need to know what it means and how to "solve" it.

If the script is used in production, whoever handles that environment really doesn't want your script changing it.

pgensler · November 22, 2017, 8:11pm

I don't think that there is anything you are missing, because you hit the issue spot on:
Even with the ever-growing package ecosystem, it can be somewhat of a hassle to get everyone up and running. I'm a huge advocate for pacman myself, but I have come to the point where I really wanted a sandbox for R that I could easily share with others and that is platform-agnostic. I think the main advantage of Docker is that it allows others to have quite a bit pre-configured, so it's faster to get people up and running without worrying about setup woes. I would encourage you to look at Docker with R, as I think it can really help shift the focus from 'who does not have package x loaded' to diving in with R.

This is a sample course that uses docker to set up R with RStudio for use within the browser:

If you would like more details on how to get set up, I have put together a blog post outlining how to get started with Docker, end-to-end. Hope this helps.
https://medium.com/@peterjgensler/creating-sandbox-environments-for-r-with-docker-def54e3491a3