Packaging external data sets: to `load()` or not to `load()`?

Bradley · May 4, 2019, 7:48pm

I'm working on an R package (https://github.com/bradleyboehmke/completejourney) that provides access to real-world retail transaction data (8 data sets in total), which has been used to train data scientists at my company and at a few universities.

A few of the data sets are too large to reside inside the R package (CRAN won't accept pkgs over 25MB). Consequently, we took a different route by providing a function (get_data()) that downloads one or more of the data sets from GitHub. Rather than returning the downloaded data sets as a list of tibbles, get_data() saves each data set as a tibble in the user's global environment.

We could not find anything in the official R documentation that says you cannot do this, and our help documentation clearly states that get_data() will save the data sets in the global environment. During the submission process, the CRAN reviewer for our first two submissions had no concerns with this, but on the third submission a different CRAN reviewer raised a concern and stated "Please do not modify the .GlobalEnv and just return a list of loaded objects." A few questions:

1. Is loading data into the user's global environment considered bad practice?
2. Is it worth pushing back on this third CRAN reviewer?
3. If pushing back on the CRAN reviewer is not an option, I'm looking for alternatives to returning the 8 data sets as a single list, which requires users to pull out the individual data frames themselves if they want them separately. The obvious alternative is to have them download each data set individually, which is tedious. Since this package is heavily used for educational purposes, many of the people using it may not be familiar with functional programming tools (e.g. lapply, purrr) that would simplify this, so I'm trying to make it as convenient for them as possible.
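
For concreteness, here is a minimal sketch of the list-returning design the reviewer describes. It is not the package's actual code: get_data_sketch() is a hypothetical stand-in that builds two tiny data frames instead of downloading anything. It shows that the function can stay free of side effects while the user still gets separate objects with one extra line, using base R's list2env().

```r
# Hypothetical stand-in for get_data(): builds two tiny data frames instead of
# downloading anything, and returns them as a named list.
get_data_sketch <- function() {
  list(
    transactions = data.frame(id = 1:3, amount = c(2.5, 4.0, 1.25)),
    promotions   = data.frame(id = 1:2, display = c("A", "B"))
  )
}

# The function itself has no side effects; the caller decides where objects go.
datasets <- get_data_sketch()

# Option 1: pull out an individual data frame by name.
transactions <- datasets$transactions

# Option 2: the user (not the package) unpacks every element into an
# environment of their choice. A fresh environment is used here.
target <- new.env()
list2env(datasets, envir = target)
ls(target)
```

A user who does want the objects in their workspace could pass envir = globalenv() to list2env() themselves, which keeps that decision out of the package.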

mishabalyasin · May 4, 2019, 8:28pm

I'm not that versed in CRAN-ways, so nothing concrete to suggest there, but as a workaround, you might consider using the zeallot package (https://github.com/r-lib/zeallot). It allows you to assign multiple elements at once:

library(zeallot)

# Assign both elements of the right-hand side in one step
c(x, y) %<-% c(0, 1)
x
#> [1] 0
y
#> [1] 1

So your get_data() can still return a list of 8 elements, but you can also have an example with the syntax for your problem:

c(data1, data2, ..., data8) %<-% get_data()

rstub · May 16, 2019, 8:45am

I would use a different route:

1. Package the data as a normal R package, e.g. completejourneydata, and make it available via your own repository. This is easy when using drat.
2. Add the data package under Suggests in completejourney.
3. Add your own repository under Additional_repositories in completejourney.

This technique is described in detail in an R Journal article and is used by at least the CRAN packages hurricaneexposure and swephR.

This way users do not have to download the data multiple times.
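
Because a package listed under Suggests may not be installed, code that uses it usually has to check first. A minimal sketch of such a guard, assuming the data package is called completejourneydata as in the example above; has_data_package() and list_full_data() are hypothetical helper names, not functions from any of the packages mentioned:

```r
# Check whether an optional (Suggests) package is installed, without attaching it.
has_data_package <- function(pkg = "completejourneydata") {
  requireNamespace(pkg, quietly = TRUE)
}

# Return the names of the data sets shipped by the data package,
# or NULL (with a message) when the package is not installed.
list_full_data <- function(pkg = "completejourneydata") {
  if (!has_data_package(pkg)) {
    message("Package '", pkg, "' is not installed; ",
            "install it from the additional repository to use the full data.")
    return(invisible(NULL))
  }
  utils::data(package = pkg)$results[, "Item"]
}

# Demonstration with a package that ships with R, since the data package
# itself may not be available on this machine.
head(list_full_data("datasets"))
```

The same requireNamespace() check can wrap examples, tests, and vignettes so that they are skipped cleanly when the data package is missing.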

Bradley · September 13, 2019, 1:27pm

Just for completeness, here was my final approach implemented in the completejourney package that provides retail shopping transactions, which is now on CRAN.

Most of the data sets were small enough that I could fit them entirely in the package. For the two data sets that were too large to include, I provide a downsampled version within the package, along with functions (get_transactions() and get_promotions()) that download the full data sets.

I also adopted @mishabalyasin's suggestion by including an option to download both data sets in one call and assign them as separate data frames:

c(promotions, transactions) %<-% get_data(which = 'both')