I'm developing an R package which needs to handle some very large data sets, and I'm looking for advice on how to best deal with situations where the input data exceeds a user's memory.
The package helps users download data sets from an online source, then automatically parses and reshapes them into a consistent format. The data is provided as zipped CSV files, which the package downloads, unzips, then reads, cleans and reshapes via `tidyr::pivot_longer()`. The function users interact with looks something like this:
```r
get_data <- function(url, ...) {
  zip_file_path <- .download_file(url, ...)
  csv_file_path <- .unzip_file(zip_file_path)
  tbl <- .parse_data(csv_file_path, url = url)
  return(tbl)
}
```
Unfortunately, some of the data sets have grown so large that the pivot operation eats up all available memory, crashing R on machines with less than ~32 GB of RAM. I'm trying to find a way to let users with moderately specced machines still use the package, but none of the options I've considered seems ideal:
- Using reshape2 instead of tidyr doesn't prevent hitting the memory limit.
- I could allow users to pass an expression to `get_data()` that would be used to subset the data before the pivot operation (see the first sketch after this list). Unfortunately this is tricky to implement and would require passing the expression down to the internal `.parse_data()` function (which has many potential pitfalls, as Hadley Wickham explains here). Users would also need to know the structure of the data to formulate a subsetting operation.
- I could make pivoting optional via an `auto_pivot = FALSE` parameter (see the second sketch after this list). That way, `get_data()` would take care of downloading, unzipping, reading and cleaning the data, and users could subset the result themselves before feeding it to a new `pivot_data()` function. Unfortunately, this makes the package more complex, somewhat defeating the point of having a single convenience function.