How to work around user memory limitations when developing R package

dreamon · September 4, 2022, 3:27am

I'm developing an R package which needs to handle some very large data sets, and I'm looking for advice on how to best deal with situations where the input data exceeds a user's memory.

The package helps users download data sets from an online source, then automatically parses and reshapes them into a consistent format. The data is provided as zipped CSV files, which the package downloads, unzips, then reads, cleans and reshapes by way of tidyr::pivot_longer. The function users interact with looks something like this:

get_data <- function(url, ...) {
  zip_file_path <- .download_file(url, ...)
  csv_file_path <- .unzip_file(zip_file_path)
  tbl           <- .parse_data(csv_file_path, url = url)
  return(tbl)
}

Unfortunately some of the data sets have gotten so large that they cause the pivot operations to eat up all memory, crashing R on machines with insufficient RAM (<32GB). I'm trying to figure out a way to allow users with moderately specced machines to use the package, but none of the options seem ideal.

Using reshape2 instead of tidyr unfortunately doesn't prevent hitting memory limits.
I could allow users to feed an expression to get_data() that could be used to subset the data prior to running the pivot operation. Unfortunately this is tricky to implement and would require that I pass the expression to the inner .parse_data() function (which has many potential pitfalls, as Hadley Wickham explains here). Users would also need to know the structure of the data to formulate subsetting operations.
I could make pivoting optional by allowing users to specify an auto_pivot=FALSE parameter. This way, get_data() would take care of downloading, unzipping, reading and cleaning the data. Users could then subset the data themselves before feeding the result to a new pivot_data() function. Unfortunately, this makes the package more complex, somewhat defying the point of having a single convenience function.

FactOREO · September 4, 2022, 6:00am

Could you implement an option to download just a subset of the data, so that the user can use purrr::map() or lapply() to use the function on chunks of the data?

If you only reshape the data, giving an option to do it in chunks would probably one of the easiest solutions.

mgirlich · September 4, 2022, 6:16am

You have a couple of options:

Use vroom to read the CSV files. It uses ALTREP which avoids allocating much memory.
Subsetting the data is now usually relatively easy thanks to the tidyverse consistently using tidyselect. You could let the user "explore" the data with a separate function by just reading the first 20 rows or so and returning them.
Then you could add an argument cols to get_data() and use it like get_data(url, cols = c(id, x, y, value)). In vroom you can specify which columns to read vroom(path, col_select = cols).
You could use DuckDB + dbplyr. DuckDB can handle bigger than memory data. See this article.

dreamon · September 5, 2022, 11:10pm

Importing only a subset or data, or only the first couple of lines, seems like a sensible compromise. vroom also looks very promising. I'll give these options a shot.

Fantastic suggestions. Thank you so much!

jimbrig2011 · September 28, 2022, 10:15pm

I've had success in the past using the bigreadr package for dealing with data like this.

Today I tend to use the qs (quick serialization) package with vroom and purrr implementing some caching helpers: see jimstools/utils-cache.R at main · jimbrig/jimstools · GitHub

dreamon · October 1, 2022, 1:40am

I'll give it a try. Thanks for the recommendation!

Flm · October 1, 2022, 10:45am

This video could be helpful Using the archive R package to read and write tar.gz and other archive files (CC250) - YouTube

system · October 19, 2022, 3:28am

This topic was automatically closed after 45 days. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.