Where to save downstream binary outputs in a package?

maxheld83 · February 20, 2018, 8:41pm

Part of my package typesets (small) documents to all sorts of binary formats, via pandoc, latex and pdf2svg. These binary files (*.pdf, *.jpeg, *.svg) are then displayed in all kinds of places in package functions and also shown in a shiny frontend.

The details don't matter and are not interesting here. Suffice it to say, we have some binary blobs which:

- are somewhat expensive to generate (yikes, LaTeX is slow!)
- are somewhat large, but not prohibitively so (they can all stay in memory)
- are entirely derivative of the R code used to generate them (they are not the source, so they should not be put under version control)

I'd like to follow these best practices / have these features:

- isolate side effects (to the file system, in this case) as much as possible,
- cache the binary blobs, so I can inexpensively access them (instead of rebuilding them),
- have this cache be invalidated when the upstream R code changes.

My (preliminary) plan is to:

- have some generate_bin() generate these blobs to tempfile() and then immediately read them back into R as blobs, maybe as blob::blob()
- memoise generate_bin(), perhaps using memoise::memoise()
- provide some function to write these blobs to disc, should users want that (so I can isolate the side effects).
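
For readers following along, the plan above could be sketched roughly as follows. This is only an illustration under assumptions: the body of generate_bin() is a placeholder (writeLines() stands in for the real pandoc/LaTeX/pdf2svg step), write_bin() and generate_bin_cached() are made-up names, and memoisation is applied only if the memoise package happens to be installed.

```r
# Hypothetical stand-in for the real typesetting pipeline: write a file to a
# temporary path, then read its bytes back into R as a raw vector.
generate_bin <- function(text) {
  path <- tempfile(fileext = ".bin")
  on.exit(unlink(path), add = TRUE)   # remove the temp file when done
  writeLines(text, path)              # placeholder for the expensive step
  readBin(path, what = "raw", n = file.size(path))
}

# Memoise if the memoise package is available; otherwise use the plain function.
generate_bin_cached <- if (requireNamespace("memoise", quietly = TRUE)) {
  memoise::memoise(generate_bin)
} else {
  generate_bin
}

# The only function with a lasting side effect: write the bytes to a path
# chosen by the user.
write_bin <- function(blob, path) {
  writeBin(blob, path)
  invisible(path)
}
```

With this shape, the file system is only touched transiently inside generate_bin() and explicitly in write_bin(); everything else works on raw vectors in memory.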

Is this completely crazy?

Of course I could also just write the blobs to disc, and track the file names in the package, but that kinda scares me:

- where would/should such blobs canonically be saved?
- how can I cache them?
- what if the user changes the working directory or some such shenanigan?

I'd appreciate any guidance or thoughts on this.

mishabalyasin · February 21, 2018, 10:19am

Your question is quite monumental, so I don't have a good answer.

There is a package called drake by @wlandau. It is not specifically for package development (at least I don't think so), but it has a lot of features that you seem to require (e.g., storage with hashing and other jazz).

maxheld83 · February 21, 2018, 11:05am

thanks @mishabalyasin for the drake pointer; that seems a bit overkill for me at this point.

I'm really just wondering whether it's generally better to store such objects as raw vectors in R, or as files on disc, and I'm leaning towards raw vectors.

hoelk · February 21, 2018, 1:15pm

If you save the files to disk, the canonical way would be to just use different tempfile()s. If you want persistent storage across R sessions, I don't think there is a way around a user-defined cache directory (it doesn't matter whether you want to store blobs or .pdfs).
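
A user-defined cache directory might look something like the sketch below. The option name "mypkg.cache_dir" and the helpers cache_dir() and cached_file() are invented for illustration; the default falls back to a folder under tempdir(), which only lasts for the current session.

```r
# Hypothetical helper: resolve the cache directory, preferring a user-set option.
# Without the option, fall back to a session-only folder under tempdir().
cache_dir <- function() {
  dir <- getOption("mypkg.cache_dir",
                   default = file.path(tempdir(), "mypkg-cache"))
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}

# Return the cached file for `key` if it exists; otherwise call `build`,
# a function that takes an output path and writes the file there.
cached_file <- function(key, build, ext = ".pdf") {
  path <- file.path(cache_dir(), paste0(key, ext))
  if (!file.exists(path)) build(path)
  path
}
```

A user who wants persistence across sessions would set something like options(mypkg.cache_dir = "~/.cache/mypkg") before calling the package. On R 4.0 or later, tools::R_user_dir("mypkg", which = "cache") is another possible default.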

As for the rebuilding of the expensive blobs, it seems that something like make, drake or remake would be cut out for your task (though I have little experience with them).

david2 · February 22, 2018, 11:18am

My first idea (and sorry if I'm misunderstanding) would be to use the inst/data/ dir to save them in whatever formats you like, so you can inexpensively access them anytime, and add them to .gitignore to keep them out of version control. As for tracking changes, if I get you correctly, maybe you could store a hash internally and trigger a rebuild whenever you need one? Just my 2 cents...
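
The "store a hash and rebuild on change" idea could be sketched in base R along these lines. The names build_key() and needs_rebuild() are hypothetical, and the key only reflects the deparsed source of the one function passed in, not of anything that function calls.

```r
# Compute an MD5 key from a function's deparsed source plus its input.
# tools::md5sum() hashes files, so the text is written to a temp file first.
build_key <- function(fn, input) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeLines(c(deparse(fn), as.character(input)), tmp)
  unname(tools::md5sum(tmp))
}

# A rebuild is needed when no key was stored yet, or when the stored key
# no longer matches the current function source and input.
needs_rebuild <- function(stored_key, fn, input) {
  is.null(stored_key) || !identical(stored_key, build_key(fn, input))
}
```

The stored key would be saved next to the output file (or inside the cache), and compared before each access to decide whether the expensive step has to run again.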

maxheld83 · February 22, 2018, 12:07pm

thanks @david2; I should have phrased my question better.
The blobs are created by users at runtime, not at build time, so I am guessing saving in inst/data/ won't work here.

I'm going to go with memoising the function and perhaps later migrating to drake, if the whole thing becomes too slow (unlikely).

As the drake docs note, memoise::memoise() is kind of a poor man's cache, but it'll do for me for now.

maxheld83 · February 22, 2018, 12:08pm

thanks @hoelk; that's what I'll do, write to tempfile() and then read from it again.

memoise::memoise() should do for now, but in the long run, I might migrate to drake which seems really awesome and powerful.

hoelk · February 22, 2018, 7:03pm

Hmm, if you're working on a reusable package, memoise might even be the better solution. Drake surely has many features, but it also comes with lots of dependencies, which is something you might want to avoid.