How do you structure your code to add checkpoints/saves for bits of code that take too long to run

TL;DR read the topic.

To expand: what are some best practices for inserting checkpoints into your code, so you avoid re-running the bits that take too long and don't need to be re-run every time?

I tried breaking the script down into separate .R files, with each .R file ending with a saveRDS() call and each new file starting with a readRDS(), but it doesn't work well (at least not for me).
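
In other words, something like this (a rough sketch; the file names and the expensive_fit() step are made up):

## 01_fit.R -- ends by caching the expensive result
## (expensive_fit() is a placeholder for the slow step)
fit <- expensive_fit(raw_data)
saveRDS(fit, 'fit.RDS')

## 02_plot.R -- starts by reloading it
fit <- readRDS('fit.RDS')
plot(fit)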

I also tried inserting user prompts with something like askYesNo(), but I don't like the added complexity at each step of the way.
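
What I had in mind was roughly this (a sketch only; slow_step() and the cache file name are made up):

if (isTRUE(askYesNo('Re-run the slow step?'))) {
    result <- slow_step()           # the expensive computation (placeholder)
    saveRDS(result, 'cache.RDS')
} else {
    result <- readRDS('cache.RDS')  # reuse the cached result
}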

Any other common practices?
Thanks!

I normally put the parts that consume most of the time, or most of the script (like long code for custom plots), into functions in a separate script or scripts.
Because I have to run the code on several datasets that each have their own folder, at the end I put any analysis output in the same folder, using an ID for each dataset.

In the interactive scripts the data is loaded in the same way, using the ID. So my file starts with something like:

ID <- 'aa1'
analysis <- 'gamm'


## Reference for the directory containing the processed data
floc <- paste0('~/data/', analysis, '/', ID)
## Reading Data
Data <- readRDS(paste0(floc, '/', ID, '_', analysis, '.RDS'))
# ...

This way, I avoid working in a workspace where very 'expensive' data lives (expensive computationally speaking: from weeks to months of dedicated desktop CPU time), and I can access the files very easily (I just need to specify which dataset and analysis I want). This may look a bit tedious to code, but it is very easy to use (and easy to code), and it keeps me sure that I have one and only one file with the desired processed dataset, in one and only one folder in use (I keep copies, but never access them with R). I process/analyse the different datasets using a bash script that calls the R scripts, so if I modify something, it will apply to all datasets.
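
For the non-interactive runs, the R side can be a thin wrapper; here's a minimal sketch, assuming the same file layout as above (run_analysis() is a placeholder for the real work):

## run_analysis.R -- called once per dataset by the bash script, e.g.:
##   Rscript run_analysis.R aa1 gamm
args     <- commandArgs(trailingOnly = TRUE)
ID       <- args[1]
analysis <- args[2]

floc   <- paste0('~/data/', analysis, '/', ID)
Data   <- readRDS(paste0(floc, '/', ID, '_', analysis, '.RDS'))
result <- run_analysis(Data)   # placeholder for the expensive analysis
saveRDS(result, paste0(floc, '/', ID, '_', analysis, '_result.RDS'))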

You will get an error if the file does not exist, but alternatively, and if you work interactively most of the time, you can just have some code that checks whether the file has been created or not (and if not, makes it) with the function file_test(), like:

if (file_test('-f', 'yourfile')) {
    Data <- readRDS('yourfile')
} else {
    ## analyse your data / generate 'yourfile', e.g.
    Data <- make_data()        # placeholder for the expensive step
    saveRDS(Data, 'yourfile')
}

What problems do you have with the RDS files?

cheers

I don't have a problem with RDS files per se, I just don't find the experience all that holistic...
I probably just need to code it better, write some functions, and store them in separate R scripts.

drake is really amazing for exactly this kind of workflow, and I've come to recognize data caching via readRDS() as my personal code smell for when I should use drake. I used to do something very similar, using a complicated system of essentially hand-rolled object caching built around simpleCache (which is a great package as well). But with minor modifications to your workflow you can just let drake handle all of the caching.

Here's a quick sketch of how you'd adapt the current workflow:

  1. Wrap relevant code in your scripts into functions, making sure that at a minimum the functions input and output data at the critical steps, i.e. whenever you would have called saveRDS() or readRDS().
  2. Create a drake plan to, in essence, track the inputs, dependencies and outputs of each step.
  3. Having done this, you can run the full analysis with make(). Or, if you only want certain parts of the analysis, you can make specific targets (a target = a step in the analysis) and drake will run just the parts that are out of date and required to produce that target; see the sketch after this list.
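
For instance, a minimal plan might look like this (the step functions and the file name are illustrative stand-ins, not part of drake's API):

library(drake)

## Hypothetical step functions wrapping the expensive parts of a script
load_raw  <- function(path) readRDS(path)
fit_model <- function(dat) lm(y ~ x, data = dat)   # stand-in for the slow step

plan <- drake_plan(
    raw = load_raw(file_in('data/aa1_gamm.RDS')),
    fit = fit_model(raw)
)

make(plan)                     # first run: builds everything
make(plan)                     # later runs: skips up-to-date targets
make(plan, targets = 'fit')    # build only one target and its dependencies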

There's a lot more that drake can do, and the documentation is excellent!

People have pointed me to drake before, but I kind of brushed it off as too complex for my simple needs. I feel like I need to invest a couple of hours into reading up on it and implementing it.

Me too! @wlandau did an excellent job with the documentation, and the examples in the drake book are both approachable and instructive. Of course there are plenty of rabbit holes you can go down if you want, but getting to the point of being able to use drake in my day-to-day was way easier than I expected.
