Question on managing long and heavy R code (one script VS multiple pieces)

taras · August 1, 2018, 2:17pm

I have a pretty long, multi-stage piece of R code, which takes a lot of resources and time to run.
In the development stage, I broke it down into multiple stages, with a separate R script for each stage. Each R script starts by loading an .RData file from the previous stage and saves another .RData file at the end for the next stage.

Basically, I'd first write the import part, save it. Then write the tidying part, save it. Then write the modeling part. Save it. And so on. It saved me a lot of time at the stages closer to the end.

Now that the code is almost done and ready to work, I am wondering:

if I should stitch it all back into one R script and remove the intermediate "checkpoints" (creating one reproducible piece of code), or keep it separate (since each stage produces some tangible result and potentially useful data, it would be easier for someone to pick up at any chosen stage instead of recreating everything from scratch);
in general, whether breaking down long code like this is good practice to begin with, and whether there is a better approach;
whether using .RData instead of csv is frowned upon (it was easier to save and later load .RData objects, as they preserve everything).

nutterb · August 1, 2018, 2:34pm

I do something similar with my computationally intensive scripts. I usually keep them in .Rmd format and will do something like the following, where the chunk does something that could take a long time, and then I save it to an .Rdata file. I leave a commented line for quick loading in my interactive session.

UsefulObject <- 3 # this is a very slowly assigned value of 3 :slightly_smiling_face:

save(UsefulObject, file = "some_file.Rdata")
# load("some_file.Rdata")

Whether or not I keep these all in one script or in multiple scripts varies from project to project. It has more to do with organizing my thoughts than anything else. If I feel like the tasks are distinct enough that I would understand the flow better by isolating them in different files, I do that. If I feel like isolating them will cause me to be toggling back and forth between multiple files, I keep them together. Admittedly, as a subjective process, I may make different decisions on different days.

EDIT: I should point out that the reason I use the .Rmd is to use the cache chunk option. Caching the chunk allows me to run the script, and then that chunk won't run again unless I change something in the chunk. It's effectively the same thing as isolating .R scripts with the option of keeping them all in one file.
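
For anyone who wants the same run-once behaviour in a plain .R script, here is a minimal base R sketch. The helper name `cached` and the file name are made up for illustration, and unlike the knitr cache option it does not notice when the code changes: you have to delete the file to force a recompute.

```r
# Hypothetical helper: compute a value once, then reuse the saved copy.
cached <- function(path, expr) {
  if (file.exists(path)) {
    return(readRDS(path))   # checkpoint found, so the slow code is skipped
  }
  value <- expr             # lazy evaluation: `expr` is only run here
  saveRDS(value, path)
  value
}

checkpoint <- file.path(tempdir(), "useful_object.rds")
unlink(checkpoint)          # start clean for the demo

runs <- 0
slow_step <- function() {
  runs <<- runs + 1         # count how often the slow code really runs
  Sys.sleep(0.1)            # stand-in for a long computation
  3
}

first  <- cached(checkpoint, slow_step())  # computes and saves
second <- cached(checkpoint, slow_step())  # reads the saved file instead
```

After both calls, `runs` is 1, because the second call never evaluates `slow_step()`.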

Fer · August 1, 2018, 3:26pm

I normally keep different parts of the analysis in different scripts: for example, one for data formatting / preprocessing functions, another for analytical functions, and another for plotting functions. Then I have an Rmd file to work with the data in a similar way to the one mentioned by @nutterb.
What I avoid is using .RData for saving parts of it. Instead, I use the .RDS format. Of course, careful use of .RData may be suitable, but in the long run it may cause problems.
For example, you may want to load an old file into your current session to do some new work and compare it with other results, and loading it may overwrite objects you already have. Working with .RDS files forces you to assign the file's contents to a new object and to give it an explicit name:

load('oldfile.Rdata') # it may load 'something' forgotten
## vs
oldfile <- readRDS('oldfile.RDS') # whatever it loads, it is associated with a new object

Also, assuming you are working mid/long term, your functions will improve over time. If you saved some sessions (the full workspace) a long time ago and then load them into a current one, they may overwrite the current functions with the old versions.

Of course, this can be avoided with some care, but I feel the RDS strategy is the safer one.
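
To make the difference concrete, here is a small self-contained sketch using temporary files. The object and file names are invented for the demo.

```r
rdata_file <- file.path(tempdir(), "old_session.RData")
rds_file   <- file.path(tempdir(), "old_model.rds")

# An "old" session saves an object called `model` in both formats.
model <- "old version"
save(model, file = rdata_file)
saveRDS(model, rds_file)

# Later, the current session has its own `model`.
model <- "new version"

# readRDS() returns the value, so we choose the name it gets.
old_model     <- readRDS(rds_file)
after_readRDS <- model            # current `model` is untouched

# load() restores objects under their saved names, replacing `model`.
loaded_names <- load(rdata_file)  # returns the names it restored
after_load   <- model             # current `model` has been replaced
```

If you do have to open an old .RData file, `load(rdata_file, envir = new.env())` lets you inspect its contents in a separate environment instead of the workspace.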

Leon · August 1, 2018, 4:12pm

Great question @taras! I gave a talk on this very subject, which I condensed into this overview.

Hope it can serve as inspiration and I will be happy to answer questions

cderv · August 1, 2018, 5:15pm

Hi @taras,

Have you thought about {drake} to organize your big project workflow?

There are also similar tools like {remake} for a Makefile-like workflow in R.

I find it pretty useful for organizing some not-so-small projects. {drake} will also help with heavy computation (it is parallel-ready) and with saving intermediate results (if I recall correctly). With the dependency graph, you won't have to rerun anything that has not changed. It is very well documented, too! You should give it a try if you think it suits your use case.
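
As a rough idea of what this looks like, here is a minimal sketch of a {drake} plan. The three targets and the use of `mtcars` are placeholders rather than anything from the original project, and it assumes {drake} is installed (its authors now point new projects to {targets}).

```r
library(drake)

# Each target is a named command; drake works out the dependencies
# from the target names used inside the other commands.
plan <- drake_plan(
  raw  = mtcars,                        # stand-in for the import stage
  tidy = raw[complete.cases(raw), ],    # stand-in for the tidying stage
  fit  = lm(mpg ~ wt, data = tidy)      # stand-in for the modeling stage
)

make(plan)                 # builds every target and caches it in .drake/
fitted_model <- readd(fit) # read one cached target back into the session
make(plan)                 # nothing changed, so the targets are not rebuilt
```

The cache plays the role of the hand-written checkpoints: editing the command for `fit` would rebuild only `fit`, while `raw` and `tidy` are reused.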

yonicd · August 1, 2018, 5:40pm

@taras you can also move from one long file with multiple functions to multiple files with a single function each, using sinew::untangle.

This is great for when you are trying to move from an idea that you tried out in a single file and break it up into an R package.

Any script in the body of the original single file that is not a function will be kept in an additional file called body.R.

This can let you manage more compartmentally the different parts of your script and get you on your way to making the workflow into a self contained package.

taras · August 1, 2018, 6:53pm

I like it! I haven't tried it yet, but looks like it is time to check it out!

taras · August 1, 2018, 6:54pm

I'm going to definitely check it out!

adamk · September 19, 2019, 5:55pm

This is a perennial problem of mine so I thought I'd chip in, over a year later. I see both an involved and a minimal (though still effective) solution to this.

I think drake is my dream solution, but an involved one. My first crack at it gave me the impression that I would have to change the way I write scripts, because it relies on calling functions on your data. I'm not an expert on that, and it might be able to just source your scripts one at a time. I just feel there's a bit of a paradigm shift involved.

Until you or I get there, I think the minimal solution is to give yourself some insurance against confusing things by documenting what the inputs and outputs are of each script in a README file. For example:

Step 1: dataset1.rds -> script1.R -> dataset2.rds & dataset3.rds
Step 2: dataset2.rds & dataset3.rds & dataset4.rds -> script2.R -> dataset5.rds
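
If you want the README to be checkable, the same table can be written as data and used by a small runner. This is only a sketch: the helper names (`p`, `run_step`) are made up, and the two scripts are tiny stand-ins written to a temporary folder so the example runs on its own.

```r
work_dir <- file.path(tempdir(), "pipeline_demo")
dir.create(work_dir, showWarnings = FALSE)
p <- function(name) file.path(work_dir, name)  # hypothetical path helper

# Stand-ins for the raw data and the real scripts.
saveRDS(1:10, p("dataset1.rds"))
saveRDS(100,  p("dataset4.rds"))
writeLines(c(
  'x <- readRDS(p("dataset1.rds"))',
  'saveRDS(x * 2, p("dataset2.rds"))',
  'saveRDS(x + 1, p("dataset3.rds"))'
), p("script1.R"))
writeLines(c(
  'total <- sum(readRDS(p("dataset2.rds")), readRDS(p("dataset3.rds")),',
  '             readRDS(p("dataset4.rds")))',
  'saveRDS(total, p("dataset5.rds"))'
), p("script2.R"))

# The README table, written as data.
steps <- list(
  list(script  = "script1.R",
       inputs  = "dataset1.rds",
       outputs = c("dataset2.rds", "dataset3.rds")),
  list(script  = "script2.R",
       inputs  = c("dataset2.rds", "dataset3.rds", "dataset4.rds"),
       outputs = "dataset5.rds")
)

run_step <- function(step) {
  missing_in <- step$inputs[!file.exists(p(step$inputs))]
  if (length(missing_in) > 0) {
    stop(step$script, " is missing inputs: ",
         paste(missing_in, collapse = ", "))
  }
  # Run the script in a fresh environment so stages do not share
  # variables by accident; they communicate only through the .rds files.
  source(p(step$script), local = new.env())
  missing_out <- step$outputs[!file.exists(p(step$outputs))]
  if (length(missing_out) > 0) {
    stop(step$script, " did not write: ",
         paste(missing_out, collapse = ", "))
  }
  invisible(step$outputs)
}

for (step in steps) run_step(step)
result <- readRDS(p("dataset5.rds"))
```

The runner stops with a clear message as soon as a script's declared inputs or outputs do not match what is on disk, which is the confusion the README is meant to guard against.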