best practices converting a series of scripts to drake

StasK · March 12, 2019, 3:20pm

I have discovered (the existence of) drake, and am contemplating converting my projects, which are heaps of R scripts and Rmd documents, some of them with a simple Makefile like this:

# build the whole project
all: merged.Rds report.pdf
# process the input data
merged.Rds: merge-data.R file1.csv file2.csv
    Rscript merge-data.R
# compile the output file from Markdown
report.pdf: merged.Rds report.Rmd
    Rscript -e "rmarkdown::render('report.Rmd')"
# clean up -- delete unneeded files
clean:
    rm -fv merged.Rdata

This would seem to correspond to a drake_plan like this:

plan <- drake_plan(
  merged_data = source("merge-data.R"),
  report = rmarkdown::render(knitr_in("report.Rmd"))
)

However this syntax loses the dependency of merged_data on file1.csv and file2.csv. In high brow terms, we call source() for its side effects of creating files, declaring functions (and drake_example("main") is arguably guilty of that too as one has to call functions.R exclusively for that sort of a side effect).

I thought of a very klunky fix along the lines of

plan <- drake_plan(
  merged_data = withAutoprint({
      readLines( file_in( "file1.csv" ) )
      readLines( file_in( "file2.csv" ) )
      source("merge-data.R")
      saveRDS( merged, file = file_out( "merged.Rds" ) )
    }),
  report = rmarkdown::render(knitr_in("report.Rmd"))
)

with the expectation that knitr_in() will figure out that report.Rmd loads up the merged file. But drake::make() stumbles upon source()... of all things... and cannot find the source for that.

I want to convert stuff with minimal effort, and without really doing anything inside the existing scripts, so that my collaborators could continue running them as is.

I reasonably expect that @krlmlr has a very good answer to this

P.S. Toy example:

library(here)
library(tibble)  # provides tibble()
tb1 <- tibble( i=1, x=1 )
tb2 <- tibble( i=1, y=1 )

write.csv(tb1, file=here("file1.csv"), row.names = FALSE )
write.csv(tb2, file=here("file2.csv"), row.names = FALSE )

merge-data.R reads:

### merge files
library(here)
library(dplyr)  # provides full_join()

tb1 <- read.csv(here("file1.csv") )
tb2 <- read.csv(here("file2.csv") )

full_join( tb1, tb2, by="i") -> merged

saveRDS( merged, file = here("merged.Rds") )

And the report.Rmd is

    ---
    title: "Report"
    author: "John Doe"
    date: "`r Sys.Date()`"
    output: html_document
    ---

    ```{r setup, include=FALSE}
    knitr::opts_chunk$set(echo = TRUE)
    library(here)
    ```

    Here are the results:

    ```{r print_merged}
    readRDS(here("merged.Rds"))
    ```

wlandau · March 12, 2019, 6:00pm

Good question. There is a chapter in the manual on setting up projects, and I recommend the section on script file pitfalls.

In traditional workflows, your code is a bunch of declarative scripts. But in drake, your scripts should mostly load packages and custom functions. In other words, most of your scripts prepare to do the work rather than actually doing the work directly.

Files

I would translate your toy example to these files:

make.R
report.Rmd
R/
├── packages.R
├── functions.R
└── plan.R
data/
├── file1.csv
└── file2.csv

make.R:

source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
make(plan)

packages.R:

library(drake)
library(purrr)
library(readr)

functions.R:

my_merge <- function(files) {
  map_df(files, read_csv)
}

plan.R:

plan <- drake_plan(
  merged_data = my_merge(file_in("data/file1.csv", "data/file2.csv")),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.pdf")
  )
)

report.Rmd:

---
title: "Report"
author: "John Doe"
output: pdf_document
---

Here are the results:

```{r print_merged}
library(drake)
readd(merged_data)
```

Above, the code chunk in report.Rmd makes use of readd() (see also loadd()). readd() and loadd() primarily load targets from the cache, but in knitr / R Markdown reports, they also serve to declare dependencies on targets.
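
Outside a report, the same two functions can be tried interactively. The following is only a minimal sketch: it assumes drake is installed and that the working directory is writable (make() creates a .drake/ cache there), and the one-row data frame is a made-up stand-in for the real merged data.

```r
library(drake)

# Hypothetical one-target plan; the data frame stands in for real merged data.
plan <- drake_plan(
  merged_data = data.frame(i = 1, x = 1, y = 1)
)

make(plan)  # builds merged_data and stores it in the .drake/ cache

# readd() returns the cached value of a target.
merged_copy <- readd(merged_data)

# loadd() assigns the cached value to a variable named after the target.
loadd(merged_data)

# Both routes give back the same object.
identical(merged_copy, merged_data)
```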

When you are ready, you can start a fresh R session and run make.R. Unless you take the time to set up a _drake.R file and call r_make(), I recommend batch mode for make.R.
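
A _drake.R file for the layout above might look roughly like the sketch below. It assumes the same R/ scripts that make.R sources and ends by returning a drake_config() object, which is what r_make() picks up; r_make() also relies on the callr package being installed.

```r
# _drake.R (sketch): same setup as make.R, but it ends with drake_config()
# instead of make(). r_make() runs this file in a fresh R process.
source("R/packages.R")
source("R/functions.R")
source("R/plan.R")
drake_config(plan)
```

With that file in the project root, calling drake::r_make() runs the workflow in a clean process. For batch mode without _drake.R, something like Rscript make.R from a terminal should be enough.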

StasK · March 12, 2019, 6:10pm

So before invoking drake magic, I would still need to run all of the R/*.R files (or wrap everything into a package so all of them get loaded on autopilot). I am sure your thought was put into NOT making these declarative scripts a part of drake plan; would that create circular dependencies, or are there other reasons?

wlandau · March 12, 2019, 6:35pm

So before invoking drake magic, I would still need to run all of the R/*.R files

Yes, drake::make() looks at the packages and functions in the current session, and it assumes you already ran all the setup scripts. Same goes for predict_runtime(), vis_drake_graph(), outdated(), and pretty much anything with a config or envir argument. Other functions like loadd() and readd() just need the cache, so they do not usually need to source() any setup scripts.

I am sure your thought was put into NOT making these declarative scripts a part of drake plan; would that create circular dependencies, or are there other reasons?

In Makefiles, targets and dependencies are files. But in drake, the dependencies are not scripts, but rather the functions and variables produced by the scripts. This may seem counterintuitive for people who are already familiar with pipeline tools, but it is a deliberate design choice, and it is part of what it means for drake to focus on R.
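
A small sketch of what "functions are the dependencies" means in practice. The helper double_it() and the target result are hypothetical names, and the example assumes drake is installed and the working directory is writable.

```r
library(drake)

# Hypothetical helper function and one-target plan.
double_it <- function(x) x * 2
plan <- drake_plan(result = double_it(21))

make(plan)  # builds `result`
make(plan)  # nothing changed, so `result` is left alone

# Change the function body (a real change, not just whitespace or comments).
double_it <- function(x) x * 2 + 1

make(plan)  # drake notices that double_it() changed and rebuilds `result`
result_after_change <- readd(result)
```

No file was touched between the second and third make() calls; the rebuild is triggered purely by the changed function in the session.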

So yes, R scripts should not be invoked in the plan itself. If a script creates new variables, running it as part of a target could create new/malformed dependency relationships well after drake thinks it has already figured out what to run when. Also, drake does not dive into file_in() files to hunt for dependencies, so it is likely to miss dependencies you mention in those scripts. Circularity is also a possibility.

(or wrap everything into a package so all of them get loaded on autopilot)

Good idea. People have certainly done this, and I do encourage it. However, it requires extra care. A package creates its own environment to put functions and data objects, so if you write a drake workflow as a formal R package, you will need to call expose_imports() to make those functions available to drake's dependency detection system.
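
A rough sketch of the expose_imports() step. The digest package is used here only as a stand-in for a package that holds your workflow functions, and hash_value() is a hypothetical wrapper; with your own package you would pass its name instead.

```r
library(drake)
library(digest)  # stand-in for a package containing your workflow functions

# Copy the package's functions into the current environment so that drake's
# code analysis can look inside them when it searches for dependencies.
expose_imports(digest)

# Hypothetical wrapper that calls a function from the exposed package.
hash_value <- function(x) digest(x)

plan <- drake_plan(hashed = hash_value("abc"))
make(plan)
hashed_value <- readd(hashed)
```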

wlandau · March 12, 2019, 6:43pm

TL;DR: scripts inside plans (e.g. drake_plan(your_target = file_in("script.R"))) fundamentally contradict drake's design philosophy. Even when they appear to work, I still do not recommend them.

You are not the first one to raise this issue (e.g. https://github.com/ropensci/drake/issues/193). If you know of a better way to explain it, please feel free to contribute to the manual (https://github.com/ropenscilabs/drake-manual).

To turn declarative scripts into plans and vice versa, you can use code_to_plan(), plan_to_code(), and plan_to_notebook().
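
As a sketch of the round trip, assuming drake is installed: the two-line script below is made up for illustration, and each of its top-level assignments is expected to become one target.

```r
library(drake)

# Write a tiny declarative script to a temporary file.
script <- tempfile(fileext = ".R")
writeLines(
  c(
    "small <- head(mtcars)",
    "fit <- lm(mpg ~ wt, data = small)"
  ),
  script
)

# Script -> plan: one row per assignment.
plan <- code_to_plan(script)
print(plan)

# Plan -> script, printed to the console here instead of written to a file.
plan_to_code(plan, con = stdout())
```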

StasK · March 12, 2019, 7:00pm

I think a heap of R scripts is a typical setup for most projects, so that's why mine is an FAQ. @kbroman's Makefile-driven projects are an exception :).

My primary project build background is Stata project (https://ideas.repec.org/c/boc/bocode/s457685.html). Somewhat like make, its dependency tracking principle is tracking files. Unlike make or drake, Stata project does not have a concept of goals. Instead, there is a master script that calls other scripts with Stata's analogue of source(), and it is each script's responsibility to declare its own dependencies with Stata's syntax analogue of project::creates( filename ), project::uses( filename ) or project::original( filename ). The project metadata, somewhat similar to the drake_config() object, is the set of dependencies (which file is being created by which script, and which files down the pipeline use it) and statuses (file dates, sizes and hash sums). So when project::build() is invoked, project loads the master script, checks which of the slave scripts have had their source code changed, or have had their input data dependencies changed, and reruns only the modified parts (or new scripts freshly added to the master).

JFYI -- that's a different workflow and a different thinking, just to explain where I am coming from.

wlandau · March 12, 2019, 7:49pm

Yeah, that makes sense. Stata's build system sounds file-oriented, which is the norm. You are not the only one with this kind of background.

I think drake is the one that needs explaining. I am confident in the design philosophy, but it is still weird. I am still figuring out how to teach it.

StasK · March 12, 2019, 8:00pm

Na-ah, it makes sense -- in the context of you wanting to make it very #rstats-y... purrr-y and what not. We as users are grateful to you for developing the toolset.

Does the size matter? If I have a handful of tibbles/data frames that are 2Gb each, would drake / storr caching mechanism keep them on disk, most of the time?

P.S. ask @krlmlr how he teaches that

krlmlr · March 12, 2019, 8:07pm

This is amazing. The Stata-like declarative workflow -- does it work for you in practice? I remember an early attempt at tackling that problem: https://github.com/krlmlr/darn.

As a stopgap for adopting drake, you could say e.g.

source_with_deps <- function(file, ...) {
  force(list(...))
  source(file)
  invisible()
}

and

drake_plan(
  munge = source_with_deps(file_in("10-munge.R")),
  model = source_with_deps(file_in("20-model.R"), munge)
)

but, as Will said, a more natural thing to do is to convert your scripts to pure functions and replace the load(), readRDS(), save() and saveRDS() calls etc. with function arguments and return values, respectively.
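
For the toy example from the original post, such a conversion could look like the sketch below. The function name merge_data() is hypothetical, and base R's merge(..., all = TRUE) is swapped in for dplyr's full_join() so the example runs without extra packages.

```r
# Hypothetical function version of merge-data.R: the inputs arrive as
# arguments and the merged table is returned instead of saved with saveRDS().
merge_data <- function(path1, path2) {
  tb1 <- read.csv(path1)
  tb2 <- read.csv(path2)
  # Base R stand-in for dplyr::full_join(tb1, tb2, by = "i")
  merge(tb1, tb2, by = "i", all = TRUE)
}

# Quick check with the toy data from the original post, written to temp files.
file1 <- tempfile(fileext = ".csv")
file2 <- tempfile(fileext = ".csv")
write.csv(data.frame(i = 1, x = 1), file1, row.names = FALSE)
write.csv(data.frame(i = 1, y = 1), file2, row.names = FALSE)

merged <- merge_data(file1, file2)
print(merged)
```

In a plan, this would presumably be called as merged_data = merge_data(file_in("file1.csv"), file_in("file2.csv")), so that drake tracks both the CSV files and the body of the function.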

wlandau · March 12, 2019, 8:10pm

Does the size matter? If I have a handful of tibbles/data frames that are 2Gb each, would drake / storr caching mechanism keep them on disk, most of the time?

2 GB each should work, but it will take time to serialize all that data. In make(verbose = 6), you will get messages that compare execution time to overall processing time. If you find that drake's cache takes too long, feel free to work around it with file_in() and file_out(). I intend to continue work on improving cache speed.
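
A sketch of that file_in() / file_out() workaround, assuming drake is installed and the working directory is writable. The helpers make_big_table() and summarize_table() and the file name big_table.rds are hypothetical, and the table here is tiny so the example runs quickly.

```r
library(drake)

# Hypothetical helpers; in a real project the table would be much larger.
make_big_table <- function() data.frame(id = 1:1000, value = sqrt(1:1000))
summarize_table <- function(tbl) mean(tbl$value)

plan <- drake_plan(
  # saveRDS() returns NULL invisibly, so the cache stores almost nothing for
  # this target; the data itself lives in big_table.rds.
  big_table_file = saveRDS(make_big_table(), file_out("big_table.rds")),
  # file_in() makes this target depend on the file written above.
  table_mean = summarize_table(readRDS(file_in("big_table.rds")))
)

make(plan)
table_mean_value <- readd(table_mean)
```

The trade-off is that the large object is now managed as an ordinary file, so drake only tracks a fingerprint of it rather than serializing it into the cache.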

P.S. ask @krlmlr how he teaches that

Yes, @krlmlr has contributed great teaching materials, as well as some of the best ideas in drake's API and high-performance computing functionality. You can thank him for conceiving of file_in(), file_out(), and knitr_in().

StasK · March 12, 2019, 8:44pm

@krlmlr I thought that Stata thingy worked fine for my purposes. (Mind you, this is not an official Stata command, but a really obscure third-party package... as obscure as library(drake).) At the very least, I was able to combine what that package offered with other tricks and tools to work for me. I think the distinction from darn is that Stata scripts cannot properly process the project package functionality, which is only available when the package is being built. At that stage, everything is under the control of project, and it knows which directory it is now in, and what script it is currently running. So the dependencies cannot be updated by running isolated scripts, which seems possible with darn. There are minor issues when timestamps are printed by default in the text output files by some of the external programs, etc., so you need to know what the workarounds are. But generally it fits well within Stata's procedural language concept.

@wlandau I will be exploring -- thank you for all the pointers!
