What is RStudio's solution to workflow management systems?

We have seen many customers have success using R Markdown + RStudio Connect to automate simple workflows.

As an example, this R Markdown document pulls in some stock data, cleans it, and writes the results to a database: https://colorado.rstudio.com/rsc/content/1032/Portfolios_ETL.html

The document can be deployed to RStudio Connect and scheduled.

The benefits of this approach are:

  1. The deployment to Connect automatically handles creating an environment with the proper R packages and a matched version of R. This can be a pain in general-purpose workflow runners.

  2. Scheduling is easy and is handled through a nice UI.

  3. Connect will email you if the task fails, and can optionally email you on success.

  4. Because we're using R Markdown, the ETL code is documented in-place, which is really handy. We can even create some quick graphs to visually check the results of our process over time. Connect keeps a history of past renders that you can scroll through.

The main limitation is that this scheduling does not account for directed acyclic graphs (DAGs). As an example, say you wanted to pull data, fit 10 different models, compare the model results, pull in some supplemental data, and then finally merge the results and supplemental data into a report. A DAG lets you represent each of those as a step, and also lets you arrange the steps in dependency order.

The benefit is that tools that run DAGs are often lazy. In this case, if one of the models fails and you then restart the process, a DAG tool will typically not re-run all the models, just the one that failed. Likewise, a DAG tool would usually be smart enough to know whether or not the supplemental data has changed. You could write similar functionality into an R Markdown document, but you'd be reinventing lots of wheels.
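To make the "lazy" behavior concrete, here is a minimal, language-agnostic sketch in Python. It is not how any particular DAG tool or RStudio Connect works; `run_dag`, the step names, and the `flaky` flag that forces one model to fail are all hypothetical, and the example uses 3 models instead of 10 to stay short. Completed steps are kept in a cache, so a restart only runs what failed or was blocked.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(steps, deps, cache):
    """Run steps in dependency order, skipping anything already in `cache`.

    steps: dict of step name -> function taking a dict of upstream results
    deps:  dict of step name -> set of upstream step names
    cache: dict of results from earlier runs (updated in place)
    Returns (executed, failed): the steps run in this call, and the steps
    that raised or were blocked by an upstream failure.
    """
    executed, failed = [], set()
    for name in TopologicalSorter(deps).static_order():
        if name in cache:
            continue  # lazy: this step finished in an earlier run
        upstream = deps.get(name, ())
        if any(d in failed for d in upstream):
            failed.add(name)  # blocked by an upstream failure
            continue
        try:
            cache[name] = steps[name]({d: cache[d] for d in upstream})
            executed.append(name)
        except Exception:
            failed.add(name)
    return executed, failed


# Hypothetical switch that makes model_b fail on the first run.
flaky = {"fail_model_b": True}


def make_model(name, scale):
    def fit(inputs):
        if name == "model_b" and flaky["fail_model_b"]:
            raise RuntimeError("model_b failed to converge")
        data = inputs["pull_data"]
        return scale * sum(data) / len(data)  # stand-in for a real model score
    return fit


steps = {
    "pull_data": lambda inputs: [1.0, 2.0, 3.0, 4.0],
    "model_a": make_model("model_a", 1),
    "model_b": make_model("model_b", 2),
    "model_c": make_model("model_c", 3),
    # Pick the model with the highest score.
    "compare": lambda inputs: max(inputs, key=inputs.get),
    "supplemental": lambda inputs: {"region": "US"},
    "report": lambda inputs: {"best": inputs["compare"], **inputs["supplemental"]},
}

deps = {
    "pull_data": set(),
    "model_a": {"pull_data"},
    "model_b": {"pull_data"},
    "model_c": {"pull_data"},
    "compare": {"model_a", "model_b", "model_c"},
    "supplemental": set(),
    "report": {"compare", "supplemental"},
}

cache = {}
first_run, first_failed = run_dag(steps, deps, cache)

flaky["fail_model_b"] = False  # pretend the underlying problem was fixed
second_run, second_failed = run_dag(steps, deps, cache)

print("first run executed:", sorted(first_run), "failed:", sorted(first_failed))
print("second run executed:", second_run)
```

On the first run, `model_b` fails, so `compare` and `report` are blocked while everything else completes. On the restart, only `model_b`, `compare`, and `report` execute; the data pull, the other models, and the supplemental data are not touched. Real DAG tools typically go further, for example by checking whether a step's inputs have changed before deciding to reuse a cached result, which this sketch does not attempt.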

Overall though, if you are getting by with taskscheduleR, you'd likely get a lot of mileage out of R Markdown and R Markdown + RStudio Connect.
