Best Practices for Organizing RMarkdown Projects

beckywest · September 18, 2017, 10:30pm

Often, when I create an RMarkdown report, I like to integrate my code chunks with my write-up. For example, I would include the code chunks for pulling and cleaning data in with the section describing my dataset. I would include chunks related to running and checking a model in the section that describes that model, etc. This makes it easy for me to move from writing code in a notebook-like fashion to writing up a more polished report. It also makes it easy to sort through my RMarkdown file to answer questions / make updates related to a specific aspect of the report.

However, things are not as easy when the order of my code does not match the order of my write-up. This is a particular problem in the opening section of my document (abstract / executive summary) where I like to summarize my findings. I often want to reference a specific statistic or result that is not created in the code until the end of my document. Rather than re-order my chunks, I have been “fixing” this issue by hard-coding a few values in the abstract section of my document. However, I know this is not best practice for creating a reproducible analysis.

Does anyone have suggestions or best practices about how to deal with this issue (reordering your code chunks, running external scripts, hardcoding a few pesky values, …), or any best practices in general for organizing a project whose results will need to be shared in a polished RMarkdown format?

nick · September 18, 2017, 11:50pm

The best option I'm aware of is the ref.label chunk option:

```{r first, echo=FALSE, results="hide", ref.label="last"}
```

```{r middle, echo=TRUE}
print(x)
```

```{r last, echo=TRUE}
x <- rnorm(1)
print(x)
```

Unfortunately, the output shows the downside that I'm not sure has a good workaround:

print(x)
## [1] -1.788384
x <- rnorm(1)
print(x)
## [1] 0.8485357

If you need to evaluate your code in the bottom half of the document (because it produces output), then it effectively has to run twice. I don't see an easy workaround for that. On the other hand, if you put your display code in a separate chunk from your analysis code, it won't be an issue. Alternatively, if your code is deterministic and you don't mind it being evaluated twice, you can just live with it.

Frank · September 19, 2017, 12:04am

As a poll:

Reordering code chunks
Running external scripts
Hardcoding a few pesky values
Doing something else

I run an external script, and it works pretty much as you'd expect:

Use the external script for all substantive analysis and to save a bunch of tables to .rdata or .rds.
Use the markdown doc to read the tables in and make graphs, tables, etc.

Not sure if that's heretical as far as reproducible research norms go, but it seems transparent enough to me and it makes my code easier to read and maintain.
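
A minimal sketch of that split, for readers who want to see the shape of it. The file name `analysis.R`, the object `coef_table`, and the temporary `.rds` location are all hypothetical stand-ins; a real project would use its own paths and models.

```r
# --- analysis.R (hypothetical name): the substantive analysis, run outside the report ---
fit <- lm(Sepal.Length ~ Species, data = iris)        # stand-in for the real model
coef_table <- as.data.frame(coef(summary(fit)))        # the table the report will show
rds_path <- file.path(tempdir(), "coef_table.rds")     # assumed location for the saved table
saveRDS(coef_table, rds_path)

# --- inside a chunk of the .Rmd: only read the saved table and present it ---
coef_table <- readRDS(rds_path)
print(round(coef_table, 3))
```

Because the report only reads saved objects, a value can be quoted in the abstract without caring where in the analysis it was computed.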

The ref.label option @nick pointed out looks handy, but I guess I still prefer separate docs rather than messing with analytical and publishing concerns all at once.

emilyriederer · September 20, 2017, 11:27am

This wouldn't work in every case, but what I do is save snapshots of my data or output as RDS files at key steps in my analysis. Then, I can load them earlier in the analysis than they are generated, or even in separate files. (The size of my reports generally leads me to use the modular solution described by Yihui on StackExchange.)

Of course, this requires more care to make sure that these snapshots are getting appropriately refreshed as part of your workflow. However, this can also be beneficial if you don't want to run some very long computations every single time you "knit" (I typically do this by altering eval = TRUE / FALSE parameters.)
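
One possible way to sketch the snapshot idea in plain R, assuming a hypothetical snapshot path and a made-up stand-in for the slow computation. In an actual .Rmd the same flag could instead be passed to a chunk as `eval = refresh`, since chunk options accept R expressions.

```r
snapshot <- file.path(tempdir(), "slow_result.rds")  # hypothetical snapshot location
refresh <- !file.exists(snapshot)                     # set to TRUE by hand to force a refresh

if (refresh) {
  # Stand-in for a long computation: column means of 200 simulated samples
  slow_result <- colMeans(replicate(200, rnorm(1000)))
  saveRDS(slow_result, snapshot)
}

# Every knit, including ones that skip the computation, reads the snapshot
slow_result <- readRDS(snapshot)
```

The cost is the one already noted: nothing here detects that upstream data changed, so the flag has to be flipped (or the file deleted) whenever the snapshot should be rebuilt.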

pschloss · September 20, 2017, 7:08pm

I find that RMarkdown documents can get pretty clunky when there's a lot going on with data munging, slow analysis steps, etc. This is especially so if I'm editing text and formatting and need to rerun render multiple times to get everything to look "just so". I prefer to use a Makefile to generate the various data files I need. Then I can read the data frames into my Rmd file and use that to focus on the text and formatting. I can also generate the rendered Rmd file from make.

taras · September 20, 2017, 7:10pm

I have a somewhat related question here.
I want to write an article in Rmarkdown, generating the graphs, but not showing any code whatsoever. That's fine, we can do that, that's done.

Now, if I wanted to include the R script at the bottom of the post (i.e. "for all interested, here is the code!"), how can I effectively do that?

I guess I could reference the rmarkdown file itself within the rmarkdown file (haven't tried it yet, will I break the space-time continuum??), but even if it works - the script already contains all the text that I don't need to display because people have already read it!

First, does anyone understand what I'm trying to say?
And second: does anyone know how to achieve this: have a clean output followed by all code?

tbradley · September 20, 2017, 7:24pm

According to this SO answer, the trick is to reference all your code chunks at the bottom of your document with eval = FALSE in the chunk options. Then for all the chunks embedded in the document you would set echo = FALSE

nick · September 20, 2017, 7:25pm

@tbradley, no need to copy/paste!

@taras, see the ref.label technique I posted before. I made it somewhat convoluted due to the different context, but in your case (assuming I understand your case), it works perfectly.

```{r result, echo=FALSE}
# Complicated calculation
x <- 1 + 1

# Outputted result
print(x)
```

Plain text between

```{r code, echo=TRUE, eval=FALSE, ref.label="result"}
```

This outputs:

## [1] 2
Plain text between

# Complicated calculation
x <- 1 + 1

# Outputted result
print(x)

taras · September 20, 2017, 7:55pm

Yeah, I figured your flow will work, but jumped the gun and posted the question before testing it!
Thanks! I will work with this!

Frank · September 20, 2017, 8:05pm

If you're willing to point them somewhere other than at the bottom, you could make a separate R-only file with knitr::purl("this_file.rmd").

Alternately, if it must be at the bottom, you could run purl and then read in the result (albeit missing all syntax highlighting):

```{r result, echo=FALSE, results="hide"}
# Complicated calculation
x <- 1 + 1
```

Words words words `r x`.

```{r code, echo=FALSE, comment=""}
# run this separately: knitr::purl("this_file.rmd", "bah.r")
cat(readLines("bah.r"), sep="\n")
```

pavopax · September 20, 2017, 10:42pm

I separate out data processing into R scripts, managed with simple shell scripts. It’s then easier to develop and version control these messy parts.

Then, my Rmd notebooks focus on the results, and include relevant text, which is what most of my stakeholders need/want.

Here is my workflow (paul branch):

Andrea · September 21, 2017, 7:22am

@pavopax If I understand correctly, you include only visualization code in your .Rmd file, right? All the analysis code is in the code/ folder, and then (presumably) you load .rds files in your .Rmd file. Correct? Would it be possible to see an actual example, or are all your projects confidential? On one hand, I like your approach a lot because it makes debugging a lot easier, and rendering the .Rmd a lot faster. Also versioning is much better, as you noted. On the other hand, it kind of defeats the concept of literate programming...if one's main reason for using RMarkdown is results visualization and sharing, this is great, but I feel it wouldn't work for literate programming. I'd really like to see an actual example though - I may be wrong.

pavopax · September 21, 2017, 2:51pm

Thanks for your curiosity @Andrea!

Indeed, all current analyses are confidential and not accessible. But I show some parts of a real current project in a quick gist.

You are correct! Rmd notebooks may include tables, and do include analysis results (like linear model results), when these analyses are not complicated or don't involve large data. I usually import CSVs (they are not large, and can then be shared easily), created with my munging code, as follows.

/code is usually for data munging, or "ETL" stuff: this is the messy and time consuming part. As you point out, .R files are easier to debug, version control, and collaborate on (using github), and then I just import the derived datasets into the Rmd notebooks, which are lean and run quickly.

This is a living workflow, which works well for me and my stakeholders and that is why it exists.

However, I know lots of team members who write long Rmd notebooks which include everything: data munging, outputs, results, etc, following the literate programming paradigm.

Andrea · September 21, 2017, 4:39pm

@pavopax Thanks! It's very clear now. I was initially fascinated by literate programming, but I'm gradually moving to an approach similar to yours. Keeping the complex code outside RMarkdown makes things easier.

StasK · September 22, 2017, 12:23am

@beckywest, have you tried playing with

```{r child='abstract.Rmd'}
```

```{r child='main.Rmd'}
```

abstract.Rmd might say

We found R^2 in our favorite regression to be `r rsq`.

while main.Rmd might say

The specification of interest was sepal length regressed on species:

```{r}
rsq <- summary(lm(Sepal.Length ~ Species, data=iris))$r.squared
```

Knitting would take several passes though.

davidhen · September 22, 2017, 8:07am

There is the ProjectTemplate package (http://projecttemplate.net/), but I prefer to use an adapted version of Joris Muller's project template here: http://blog.jom.link/implementation_basic_reproductible_workflow.html

Generally I try (and often fail) to keep the html files really short, i.e. one task per file, and save all the created objects (i.e. data_frames, models, plots) as .rds files. This gives me a clear (and hopefully reproducible) workflow of where, and how, objects were created.
I then create summary pdf documents and directly load the results I want into the report, e.g.

plot1 <- readRDS("path/to/plot1.rds")
plot1

pditty · September 23, 2017, 3:21am

Do you not like the code folding option with hide turned on? The user can click "show all" at the top of the page to read the report with code, or click on any given chunk to see code for that section only.

thoughtfulnz · September 23, 2017, 9:51pm

I tried 4 or 5 different approaches when working on this project

github.com/thoughtfulbloke/eqnz/blob/master/004_not_dependence.Rmd

# It is not dependence

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(error = FALSE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_chunk$set(dpi = 150)
knitr::opts_chunk$set(fig.width =  6)
knitr::opts_chunk$set(fig.height =  4)
```


```{r}
Sys.setenv(TZ = "UTC") 
library(dplyr)
library(lubridate)
# Assumes there is eqnz_processed data created in chapter 2
load("eqdata/eqnz_processed.RData")
```

The approaches included naming chunks and referring back to them, and declaring a .R file to be the content of a chunk.

In the end, the thing that showed what I wanted where I wanted it was to run a pre-knit external script that

purled each file to produce a script from the code
read in the Rmd as text to be processed
inserted the .R code in as a final eval=FALSE code chunk in the .Rmd
resaved the Rmd
knitted (in this case with bookdown) all the Rmds
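
The first four of those steps might look roughly like the sketch below for a single file. This is not the script from the eqnz project: the temporary files, the chunk label `all-code`, and the toy Rmd content are assumptions for illustration, and the result is written to a new file rather than overwriting the original.

```r
library(knitr)

fence <- strrep("`", 3)  # build the chunk fence so this script can itself sit inside a chunk

# Hypothetical stand-in for one chapter of the book
rmd_in <- tempfile(fileext = ".Rmd")
writeLines(c("# Title", "", paste0(fence, "{r}"), "x <- 1 + 1", fence, "", "Some text."), rmd_in)

# 1. purl the file to produce a script from its code (documentation = 0 keeps code only)
r_file <- tempfile(fileext = ".R")
purl(rmd_in, output = r_file, quiet = TRUE, documentation = 0)

# 2. read in the Rmd as text to be processed
rmd_text <- readLines(rmd_in)
code <- readLines(r_file)

# 3. insert the .R code as a final eval=FALSE chunk
appendix <- c("", paste0(fence, "{r all-code, eval=FALSE, echo=TRUE}"), code, fence)

# 4. save the result (to a new file here, so the source Rmd stays untouched)
rmd_out <- tempfile(fileext = ".Rmd")
writeLines(c(rmd_text, appendix), rmd_out)
```

The final step, knitting every processed Rmd, is left out because it depends on the project's own bookdown setup.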

duythonguyen · September 26, 2017, 2:21am

I prefer passing parameters to the document rather than doing data manipulation or modeling inside the document.