Looking for advice or repositories for RMarkdown data projects

realhiphop · October 22, 2018, 5:59pm

I'm a complete newbie to R Markdown. I've done a ton of reading on the different ways to set up R Markdown files, and have decided that I want to keep my data external and use .R scripts to load it. I then plan to build the tables for my analysis with code in the R Markdown file.

Two questions:

1. How does everyone list the source for a table (meaning which script the data comes from) in their R Markdown file?
2. Does anyone have any resources or repos on GitHub that they can point me to, so I can see a real-life data analysis project that uses raw data, R scripts, and R Markdown?

Thanks in advance.

clausp · October 22, 2018, 6:48pm

On 2: There is a whole book on R Markdown: R Markdown: The Definitive Guide. That is an excellent place to start.

On 1: Not quite sure what you are asking. You use ``` to enclose the chunks for R code inside R Markdown. Are you trying to incorporate a separate file?

olyerickson · October 22, 2018, 7:06pm

You need to start by actually reading the excellent, online R Markdown book, and esp. review the examples.

Before attempting with your own code, make sure you actually understand what the explained examples are doing!

jcblum · October 22, 2018, 7:34pm

Hi @realhiphop! Welcome!

On question 1, are you asking about how you would cite the source for your tables in your report text? Or how you cause the external scripts to run so that the data becomes available for your report code to beautify and present?

And can you explain a bit more about your workflow? Are your external data in a database? Some CSVs? Coming from a third-party API?

The question of how to incorporate data into an R Markdown workflow comes up a lot and there's no one right answer. Here's a previous discussion that might be of interest, now or later:

Best Practices for Reproducible Research - Should We Show Full Mapping of Raw Data to That Used in Research?

I'd like to follow up on this a little bit. I've had a look around at different resources and they all introduce the concept of reproducible research which makes sense. However, I'm having a hard time figuring out exactly which method is the "best" for defining data within the markdown document. In my example I have a script for wrangling a file that I imported from .xlsx which results in my final dataset. Should I copy this entire script into the markdown file and set include = FALSE? Should I save it with save.Rdata and use load()? Are there any other options that are "better"? Thank you! Split from Error in UseMethod("select_") : when trying to Knit Rmarkdown

Your question #2 is a good one, and I think a bit different from looking at synthetic examples (those are also important though!). But I fear people with relevant projects to share might not find this post because the title is fairly uninformative. Maybe consider editing it to focus on your specific questions? "Newbie" is often in the eye of the beholder anyway!

realhiphop · October 22, 2018, 9:02pm

Thanks so much.
I’ll give a little more detail on my analysis. My project is using sports data. The data comes in a few flavors:

API Pulls
.csv files
Data Frames that I’ve created and exported after cleaning up in dplyr
I also have R Scripts that I’ve created to do some of the analysis in addition to table joining between different data sources.
My plan is to use some of the scripts I’ve created that either get data, or synthesize data and load the resulting data frames into R Markdown. I’m then planning to beautify the results.
For question 1: the data will come from specific R scripts (for example, when I import .rds files containing data frames exported from an R script).
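
A minimal sketch of that round trip, assuming invented file and column names (`clean_games.R`, `games_clean.rds`, `team`, `points` are all hypothetical): a script writes a data frame with `saveRDS()`, and a chunk in the R Markdown file reads it back with `readRDS()` before building a table.

```r
# --- in a hypothetical cleaning script, e.g. clean_games.R ---
# Toy data standing in for the cleaned sports data (invented values).
games <- data.frame(
  team   = c("A", "B", "C"),
  points = c(101, 95, 110),
  stringsAsFactors = FALSE
)

# A temp directory keeps the sketch self-contained; a real project
# would more likely write to a folder inside the project.
rds_path <- file.path(tempdir(), "games_clean.rds")
saveRDS(games, rds_path)

# --- in a chunk of the R Markdown file ---
games_loaded <- readRDS(rds_path)

# Build the table for the report from the loaded data frame,
# here simply sorted by points in descending order.
games_table <- games_loaded[order(-games_loaded$points), ]
print(games_table)
```

Because `readRDS()` returns the object rather than restoring it under its old name, the report chunk decides what the data frame is called.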

jlacko · October 23, 2018, 6:54am

The workflow that works for me (as usual - your mileage may vary) is the following:

Start the Rmd document with an init chunk, which runs very silently:

{r init, echo = FALSE, eval = TRUE, message = FALSE, warning = FALSE}

This chunk loads all the data - be it from csv, database or by sourcing other R files.

It does so quietly - so the output does not find its way into the final document - and therefore it has messages and warnings turned off.
In addition I am often forced to wrap its content in capture.output( { ... }, file = '/dev/null') to stop any output bleeding into my final document.

As this chunk has messages and warnings suppressed, I found it good practice to limit it to loading data and making sure all database connections are closed. Full stop.

Continue with other chunks that do the real work, using the data frames loaded earlier.

It is normal that something breaks in the "other" chunks from time to time. That is life. But I found it advantageous to make sure the data is loaded in isolation, and that I am not left with any open database connections when stuff breaks.
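
A rough sketch of what the body of such an init chunk might contain, under some assumptions: `load_quietly()` and `read_scores()` are hypothetical helpers, the CSV is a stand-in written to a temp directory, and `nullfile()` (available from R 3.6.0) replaces the hard-coded '/dev/null' so the same code also runs on Windows. A plain file connection stands in for a database connection; the `on.exit()` pattern is the part that carries over.

```r
# Stand-in data file so the sketch runs on its own (invented values).
csv_path <- file.path(tempdir(), "scores.csv")
write.csv(data.frame(team = c("A", "B"), score = c(3, 1)),
          csv_path, row.names = FALSE)

# Hypothetical helper: run a loading function with printed output,
# messages and warnings all discarded.
load_quietly <- function(loader) {
  result <- NULL
  utils::capture.output(
    result <- suppressMessages(suppressWarnings(loader())),
    file = nullfile()  # portable stand-in for '/dev/null'
  )
  result
}

# Hypothetical loader that opens a connection and always closes it,
# even if reading fails part-way. With a database, the on.exit() call
# would hold the disconnect instead of close().
read_scores <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  print("reading scores")  # noisy output that should not reach the report
  read.csv(con)
}

scores <- load_quietly(function() read_scores(csv_path))
```

Later chunks can then work with `scores` directly, and a failure in those chunks leaves no connection open because the loader has already cleaned up after itself.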

realhiphop · October 24, 2018, 7:21pm

Was hoping to get some hits after changing the title!

With regard to sources, I was looking for advice. I have scripts for data acquisition, and scripts that do analysis with that data. My plan is to read in the relevant data frames from .rds files.

I've been thinking about how to document where each .rds file came from in the R Markdown file, so that if I pick up the file next year, I know which .R script produced the .rds file and can run it to refresh the data.

Is adding a comment with the name of the .R script above the line that loads the .rds file the best approach?

I'm trying to get better about my documentation practices as I build scripts.
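
For illustration, the comment-based approach could look like the sketch below, shown next to one possible alternative: storing the script name as an attribute on the data frame before saving it. All names here (`R/get_player_stats.R`, `player_stats.rds`, the `source_script` attribute) are hypothetical, and this is only one way to do it, not an established convention.

```r
# --- in a hypothetical acquisition script, R/get_player_stats.R ---
# Toy data with invented values.
player_stats <- data.frame(player = c("X", "Y"), goals = c(3, 5))

# Record where the data came from and when it was created.
attr(player_stats, "source_script") <- "R/get_player_stats.R"
attr(player_stats, "created") <- Sys.Date()

# Temp directory used only to keep the sketch self-contained.
rds_path <- file.path(tempdir(), "player_stats.rds")
saveRDS(player_stats, rds_path)

# --- in a chunk of the R Markdown file ---
# Source: R/get_player_stats.R (re-run that script to refresh the data)
player_stats <- readRDS(rds_path)

# The provenance travels with the file, so it can be checked later.
source_script <- attr(player_stats, "source_script")
print(source_script)
```

The comment is visible to anyone reading the Rmd, while the attribute survives even if the .rds file is copied elsewhere. One caveat: some data frame operations may drop custom attributes, so the attribute is best read right after loading.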