Hi, I'm starting my first independent research project. I would like to use reproducible research methods as much as possible, specifically:

- All code and data in a Docker container
- The paper itself will be written in R Markdown

I'm not sold on the "paper as a package" approach, but I may try it again.
I'm planning on compiling my main dataset from several sources; however, this process will involve data that contains private information that cannot be publicly released. The final product (with the sensitive data removed) can be released.
My best idea is to use two separate Docker containers: one for compiling the dataset and stripping out the sensitive data, and one for the actual analysis.
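For example (the image names and mount paths below are just placeholders, not a tested setup), the workflow might look something like:

```sh
# Build the two images from separate directories
docker build -t myproject-prepare ./prepare
docker build -t myproject-analysis ./analysis

# Container 1: reads the raw, sensitive data and writes out a de-identified copy
docker run --rm \
  -v /secure/raw-data:/data/raw:ro \
  -v "$PWD/data-public":/data/public \
  myproject-prepare

# Container 2: only ever sees the de-identified data, so it can be shared as-is
docker run --rm \
  -v "$PWD/data-public":/data/public:ro \
  -v "$PWD/output":/output \
  myproject-analysis
```

That way only the prepare image ever needs access to the raw data, and the analysis image plus the de-identified dataset are the parts that get published.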
Does anyone have any best practices for a situation like this?
How do you access the sensitive data? Are you securely connected to a database? Have you been mailed a disk (or computer) containing the data?
A big problem with reproducible methods on sensitive data is that the least reproducible part of the process is often upstream of where you receive the data. For example, you can have code that takes in a bunch of CSVs that have been prepared for you and turns them into a full report, only to realize that the way those CSVs were produced was never documented.
That aside, I think a simple R script would be sufficient for the step that turns the sensitive data into the public dataset.
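A minimal sketch of what that script could look like, assuming a single combined CSV and made-up column names:

```r
library(dplyr)
library(readr)

raw <- read_csv("data/raw/combined_raw.csv")

public <- raw %>%
  # drop direct identifiers
  select(-name, -address, -date_of_birth) %>%
  # swap the identifying key for a randomly assigned study number
  mutate(study_id = match(participant_id, sample(unique(participant_id)))) %>%
  select(-participant_id) %>%
  # coarsen quasi-identifiers, e.g. exact age into 5-year bands
  mutate(age_band = cut(age, breaks = seq(0, 100, by = 5))) %>%
  select(-age)

write_csv(public, "data/public/analysis_data.csv")
```

Keeping that script under version control, even if it can only be run inside the secure environment, at least documents exactly what was removed or coarsened before release.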