RMarkdown with PHI/sensitive data

jenniferthompson · April 1, 2021, 10:03pm

(forgive me if this has been asked before, a search didn't turn up an answer!)

I'm at a healthcare company where we do literally all our development remotely, by SSHing into an AWS server. In addition to other reasons, we do this to ensure all our sensitive health data is protected and secure - there's no need to do anything locally.

This all works mostly great, except I really miss being able to create RMarkdown documents as a data product. Though we could connect local instances of RStudio to our data warehouse, it creates an opportunity for sensitive data to be stored on local machines, which is something we actively try to prevent.

Do others working with PHI or other sensitive data have this issue? How have you solved it? Our best reporting solution right now is JupyterHub, which does the job, but I really miss RMarkdown.

MyKo101 · April 1, 2021, 10:20pm

I work with a very similar setup using PHI data from clients on AWS. Our team have RStudio server set up on our EC2 instances. This way, the data stays on the instance and isn't transferred to our local machines. RMarkdown works perfectly well. Unfortunately, as I don't work within the Data Engineering team, I can't give any information about how to set this up. I did find this link which may be a good starting point.

jenniferthompson · April 1, 2021, 10:24pm

That's fantastic, thank you very much! Definitely passing along to our DE team.

jenniferthompson · April 1, 2021, 10:26pm

Just curious @MyKo101 - do you have the commercial or open source version of RStudio server?

MyKo101 · April 1, 2021, 10:31pm

I'm not sure. I assume the commercial version since we're using it within an organisation and not for personal use.

technocrat · April 2, 2021, 12:03am

@MyKo101 has the best long-term solution. A workaround is creating a keyfield for use in reporting. The idea is that even though data is being processed locally, it's not personally identifiable unless somehow the combination of variables can be uniquely traced to a handful of individuals.
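To make the keyfield idea concrete, here is a minimal sketch in Python, not a vetted de-identification method. It assumes the keyfield is derived on the server with a keyed hash (HMAC-SHA256) before any data leaves it. The names `make_keyfield` and `pseudonymize`, and the `mrn`, `name`, and `county` fields in the usage note, are hypothetical and only for illustration.

```python
import hashlib
import hmac


def make_keyfield(identifier: str, secret: bytes) -> str:
    """Derive a stable surrogate key from an identifier with HMAC-SHA256.

    The same identifier and secret always give the same key, so records can
    still be joined, but the key cannot be reversed without the secret.
    """
    digest = hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability (an arbitrary choice).
    return digest.hexdigest()[:16]


def pseudonymize(records, id_field, phi_fields, secret):
    """Return copies of the records with the identifier replaced by a
    keyfield and the listed PHI fields dropped."""
    cleaned = []
    for record in records:
        row = {
            name: value
            for name, value in record.items()
            if name != id_field and name not in phi_fields
        }
        row["keyfield"] = make_keyfield(str(record[id_field]), secret)
        cleaned.append(row)
    return cleaned
```

Calling `pseudonymize(rows, "mrn", {"name"}, secret)` on a list of dicts would keep a column such as `county` and add `keyfield`, while dropping `mrn` and `name`. The secret would have to stay on the server, and the remaining columns still need checking for combinations that could single out a handful of individuals.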

Presumably, your reports are roll-ups, reporting data in aggregate into relatively large categories (all patients in King County, WA, for example). Also, with moderately well-provisioned local workstations, the data can be kept in memory without writing to disk, and so should be flushed if you don't save .RData. This means that the user should get into the habit of saving scripts only, not source data and scripts: recreating a report from the source data over a secure connection is safer than saving the source data locally.

This requires a level of workstation security that anyone dealing with PHI or similar data should be maintaining. This is, of course, more difficult on some operating systems than others.

Ultimately, I do agree that setting up virtual access for all storage and processing is preferable. With GB-bandwidth, speed should no longer be an issue at all.

jenniferthompson · April 2, 2021, 1:22am

Thanks @technocrat! We'd definitely enforce good practices; very on board with your workflow points, and all our data has non-identifiable keys in addition to identifiable info. We've talked about using masking and other long-term solutions to reduce who can access PHI fields, but we're not there yet, so we're trying to figure out what's possible with our current setup. Given what would be possible (even if we're careful), local use of RStudio (or any tool) is definitely not encouraged. JupyterHub has, as you say, been our virtual solution, and has worked well for security purposes (and to your point, we've seen no issues with speed/connectivity). It's just less optimal for stakeholder-facing reproducible analyses!
