I have a dataset that I am trying to tidy.
I'm using Hadley's paper -- and the corresponding R4DS chapter -- as a guide. However, I'm not seeing a one-to-one correlation between my situation and the examples in the article/chapter. So, I would appreciate a little feedback from the community.
Setup:
- I have a relatively large dataset of electronic health records.
- For each patient encounter, there are multiple rows. The number of rows is equal to the number of combinations of several variables that are able to take multiple values. In the example below, I'm using employee id (emp_id), medical history (med_hist), and current medications (meds).
- The number of values varies from encounter to encounter.
Here is a small illustrative example:
library(tidyverse)
df <- tibble::tibble(
encounter_id = c(rep(101, 12), rep(102, 2)),
date_entered = as.Date(rep("2015-09-17", 14)),
gender = c(rep("f", 12), rep("m", 2)),
emp_id = c(rep(119, 6), rep(430, 6), 286, 559),
med_hist = c(rep(c(rep("diabetes", 3), rep("hypertension", 3)), 2), rep("mi", 2)),
meds = c(rep(c("asa", "avandamet", "claritin"), 4), rep("lipitor", 2))
) %>%
print()
# A tibble: 14 x 6
encounter_id date_entered gender emp_id med_hist meds
<dbl> <date> <chr> <dbl> <chr> <chr>
1 101 2015-09-17 f 119 diabetes asa
2 101 2015-09-17 f 119 diabetes avandamet
3 101 2015-09-17 f 119 diabetes claritin
4 101 2015-09-17 f 119 hypertension asa
5 101 2015-09-17 f 119 hypertension avandamet
6 101 2015-09-17 f 119 hypertension claritin
7 101 2015-09-17 f 430 diabetes asa
8 101 2015-09-17 f 430 diabetes avandamet
9 101 2015-09-17 f 430 diabetes claritin
10 101 2015-09-17 f 430 hypertension asa
11 101 2015-09-17 f 430 hypertension avandamet
12 101 2015-09-17 f 430 hypertension claritin
13 102 2015-09-17 m 286 mi lipitor
14 102 2015-09-17 m 559 mi lipitor
In this toy example, I have 14 rows for just two hypothetical encounters. In the real data, I have some encounters with 50+ rows. This just doesn't feel efficient/tidy to me.
Having said that, I believe that in its current form:
- Each variable is placed in its own column.
- Each observation (defined as combination of encounter/emp_id/med_hist/meds) is its own row.
- Each value is placed in its own cell.
As an extra wrinkle, I need to join this dataset with another dataset that has one or more rows per encounter. This currently results in an even larger dataset with all possible combinations of variables. Therefore, it would be nice to have a single row for each encounter.
Another option for organizing this data that occurred to me was to reshape the data to a wide format. For example, have a variable called med_hist_1, med_hist_2 ... med_hist_n (repeat for emp_id, meds, etc). That would result in each row corresponding to a single encounter. However, it would also be inefficient in the sense that it would add a lot of NA values to the data.
So, I would appreciate your thoughts about whether or not this data is already in a "tidy" format or if it should be reorganized. If so, how?