Cleaning data across multiple columns/variables

trangility · November 21, 2020, 12:23am

Hi everyone, have a large dataset with two type of patient identifiers--a patient ID, and a unique identifier for each admission (for example, patient C who has been admitted 3 times will have 3 unique admission identifiers). I also have a column with outcomes Survived and Death, and race. How do I manipulate my data so that multiple admission are not counted, and ultimately, the eventual outcome of Survived vs Death for each patient is counted? Also interested in Race to see if there's any correlation. I have included some sample data and how I eventually want my table to look. Please help! Thank you!!

DF <- data.frame(
  Patient.ID = c("A", "B", "C", "C", "C", "D", "D"),
  Admit.ID = c("1Zz", "1Yy", "5Pp", "3Cc", "9Dd", "4Yy", "4Dd"),
  Race = c("White", "Black", "Asian", "Asian", NA, "Black", "Black"),
  Survived = c(1, 0, 1, 0, 1, 1, 1),
  Died = c(0, 1, 0, 1, 0, 0, 0))

DF_Cleaned <- data.frame(
  Patient.ID = c("A", "B", "C", "D"),
  Race = c("White", "Black", "Asian", "Black"),
  Outcome = c(0, 1, 1, 0))

^{Created on 2020-11-20 by the reprex package (v0.3.0)}

andresrcs · November 21, 2020, 12:35am

Does this work for you?

library(dplyr)

DF <- data.frame(
    Patient.ID = c("A", "B", "C", "C", "C", "D", "D"),
    Admit.ID = c("1Zz", "1Yy", "5Pp", "3Cc", "9Dd", "4Yy", "4Dd"),
    Race = c("White", "Black", "Asian", "Asian", NA, "Black", "Black"),
    Survived = c(1, 0, 1, 0, 1, 1, 1),
    Died = c(0, 1, 0, 1, 0, 0, 0))

DF %>% 
    group_by(Patient.ID) %>% 
    summarise(Race = first(na.omit(Race)),
              Outcome = if_else(sum(Died) > 0, 1, 0)
              )
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 3
#>   Patient.ID Race  Outcome
#>   <chr>      <chr>   <dbl>
#> 1 A          White       0
#> 2 B          Black       1
#> 3 C          Asian       1
#> 4 D          Black       0

^{Created on 2020-11-21 by the reprex package (v0.3.0.9001)}

nirgrahamuk · November 21, 2020, 1:13am

slight simplification is possible

DF %>% 
  group_by(Patient.ID) %>% 
  summarise(Race = first(na.omit(Race)),
            Outcome = last(Died)
  )

andresrcs · November 21, 2020, 1:22am

I didn't assume that the records are chronologically ordered but that makes sense.

trangility · November 21, 2020, 1:32am

Thank you both! The data is not chronologically order unfortunately, but using this worked perfectly!

DF %>% 
    group_by(Patient.ID) %>% 
    summarise(Race = first(na.omit(Race)),
              Outcome = if_else(sum(Died) > 0, 1, 0)
              )

system · November 28, 2020, 1:32am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.