Removing duplicate cases with a condition

I have a dataset df. the dput (df) is
structure(list(Case_ID = c(23L, 23L, 24L, 24L), HR = c(120L,
110L, 130L, 115L), measurement_date = c("2020-12-10", "2020-12-13",
"2020-12-15", "2020-12-13"), admission_date = c("2020-12-10",
"2020-12-10", "2020-12-09", "2020-12-09")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
I want to remove duplicates for the variable HR and keep only one reading of HR per Case_ID, with the condition that the row that has the measurement_date closest to admission_date is kept while others are removed for that particular Case_ID.
Thanks in advance!

Hi,

Here is a way of doing that:

library(dplyr)

myData = structure(list(
  Case_ID = c(23L, 23L, 24L, 24L), 
  HR = c(120L, 110L, 130L, 115L), 
  measurement_date = c("2020-12-10", "2020-12-13", "2020-12-15", "2020-12-13"), 
  admission_date = c("2020-12-10", "2020-12-10", "2020-12-09", "2020-12-09")), 
  class = "data.frame", 
  row.names = c("1", "2", "3", "4"))

myData = myData %>% 
  mutate(
    measurement_date = as.Date(measurement_date),
    admission_date = as.Date(admission_date)
  )

myData = myData %>% 
  mutate(dateDiff = measurement_date - admission_date) %>% 
  group_by(Case_ID) %>% 
  filter(dateDiff == min(dateDiff))

myData
#> # A tibble: 2 x 5
#> # Groups:   Case_ID [2]
#>   Case_ID    HR measurement_date admission_date dateDiff
#>     <int> <int> <date>           <date>         <drtn>  
#> 1      23   120 2020-12-10       2020-12-10     0 days  
#> 2      24   115 2020-12-13       2020-12-09     4 days

Created on 2021-01-23 by the reprex package (v0.3.0)

Hope this helps,
PJ

Thanks so much for the help. I am using a big dataset and the filter for min (dateDiff) yields more than one case for the case_ID, if there are ties. I wanted to reduce it to only one case per case_ID. Also there were null values for the measurement_date in my data and I wanted to keep all the cases where measurement_date was null. So I did a little bit digging and modified your code slightly to get the desired result. Just wanted to post that.

library (dplyr)
myData= myData %>%
mutate(dateDiff = as.numeric( measurement_date - admission_date ))%>%
group_by(Case_ID) %>%
filter(dateDiff == min(dateDiff)|is.na (dateDiff))
myData=myData[!duplicated(myData1$Case_ID), ]

Again thanks for your help

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.