I have a dataset df. the dput (df) is
structure(list(Case_ID = c(23L, 23L, 24L, 24L), HR = c(120L,
110L, 130L, 115L), measurement_date = c("2020-12-10", "2020-12-13",
"2020-12-15", "2020-12-13"), admission_date = c("2020-12-10",
"2020-12-10", "2020-12-09", "2020-12-09")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
I want to remove duplicates for the variable HR and keep only one reading of HR per Case_ID, with the condition that the row that has the measurement_date closest to admission_date is kept while others are removed for that particular Case_ID.
Thanks in advance!
Hi,
Here is a way of doing that:
library(dplyr)
myData = structure(list(
Case_ID = c(23L, 23L, 24L, 24L),
HR = c(120L, 110L, 130L, 115L),
measurement_date = c("2020-12-10", "2020-12-13", "2020-12-15", "2020-12-13"),
admission_date = c("2020-12-10", "2020-12-10", "2020-12-09", "2020-12-09")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
myData = myData %>%
mutate(
measurement_date = as.Date(measurement_date),
admission_date = as.Date(admission_date)
)
myData = myData %>%
mutate(dateDiff = measurement_date - admission_date) %>%
group_by(Case_ID) %>%
filter(dateDiff == min(dateDiff))
myData
#> # A tibble: 2 x 5
#> # Groups: Case_ID [2]
#> Case_ID HR measurement_date admission_date dateDiff
#> <int> <int> <date> <date> <drtn>
#> 1 23 120 2020-12-10 2020-12-10 0 days
#> 2 24 115 2020-12-13 2020-12-09 4 days
Created on 2021-01-23 by the reprex package (v0.3.0)
Hope this helps,
PJ
Thanks so much for the help. I am using a big dataset and the filter for min (dateDiff) yields more than one case for the case_ID, if there are ties. I wanted to reduce it to only one case per case_ID. Also there were null values for the measurement_date in my data and I wanted to keep all the cases where measurement_date was null. So I did a little bit digging and modified your code slightly to get the desired result. Just wanted to post that.
library (dplyr)
myData= myData %>%
mutate(dateDiff = as.numeric( measurement_date - admission_date ))%>%
group_by(Case_ID) %>%
filter(dateDiff == min(dateDiff)|is.na (dateDiff))
myData=myData[!duplicated(myData1$Case_ID), ]
Again thanks for your help
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
If you have a query related to it or one of the replies, start a new topic and refer back with a link.