Hami
January 8, 2021, 6:39pm
1
I have a data with disease_diagnosis_date and inpatient_start_date, both are in date format. I want to filter the data for 2 conditions
disease_diagnosis_date was with in 2 weeks of inpatient_start _date
disease_diagnosis_date was with in 2 weeks before or on the day of diagnosis of inpatient_start_date
Can you please help me write the code? . Thanks.
Hi @Hami ,
Could you also provide a sample of your dataset. Without the data, it would be difficult to help you with the code. You could use the dput(your_data_here)
to share your data. If it is too big, then you could use a subset of a few rows: dput(your_data_here[1:50, ])
.
This great article here talk about how to make a reproducible example, which is what I am essentially asking you: FAQ: How to do a minimal reproducible example ( reprex ) for beginners
Hami
January 8, 2021, 6:51pm
3
yeh sure, Lets say I have a dataset named df
Disease_diagnosis date Inpatient_start_date
1 12/20/20 12/30/20
2 11/9/20 12/10/20
3 5/7/20 5/7/20
4 6/5/20 8/5/20
5 9/3/20 12/8/20
6 11/2/20 11/4/20
Please filter data based on these two conditions
disease_diagnosis_date was with in 2 weeks(before or after or on the day)of inpatient_start _date
disease_diagnosis_date was within 2 weeks interval before or on the day of diagnosis of inpatient_start_date
Thanks!
Hi @Hami ,
could you create the dataset in R and share it with dput()
please? I cannot use in my R session as it is posted at the moment.
Hami
January 9, 2021, 2:54am
5
Hi, I am new to R. I tried to use the dput function
structure(list(inpatient_start_date = c("12/30/20", "12/10/20",
"5/7/20", "8/5/20", "12/8/20", "11/4/20")), class = "data.frame", disease_diagnosis_date = c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20"))
Hope this would work.
These are the two steps in my code below:
create a column which calculates the time difference between the diagnosis and inpatient start dates
create two more columns (case_1
and case_2
) that test the conditions you explained in your question:
# Load packages
library(lubridate)
library(dplyr)
# Recreate the dataset and add time_difference co
mydata <- data.frame(
disease_diagnosis_date = mdy(c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20")),
inpatient_start_date = mdy(c("12/30/20", "12/10/20", "5/7/20", "8/5/20", "12/8/20", "11/4/20"))
)
# Create new columns
mydata %>%
mutate(
time_difference = as.numeric(inpatient_start_date - disease_diagnosis_date),
case_1 = time_difference <= 14 & time_difference > 0,
case_2 = time_difference <= 14 & time_difference >= 0
)
disease_diagnosis_date inpatient_start_date time_difference case_1 case_2
1 2020-12-20 2020-12-30 10 TRUE TRUE
2 2020-11-09 2020-12-10 31 FALSE FALSE
3 2020-05-07 2020-05-07 0 FALSE TRUE
4 2020-06-05 2020-08-05 61 FALSE FALSE
5 2020-09-03 2020-12-08 96 FALSE FALSE
6 2020-11-02 2020-11-04 2 TRUE TRUE
Your original question was about filtering the data, so here is the code:
# Load packages
library(lubridate)
library(dplyr)
# Recreate the dataset and add time_difference column
mydata <- data.frame(
disease_diagnosis_date = mdy(c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20")),
inpatient_start_date = mdy(c("12/30/20", "12/10/20", "5/7/20", "8/5/20", "12/8/20", "11/4/20"))
) %>%
mutate(time_difference = as.numeric(inpatient_start_date - disease_diagnosis_date))
# Filter for case 1
mydata %>%
filter(time_difference <= 14, time_difference > 0)
# Filter for case 2
mydata %>%
filter(time_difference <= 14, time_difference >= 0)
Hope this helps.
Hami
January 9, 2021, 5:32pm
7
Thanks so much for your reply. In my actual data, the disease_diagnosis_date and inpatient_start_date are in the dates format and I think thats why .numeric function is not working. If you think that can be a problem, is there a way to change date to numeric format and vice versa.
In the data I made above, these columns are also in the data format. You can still perform a subtraction without any problem.
Hami
January 9, 2021, 8:58pm
9
COVIDinpatientsunique%>% mutate(time_difference = as.numeric(inpatient_visit_start_date - measurement_date))
Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "SparkDataFrame"
I am getting this error in my original data. The inpatient_visit_start_date and measurement date are in date format. I have uploaded the lubridate and dplyr libraries. Any other suggestions. I may not have the most updated libraries and the platform I am working on would not let me update the libraries. Any ideas?
Thanks!
It seems that you are using the sparklyr
package to import your dataset?
system
Closed
January 23, 2021, 7:09am
11
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.