Filter dataframe to only included rows if all rows of the same id match the criteria

Hi,
For each ID i want to display all rows of the same id if all rows match with the condition,
so if at least one row is not matching, no row with the id must be displayed
so i have the following condition

'''
#Assume the following dataframe:

id <- c(10, 10, 11, 11, 3, 3)
instance <- c(1, 2, 1, 2, 1, 2)
response <- c(0, 0, 1, 1, 1, 1)
dt_start <- c('2021-05-11', '2021-05-11', '2021-05-11', '2021-05-11', '2021-09-15', '2021-09-15')
dt_end1 <- c('2015-09-09', '2021-09-14', '2028-09-07', '2027-09-14', '2021-10-14', '2021-10-19')
dt_end2 <- c('2021-09-13', '2021-11-16', '2030-09-10', '2025-09-09', '2021-11-10', '2021-09-14')
dt_end3 <- c('2021-09-10', '2021-05-10', '2029-09-19', '2030-09-09', '2029-05-19', '2023-09-14')

#I have the following condition:
library(dplyr)

null <- ""

result <-
filter(df,
response == 1 &
((dt_end1 == null) |
(dt_end1 > dt_start)) &
((dt_end2 == null) |
(dt_end2 > dt_start)) &
((dt_end3 == null) |
(dt_end3 > dt_start)))

view(df)

'''
But it return a row of ID 3 because one row match the condition and not the other one
the expected result is that display only the rows of ID 11

I was thinking of adding a count column and then putting a condition to display the result only if the total number of rows is equal to the number of rows that match the condition but I don't really know how to do it and I also think that there might be a simpler solution .

Thank's.

Is this what you are wanting ?

library(tidyverse)

null <- ""

df <- data.frame(id = c(10, 10, 11, 11, 3, 3),
instance = c(1, 2, 1, 2, 1, 2),
response = c(0, 0, 1, 1, 1, 1),
dt_start = c('2021-05-11', '2021-05-11', '2021-05-11', '2021-05-11', '2021-09-15', '2021-09-15'),
dt_end1 = c('2015-09-09', '2021-09-14', '2028-09-07', '2027-09-14', '2021-10-14', '2021-10-19'),
dt_end2 = c('2021-09-13', '2021-11-16', '2030-09-10', '2025-09-09', '2021-11-10', '2021-09-14'),
dt_end3 = c('2021-09-10', '2021-05-10', '2029-09-19', '2030-09-09', '2029-05-19', '2023-09-14'))


semi_result <-
  mutate(df,
    some_condition =
      response == 1 &
        ((dt_end1 == null) |
          (dt_end1 > dt_start)) &
        ((dt_end2 == null) |
          (dt_end2 > dt_start)) &
        ((dt_end3 == null) |
          (dt_end3 > dt_start))
  )

(interim_df <- semi_result %>% 
    group_by(id) %>% 
    summarise(
  aggregated_condition =
    all(some_condition)
))

interim_df %>%
  filter(aggregated_condition) %>%
  left_join(semi_result)
1 Like

yes thank you it's exactly what i was looking for but can you explain me this part please:

'''

(interim_df <- semi_result %>%
group_by(id) %>%
summarise(
aggregated_condition =
all(some_condition)
))

'''

because i don't understand why it keep the Value FALSE when there are one row TRUE and the other FALSE for the same ID

all(TRUE,FALSE) is FALSE

1 Like

Also, Can you please give me more details about this part:

interim_df %>%
  filter(aggregated_condition) %>%
  left_join(semi_result)

because I don't understand why it keeps only id's with value TRUE and how the join is done, is it done by id by default? or by another variable.

Thank you.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.