Filter dataframe to only included rows if all rows of the same id match the criteria

hich · September 27, 2021, 8:41am

Hi,
For each ID i want to display all rows of the same id if all rows match with the condition,
so if at least one row is not matching, no row with the id must be displayed
so i have the following condition

'''
#Assume the following dataframe:

id <- c(10, 10, 11, 11, 3, 3)
instance <- c(1, 2, 1, 2, 1, 2)
response <- c(0, 0, 1, 1, 1, 1)
dt_start <- c('2021-05-11', '2021-05-11', '2021-05-11', '2021-05-11', '2021-09-15', '2021-09-15')
dt_end1 <- c('2015-09-09', '2021-09-14', '2028-09-07', '2027-09-14', '2021-10-14', '2021-10-19')
dt_end2 <- c('2021-09-13', '2021-11-16', '2030-09-10', '2025-09-09', '2021-11-10', '2021-09-14')
dt_end3 <- c('2021-09-10', '2021-05-10', '2029-09-19', '2030-09-09', '2029-05-19', '2023-09-14')

#I have the following condition:
library(dplyr)

null <- ""

result <-
filter(df,
response == 1 &
((dt_end1 == null) |
(dt_end1 > dt_start)) &
((dt_end2 == null) |
(dt_end2 > dt_start)) &
((dt_end3 == null) |
(dt_end3 > dt_start)))

view(df)

'''
But it return a row of ID 3 because one row match the condition and not the other one
the expected result is that display only the rows of ID 11

I was thinking of adding a count column and then putting a condition to display the result only if the total number of rows is equal to the number of rows that match the condition but I don't really know how to do it and I also think that there might be a simpler solution .

Thank's.

nirgrahamuk · September 27, 2021, 9:48am

Is this what you are wanting ?

library(tidyverse)

null <- ""

df <- data.frame(id = c(10, 10, 11, 11, 3, 3),
instance = c(1, 2, 1, 2, 1, 2),
response = c(0, 0, 1, 1, 1, 1),
dt_start = c('2021-05-11', '2021-05-11', '2021-05-11', '2021-05-11', '2021-09-15', '2021-09-15'),
dt_end1 = c('2015-09-09', '2021-09-14', '2028-09-07', '2027-09-14', '2021-10-14', '2021-10-19'),
dt_end2 = c('2021-09-13', '2021-11-16', '2030-09-10', '2025-09-09', '2021-11-10', '2021-09-14'),
dt_end3 = c('2021-09-10', '2021-05-10', '2029-09-19', '2030-09-09', '2029-05-19', '2023-09-14'))


semi_result <-
  mutate(df,
    some_condition =
      response == 1 &
        ((dt_end1 == null) |
          (dt_end1 > dt_start)) &
        ((dt_end2 == null) |
          (dt_end2 > dt_start)) &
        ((dt_end3 == null) |
          (dt_end3 > dt_start))
  )

(interim_df <- semi_result %>% 
    group_by(id) %>% 
    summarise(
  aggregated_condition =
    all(some_condition)
))

interim_df %>%
  filter(aggregated_condition) %>%
  left_join(semi_result)

hich · September 27, 2021, 10:13am

yes thank you it's exactly what i was looking for but can you explain me this part please:

'''

(interim_df <- semi_result %>%
group_by(id) %>%
summarise(
aggregated_condition =
all(some_condition)
))

'''

because i don't understand why it keep the Value FALSE when there are one row TRUE and the other FALSE for the same ID

nirgrahamuk · September 27, 2021, 10:15am

all(TRUE,FALSE) is FALSE

hich · September 28, 2021, 2:19pm

Also, Can you please give me more details about this part:

interim_df %>%
  filter(aggregated_condition) %>%
  left_join(semi_result)

because I don't understand why it keeps only id's with value TRUE and how the join is done, is it done by id by default? or by another variable.

Thank you.

system · October 5, 2021, 2:20pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.