I think there is a way of doing this using tidyr
and purrr
to accomplish this. Something alongside of loop and double loop. Maybe just my hunch. This is the closest I got.
If I understand OP's post, then the rules as follows.
For each ID :
- If Case has all 1 or 0, then take the oldest
Date_created
- If else, then take the oldest
Date_created
when the case
is 1
First, let's import our data.
library(tidyverse)
library(magrittr)
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
#> The following object is masked from 'package:tidyr':
#>
#> extract
foo_tbl <- tibble::tribble(
~ID, ~Order, ~Case, ~Date_created,
123456, 25800265, 1, "2018-06-27 07:40:23",
123456, 25800265, 1, "2018-06-25 05:29:23",
123456, 25800265, 0, "2018-07-26 06:16:28",
789454, 25906588, 1, "2018-07-12 05:59:50",
789454, 25906588, 0, "2018-07-12 07:41:29",
789454, 25906588, 0, "2018-07-10 05:43:45",
789454, 25906588, 0, "2018-07-09 05:59:26",
789454, 25906588, 0, "2018-07-05 10:39:45",
287541, 32140567, 0, "2018-07-12 07:41:29",
287541, 32140567, 0, "2018-07-10 05:43:45",
287541, 32140567, 0, "2018-07-09 05:59:26",
287541, 32140567, 0, "2018-07-05 10:39:45"
)
To check if the case contains all 1 or 0, I will need to know how much of ID (ID_n
) and the sum of case (case_sum
). If case_sum
matches ID_n
then It contains all 1, if 0 then all contains 0.
foo_tbl <- foo_tbl %>%
arrange(ID, Date_created)
by_ID <- foo_tbl %>%
group_by(ID) %>%
mutate(
ID_n = n(),
case_sum= sum(Case)
)
by_ID
#> # A tibble: 12 x 6
#> # Groups: ID [3]
#> ID Order Case Date_created ID_n case_sum
#> <dbl> <dbl> <dbl> <chr> <int> <dbl>
#> 1 123456 25800265 1 2018-06-25 05:29:23 3 2
#> 2 123456 25800265 1 2018-06-27 07:40:23 3 2
#> 3 123456 25800265 0 2018-07-26 06:16:28 3 2
#> 4 287541 32140567 0 2018-07-05 10:39:45 4 0
#> 5 287541 32140567 0 2018-07-09 05:59:26 4 0
#> 6 287541 32140567 0 2018-07-10 05:43:45 4 0
#> 7 287541 32140567 0 2018-07-12 07:41:29 4 0
#> 8 789454 25906588 0 2018-07-05 10:39:45 5 1
#> 9 789454 25906588 0 2018-07-09 05:59:26 5 1
#> 10 789454 25906588 0 2018-07-10 05:43:45 5 1
#> 11 789454 25906588 1 2018-07-12 05:59:50 5 1
#> 12 789454 25906588 0 2018-07-12 07:41:29 5 1
Then I told R to return ID based on that condition, and then filter the table according to each condition.
# Return ID when case_sum is equal to ID_n (all 1s) OR case_sum equals to 0 (all 0s)
ID_true <- by_ID %$% ID[(case_sum == ID_n | case_sum == 0)] %>% unique()
ID_true
#> [1] 287541
# Return ID for everything else above (has both 1 and 0 in 'case')
ID_false <- by_ID %$% ID[(case_sum != ID_n & case_sum != 0)] %>% unique()
ID_false
#> [1] 123456 789454
# for ID that has all 1 or 0 in 'Case'
ID_all_1_or_0 <- by_ID %>%
filter(ID %>% is_in(ID_true)) %>%
filter(Date_created == min(Date_created)) # Take the oldest date
# for ID that has both 1 and 0 in 'Case'
ID_both_1_and_0 <- by_ID %>%
filter(ID %>% is_in(ID_false)) %>%
filter(Case == 1) %>% # show only when the case is 1
filter(Date_created == min(Date_created)) # take the oldest date
Lastly, union them back.
ID_all_1_or_0 %>%
union(ID_both_1_and_0) %>%
select(-ID_n, -case_sum) %>%
arrange(ID)
#> # A tibble: 3 x 4
#> # Groups: ID [3]
#> ID Order Case Date_created
#> <dbl> <dbl> <dbl> <chr>
#> 1 123456 25800265 1 2018-06-25 05:29:23
#> 2 287541 32140567 0 2018-07-05 10:39:45
#> 3 789454 25906588 1 2018-07-12 05:59:50
Created on 2018-12-27 by the reprex package (v0.2.1)