filter() with if_else() or something else?

budugulo · April 1, 2023, 10:26am

> df_toy
# A tibble: 15 × 3
    week completion          cat  
   <int> <dttm>              <chr>
 1     2 2023-01-25 17:05:48 0000 
 2     2 2023-02-08 17:07:45 0000 
 3     2 2023-02-14 19:04:45 0000 
 4     3 2023-02-01 15:29:50 0000 
 5     3 2023-02-08 17:08:44 0000 
 6     3 2023-02-14 19:05:45 0000 
 7     4 2023-02-08 17:09:48 0000 
 8     4 2023-02-14 03:12:57 0000 
 9     4 2023-02-14 19:06:46 0000 
10     8 2023-03-08 18:21:53 0000 
11     8 2023-03-13 21:04:04 0000 
12     4 2023-02-08 19:00:47 PPPP 
13     4 2023-02-09 13:47:45 PPPP 
14     4 2023-02-14 21:30:45 PPPP 
15     8 2023-03-08 20:57:59 PPPP

Given the above tibble, how do I keep the maximum of completion when week is in the vector weeks_special and cat == "0000"; similarly, how can I keep the minimum of completion when week is not in the vector weeks_special but cat == "0000"?. For other cat I want to keep all the observations. I want to achieve this by week.

Suppose weeks_special <- c(4, 8), then my desired output is:

# A tibble: 8 × 3
   week completion          cat  
  <int> <dttm>              <chr>
1     2 2023-01-25 17:05:48 0000 
2     3 2023-02-01 15:29:50 0000 
3     4 2023-02-14 19:06:46 0000 
4     8 2023-03-13 21:04:04 0000 
5     4 2023-02-08 19:00:47 PPPP 
6     4 2023-02-09 13:47:45 PPPP 
7     4 2023-02-14 21:30:45 PPPP 
8     8 2023-03-08 20:57:59 PPPP

library(tidyverse)
# Toy data----
## vector====
weeks_special <- c(4, 8)
## df====
df_toy <- structure(list(week = c(2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 8L, 
                        8L, 4L, 4L, 4L, 8L), completion = structure(c(1674666348, 1675876065, 
                                                                      1676401485, 1675265390, 1675876124, 1676401545, 1675876188, 1676344377, 
                                                                      1676401606, 1678299713, 1678741444, 1675882847, 1675950465, 1676410245, 
                                                                      1678309079), tzone = "", class = c("POSIXct", "POSIXt")), cat = c("0000", 
                                                                                                                                        "0000", "0000", "0000", "0000", "0000", "0000", "0000", "0000", 
                                                                                                                                        "0000", "0000", "PPPP", "PPPP", "PPPP", "PPPP")), row.names = c(NA, 
                                                                                                                                                                                                        -15L), class = c("tbl_df", "tbl", "data.frame"))

scottyd22 · April 1, 2023, 12:09pm

Below is one approach to achieve the desired outcome using case_when(). Some of my dates/times are different than what you shared, so I included df_toy at the beginning.

df_toy
#> # A tibble: 15 × 3
#>     week completion          cat  
#>    <int> <dttm>              <chr>
#>  1     2 2023-01-25 12:05:48 0000 
#>  2     2 2023-02-08 12:07:45 0000 
#>  3     2 2023-02-14 14:04:45 0000 
#>  4     3 2023-02-01 10:29:50 0000 
#>  5     3 2023-02-08 12:08:44 0000 
#>  6     3 2023-02-14 14:05:45 0000 
#>  7     4 2023-02-08 12:09:48 0000 
#>  8     4 2023-02-13 22:12:57 0000 
#>  9     4 2023-02-14 14:06:46 0000 
#> 10     8 2023-03-08 13:21:53 0000 
#> 11     8 2023-03-13 17:04:04 0000 
#> 12     4 2023-02-08 14:00:47 PPPP 
#> 13     4 2023-02-09 08:47:45 PPPP 
#> 14     4 2023-02-14 16:30:45 PPPP 
#> 15     8 2023-03-08 15:57:59 PPPP

df_toy |>
  group_by(week, cat) |>
  mutate(keep = case_when(
    cat != '0000' ~ 1,
    cat == '0000' & week %in% weeks_special & completion == max(completion) ~ 1,
    cat == '0000' & !week %in% weeks_special & completion == min(completion) ~ 1,
    TRUE ~ 0
  )) |>
  ungroup() |>
  filter(keep == 1) |> 
  select(-keep)
#> # A tibble: 8 × 3
#>    week completion          cat  
#>   <int> <dttm>              <chr>
#> 1     2 2023-01-25 12:05:48 0000 
#> 2     3 2023-02-01 10:29:50 0000 
#> 3     4 2023-02-14 14:06:46 0000 
#> 4     8 2023-03-13 17:04:04 0000 
#> 5     4 2023-02-08 14:00:47 PPPP 
#> 6     4 2023-02-09 08:47:45 PPPP 
#> 7     4 2023-02-14 16:30:45 PPPP 
#> 8     8 2023-03-08 15:57:59 PPPP

Created on 2023-04-01 with reprex v2.0.2

budugulo · April 1, 2023, 4:02pm

@scottyd22 Many thanks for the solution!

I have also come up with a solution. But your one is much neater, I suppose.
Anyway, just posting it for future reference.

# using if_else
df_toy %>%
  group_by(week, cat) %>%
  filter(
    if_else(
      cat == "0000" & week %in% weeks_special, 
      completion == max(completion), 
      # else: 
      if_else(
        cat == "0000" & !(week %in% weeks_special), 
        completion == min(completion),
        TRUE
      )
    )
  ) %>%
  ungroup()
#> # A tibble: 8 × 3
#>    week completion          cat  
#>   <int> <dttm>              <chr>
#> 1     2 2023-01-25 17:05:48 0000 
#> 2     3 2023-02-01 15:29:50 0000 
#> 3     4 2023-02-14 19:06:46 0000 
#> 4     8 2023-03-13 21:04:04 0000 
#> 5     4 2023-02-08 19:00:47 PPPP 
#> 6     4 2023-02-09 13:47:45 PPPP 
#> 7     4 2023-02-14 21:30:45 PPPP 
#> 8     8 2023-03-08 20:57:59 PPPP

^{Created on 2023-04-01 with reprex v2.0.2}

system · April 8, 2023, 4:03pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.