Case_when in combination with other dplyr verbs

dpolychr · October 26, 2017, 10:04pm

Dear all,

I have the following tibble my_tbl:

Groups:   group_id [1]
    group_id process
       <chr>   <chr>
 1 111000820       1
 2 111000820       1
 3 111000820       1
 4 111000820       1
 5 111000820   10000
 6 111000820   10000
 7 111000820   10000
 8 111000820   10000
 9 111000820   10001
10 111000820   10001
11 111000820   10001
12 111000820   10001
13 111000820   10002
14 111000820   10002
15 111000820   10002
16 111000820   10002
17 111000820   10003
18 111000820   10003
19 111000820   10003
20 111000820   10003

It is grouped by group_id column. Here only one group is displayed for simplicity.
What I would like to do in pseudocode is:

(for every group) for the processes that are > 10000, retrieve the max process, i.e. 10003 (and also the processes that are 1)

Something like:

1 111000820 1
2 111000820 1
3 111000820 1
4 111000820 1
5 111000820 10003
6 111000820 10003
7 111000820 10003
8 111000820 10003

(I have other columns as well in this tibble which I am not displaying here)

It would be nice and intuitive for me to combine case_when with filter and do sth like:

my_tbl %>% group_by(group_id) %>% mutate(process = case_when(process >= 10000 ~ filter(process == max(process)), TRUE ~ process)) %>% filter(group_id == "111000820")

but i know this does not work.
Any suggestions would be greatly appreciated.

Thanks,
Dimitris

nick · October 26, 2017, 10:26pm

I'm not 100% sure what you mean by "retrieve the max process", but you will likely get what you want by replacing filter(process = max(process)) with max(process). Filter turns a large data frame into a smaller data frame, and I don't think you actually want to replace multiple rows of the process column with embedded data frames.

dpolychr · October 26, 2017, 10:51pm

Hello Nick,

By max(process) i mean 10003 in this case (I edited the question for that to be more clear).
I have tried this before and now as you suggested:

my_tbl %>% group_by(group_id) %>% mutate(process = case_when(process >= 10000 ~ max(process), TRUE ~ process)) %>% filter(group_id == “111000820”)

but I get all 10003 (this command substitutes all that are greater than 10000 to 10003). That would not be what I want. Instead what I would like would be to return only the following:

1 111000820 1
2 111000820 1
3 111000820 1
4 111000820 1
5 111000820 10003
6 111000820 10003
7 111000820 10003
8 111000820 10003

(I have other columns as well in this tibble which I am not displaying here)

thoughtfulnz · October 27, 2017, 12:39am

For each group means a group by, then you can use max to find the maximum for each group, and filter for 1 OR the group maximum

txtdt <- "group_id process
111000820       1
111000820       1
111000820   10001
111000820   10003
111000820   10003
111000821       1
111000821       1
111000821   10001
111000821   10002
111000821   10002"

example <- read.table(text=txtdt, header=TRUE)

library(dplyr)

example %>% group_by(group_id) %>%
filter(process == max(process) | process == 1)

and maybe throw in an ungroup() at the end depending on what is next.
by having the filter after the group_by it is at the group level, and by using OR in the filter it finds both groups you want.