delete subgroups conditionally

Maike · April 9, 2019, 10:06am

Dearest community,

If this is my input:

ID TASK
1 a
1 a
1 b
2 a
2 c
2 a

Then would like to group the ID's and have a code such that a can loop through the tasks per subgroup. If a particular subgroup has a specific task I will remove it.
For example, the rule can be: delete complete subgroup if it contains a task c. Then in the example above, the complete ID=2 is removed and only ID=1 is returned with all its tasks, like:

ID TASK
1 a
1 a
1 b

Obviously, my real dataset is enourmous, and I would like to find a code that is easily to adapt to screen for different kinds of tasks (but same kind of idea).

And, as you may have guessed, I am a very new to coding and R. So I am very very gratefull with your help! And a simple solution would be awesome!

If it can be a continuation of the following, that would be great, since I probably would understand it the best than

library(dplyr)
df %>%
arrange(ID) %>%
group_by(ID) %>%
...help...

Thanks a million!!!

mara · April 9, 2019, 10:46am

Is there a specific reason you'd like to use a loop for this?

If not, you might consider doing a filter() on a grouped tibble/data frame and using the any() function (below, I've negated it using the logical not operator, !).

library(tidyverse)

ID <- c(rep(1,3), rep(2,3))
TASK <- c("a", "a", "b", "a", "c", "a")

df <- tibble(ID, TASK)

df %>%
  group_by(ID) %>%
  filter(!any(TASK == "c"))
#> # A tibble: 3 x 2
#> # Groups:   ID [1]
#>      ID TASK 
#>   <dbl> <chr>
#> 1     1 a    
#> 2     1 a    
#> 3     1 b

^{Created on 2019-04-09 by the reprex package (v0.2.1)}

Col · April 9, 2019, 10:49am

I always go to SQL although there may be other solutions. The steps I go through are:

Filter data to see which IDs have letter(s) which you want out of the data set.
anti_join the original data with the IDs that have the letter.

But it seems like Mara's solution is better than this.

library(tidyverse)
# Make data frame -----------------------------------------------

df<- tribble(
~ID, ~TASK,
1, 'a',
1, 'a',
1, 'b',
2, 'a',
2, 'c',
2, 'a'
)

# make filter vector. --------------------------------------------------

#add as many letters as you want
filter_out_letters = c('c')

#this is what we want to filter
df%>%
  filter(TASK %in% filter_out_letters)%>%
  distinct(ID)


# filter out the ID which has the filter letter(s) ----------------------------

anti_join(df,
          df%>%
            filter(TASK %in% filter_out_letters),
          by = 'ID')

Yarnabrina · April 9, 2019, 12:03pm

It's probably not important in this particular case, but perhaps it's better to ungroup() at the end.

Also, an alternative to !any(TASK == "c") will be all(TASK != "c"), which for some reason is more intuitive to me. It also saves me a keystroke.

Reprex

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(ID = rep(x = 1:2, each = 3),
             TASK = c("a", "a", "b", "a", "c", "a"))

df %>%
  group_by(ID) %>%
  filter(all(TASK != "c")) %>%
  ungroup()
#> # A tibble: 3 x 2
#>      ID TASK 
#>   <int> <chr>
#> 1     1 a    
#> 2     1 a    
#> 3     1 b

Maike · April 9, 2019, 7:04pm

You are my heroes!
Thank you so much for the amazingly fast and great replies!
Especially many thanks that the solutions are intuitively understandable and are a great add to my starting toolbox of R knowledge

system · April 16, 2019, 7:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.