Then would like to group the ID's and have a code such that a can loop through the tasks per subgroup. If a particular subgroup has a specific task I will remove it.
For example, the rule can be: delete complete subgroup if it contains a task c. Then in the example above, the complete ID=2 is removed and only ID=1 is returned with all its tasks, like:
ID TASK
1 a
1 a
1 b
Obviously, my real dataset is enourmous, and I would like to find a code that is easily to adapt to screen for different kinds of tasks (but same kind of idea).
And, as you may have guessed, I am a very new to coding and R. So I am very very gratefull with your help! And a simple solution would be awesome!
If it can be a continuation of the following, that would be great, since I probably would understand it the best than
Is there a specific reason you'd like to use a loop for this?
If not, you might consider doing a filter() on a grouped tibble/data frame and using the any() function (below, I've negated it using the logical not operator, !).
library(tidyverse)
ID <- c(rep(1,3), rep(2,3))
TASK <- c("a", "a", "b", "a", "c", "a")
df <- tibble(ID, TASK)
df %>%
group_by(ID) %>%
filter(!any(TASK == "c"))
#> # A tibble: 3 x 2
#> # Groups: ID [1]
#> ID TASK
#> <dbl> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 b
I always go to SQL although there may be other solutions. The steps I go through are:
Filter data to see which IDs have letter(s) which you want out of the data set.
anti_join the original data with the IDs that have the letter.
But it seems like Mara's solution is better than this.
library(tidyverse)
# Make data frame -----------------------------------------------
df<- tribble(
~ID, ~TASK,
1, 'a',
1, 'a',
1, 'b',
2, 'a',
2, 'c',
2, 'a'
)
# make filter vector. --------------------------------------------------
#add as many letters as you want
filter_out_letters = c('c')
#this is what we want to filter
df%>%
filter(TASK %in% filter_out_letters)%>%
distinct(ID)
# filter out the ID which has the filter letter(s) ----------------------------
anti_join(df,
df%>%
filter(TASK %in% filter_out_letters),
by = 'ID')
It's probably not important in this particular case, but perhaps it's better to ungroup() at the end.
Also, an alternative to !any(TASK == "c") will be all(TASK != "c"), which for some reason is more intuitive to me. It also saves me a keystroke.
Reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(ID = rep(x = 1:2, each = 3),
TASK = c("a", "a", "b", "a", "c", "a"))
df %>%
group_by(ID) %>%
filter(all(TASK != "c")) %>%
ungroup()
#> # A tibble: 3 x 2
#> ID TASK
#> <int> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 b
You are my heroes!
Thank you so much for the amazingly fast and great replies!
Especially many thanks that the solutions are intuitively understandable and are a great add to my starting toolbox of R knowledge