[dplyr+tidyr] Filter Based of Multiple Column Values

Hi Posit Community.

Yet another filtering question for you if I may.

Lets say I have two dataframes/tables. One that contains a very large amount of data I need to filter based on two or more criteria, and another containing the unique pairings of rows upon which said criteria is based.

As a simple example, filtering dataframe df:

Participant Category Group Rating
Greg Int A 21
Greg Int B 20
Donna Post B 10
Donna Post A 11
Johnathan Fac A 20
Johnathan Fac B 10
Lewis Int B 11
Lewis Int A 20
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

Using the criteria in dataframe filtercrit:

Group_ID Category Group
Criteria_1 Int A
Criteria_2 Post B

To get an output like the following where the only entries retained are those that meet the paired criteria from Category and Group like so:

Participant Category Group Rating
Greg Int A 21
Donna Post B 10
Lewis Int A 20

My typical strategy of using %in% per column would be too inclusive, and my usual workaround of mutate() and pasting together the strings across all columns of interest into a unique string followed by using %in% would take an unrealistic amount of time to generate give then size of the real data I have (e.g. 2k filtering criteria pairs and several hundred thousand entries).

Is there a cleaner way to do this in dplyr or tidyr that I'm missing?

Thank you in advance!

Looks like you might want a semijoin:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B"
  ) -> filtercrit

library(dplyr)
df |> 
  semi_join(filtercrit)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3       Lewis      Int     A     20

Created on 2024-07-19 with reprex v2.0.2

(Retains rows of left table that match right table.)

4 Likes

You can perform an inner join:

library(dplyr)

>  since you do not want the 'Group_ID' column in the final dataframe, removing  the col from filtercrit.

filtercrit<- filtercrit[, c('Category', 'Group')]
 output<- inner_join(df, filtercrit, by = c('Category' , 'Group'))
1 Like

Hi @dromano and @Akash01

I think the joins are a great solution by themselves, but in practice, I'm also adding some additional criteria that (in hindsight) I might require something more nuanced.

For example, lets say I wanted the following:

Participant Category Group Rating
Greg Int A 21
Donna Post B 10
Johnathan Fac A 20
Johnathan Fac B 10
Lewis Int A 20

With the additional criteria of the Category by Group pairing occurring in a another criteria dataframe, OR Category = "Fac" in the dplyr syntax of:

df |> filter(Category == "Fac" | <pairs of interest in dataframe filtercrit>)

Would there be a way to pass the argument here?

Or would I be better off doing an inner_join() or semi_join() to get one set, then filtering by the other and then binding the rows to get the output of interest?

Thanks!

You could use an antijoin, where you specify non-matches instead:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Anti_criteria_1",     "Int",    "B",
  "Anti_criteria_2",    "Post",    "A"
  ) -> filtercrit2

library(dplyr)
df |> 
  anti_join(filtercrit2)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3   Johnathan      Fac     A     20
#> 4   Johnathan      Fac     B     10
#> 5       Lewis      Int     A     20

Created on 2024-07-21 with reprex v2.0.2

In the example it would work, but the anti_join() for my actual data would likely be too expansive.....

The row_bind() with the other filtering method plus the semi or inner join may be the way to go then.

I'll think some more, but you can always expand the filtering table too:

``` r
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B",
  "Criteria_3",    "Fac",    "A",
  "Criteria_4",    "Fac",    "B",
  ) -> filtercrit3
library(dplyr)
df |> 
  semi_join(filtercrit3)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3   Johnathan      Fac     A     20
#> 4   Johnathan      Fac     B     10
#> 5       Lewis      Int     A     20

Created on 2024-07-21 with reprex v2.0.2

How extensive would you say the criteria would be be? That might suggest a different approach, too.

Not particularly extensive but complicated.

Picture a scenario where I'm just looking for instances with "Fac" but don't necessarily care about the values of Group because there can be several hundred unique values on top of the pairing example already described.

Know it's not a lot to go on but only so much I can do with large datasets with protected info.

Could try scrambling some values here and there but it's always a bit of a tightrope.

1 Like

What about this then? You could keep separate inclusion criteria tables and use map() to apply them individually and collect the results:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B",
  ) -> filtercrit1

tribble(
     ~Group_ID, ~Category,
  "Criteria_3",    "Fac", 
  ) -> filtercrit2
library(dplyr)
library(purrr) # map() and list_rbind()
list(filtercrit1, filtercrit2) |> 
  map(\(crit) df |> semi_join(crit)) |> 
  list_rbind()
#> Joining with `by = join_by(Category, Group)`
#> Joining with `by = join_by(Category)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3       Lewis      Int     A     20
#> 4   Johnathan      Fac     A     20
#> 5   Johnathan      Fac     B     10

Created on 2024-07-22 with reprex v2.0.2

Hi all.

Just wanted to follow up with a discovery.

If you don't want to, or are in a position in which you are unable to use the joins or binds, I have found that you can create filter variables on the fly with a direct call to the column of interest as part of your filtering critera.

It's considerably uglier than the solutions proposed here, but would look something like this:

df |> filter(((Category == (filtercrit |> pull(Category))[1] & Group == (filtercrit |> pull(Group))[1])) | (Category == (filtercrit |> pull(Category))[2] & Group == (filtercrit |> pull(Group))[2]))

But you can actually adapt this to do additional filtering criteria or pull entire columns.

Reorganizing the code for legibility:

df |> 
  filter(
    (
      Category == (filtercrit |> pull(Category))[1] & 
        Group == (filtercrit |> pull(Group))[1]
    ) | 
      (
        Category == (filtercrit |> pull(Category))[2] & 
          Group == (filtercrit |> pull(Group))[2]
      )
  )

so it's clear that it says "keep rows where Catergoy and Group match the first row of filtercrit or the second row of filtercrit". Perfectly reasonable on the fly, although since you're using the [ operator, I'd probably suggest this to avoid extra typing:

df |> 
  filter(
    Category == filtercrit$Category[1] &  Group == filtercrit$Group[1] | 
      Category == filtercrit$Category[2] &  Group == filtercrit$Group[2]
  )
2 Likes

Fair for when you're using base R, tibbles, and dataframes.

However, things like arrow and duckdb don't seem to care for the $ operator and doing pull() is the best workaround I could manage when working with datasets so large I need to employ either.