[dplyr+tidyr] Filter Based of Multiple Column Values

TPDeRamus · July 20, 2024, 12:45am

Hi Posit Community.

Yet another filtering question for you if I may.

Lets say I have two dataframes/tables. One that contains a very large amount of data I need to filter based on two or more criteria, and another containing the unique pairings of rows upon which said criteria is based.

As a simple example, filtering dataframe df:

Participant	Category	Group	Rating
Greg	Int	A	21
Greg	Int	B	20
Donna	Post	B	10
Donna	Post	A	11
Johnathan	Fac	A	20
Johnathan	Fac	B	10
Lewis	Int	B	11
Lewis	Int	A	20

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

Using the criteria in dataframe filtercrit:

Group_ID	Category	Group
Criteria_1	Int	A
Criteria_2	Post	B

To get an output like the following where the only entries retained are those that meet the paired criteria from Category and Group like so:

Participant	Category	Group	Rating
Greg	Int	A	21
Donna	Post	B	10
Lewis	Int	A	20

My typical strategy of using %in% per column would be too inclusive, and my usual workaround of mutate() and pasting together the strings across all columns of interest into a unique string followed by using %in% would take an unrealistic amount of time to generate give then size of the real data I have (e.g. 2k filtering criteria pairs and several hundred thousand entries).

Is there a cleaner way to do this in dplyr or tidyr that I'm missing?

Thank you in advance!

dromano · July 20, 2024, 12:52am

Looks like you might want a semijoin:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B"
  ) -> filtercrit

library(dplyr)
df |> 
  semi_join(filtercrit)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3       Lewis      Int     A     20

^{Created on 2024-07-19 with reprex v2.0.2}

(Retains rows of left table that match right table.)

Akash01 · July 21, 2024, 9:40am

You can perform an inner join:

library(dplyr)

>  since you do not want the 'Group_ID' column in the final dataframe, removing  the col from filtercrit.

filtercrit<- filtercrit[, c('Category', 'Group')]
 output<- inner_join(df, filtercrit, by = c('Category' , 'Group'))

TPDeRamus · July 22, 2024, 12:06am

Hi @dromano and @Akash01

I think the joins are a great solution by themselves, but in practice, I'm also adding some additional criteria that (in hindsight) I might require something more nuanced.

For example, lets say I wanted the following:

Participant	Category	Group	Rating
Greg	Int	A	21
Donna	Post	B	10
Johnathan	Fac	A	20
Johnathan	Fac	B	10
Lewis	Int	A	20

With the additional criteria of the Category by Group pairing occurring in a another criteria dataframe, OR Category = "Fac" in the dplyr syntax of:

df |> filter(Category == "Fac" | <pairs of interest in dataframe filtercrit>)

Would there be a way to pass the argument here?

Or would I be better off doing an inner_join() or semi_join() to get one set, then filtering by the other and then binding the rows to get the output of interest?

Thanks!

dromano · July 22, 2024, 12:30am

You could use an antijoin, where you specify non-matches instead:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Anti_criteria_1",     "Int",    "B",
  "Anti_criteria_2",    "Post",    "A"
  ) -> filtercrit2

library(dplyr)
df |> 
  anti_join(filtercrit2)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3   Johnathan      Fac     A     20
#> 4   Johnathan      Fac     B     10
#> 5       Lewis      Int     A     20

^{Created on 2024-07-21 with reprex v2.0.2}

TPDeRamus · July 22, 2024, 2:07am

In the example it would work, but the anti_join() for my actual data would likely be too expansive.....

The row_bind() with the other filtering method plus the semi or inner join may be the way to go then.

dromano · July 22, 2024, 2:11am

I'll think some more, but you can always expand the filtering table too:

``` r
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B",
  "Criteria_3",    "Fac",    "A",
  "Criteria_4",    "Fac",    "B",
  ) -> filtercrit3

library(dplyr)
df |> 
  semi_join(filtercrit3)
#> Joining with `by = join_by(Category, Group)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3   Johnathan      Fac     A     20
#> 4   Johnathan      Fac     B     10
#> 5       Lewis      Int     A     20

^{Created on 2024-07-21 with reprex v2.0.2}

How extensive would you say the criteria would be be? That might suggest a different approach, too.

TPDeRamus · July 22, 2024, 8:00pm

Not particularly extensive but complicated.

Picture a scenario where I'm just looking for instances with "Fac" but don't necessarily care about the values of Group because there can be several hundred unique values on top of the pairing example already described.

Know it's not a lot to go on but only so much I can do with large datasets with protected info.

Could try scrambling some values here and there but it's always a bit of a tightrope.

dromano · July 22, 2024, 8:47pm

What about this then? You could keep separate inclusion criteria tables and use map() to apply them individually and collect the results:

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna','Johnathan','Johnathan','Lewis','Lewis'),
  Category = c('Int', 'Int', 'Post', 'Post','Fac','Fac','Int','Int'),
  Group = c("A","B","B","A","A","B","B","A"),
  Rating = c(21, 20, 10, 11, 20, 10, 11, 20)
  )

library(tibble)
tribble(
     ~Group_ID, ~Category, ~Group,
  "Criteria_1",     "Int",    "A",
  "Criteria_2",    "Post",    "B",
  ) -> filtercrit1

tribble(
     ~Group_ID, ~Category,
  "Criteria_3",    "Fac", 
  ) -> filtercrit2

library(dplyr)
library(purrr) # map() and list_rbind()
list(filtercrit1, filtercrit2) |> 
  map(\(crit) df |> semi_join(crit)) |> 
  list_rbind()
#> Joining with `by = join_by(Category, Group)`
#> Joining with `by = join_by(Category)`
#>   Participant Category Group Rating
#> 1        Greg      Int     A     21
#> 2       Donna     Post     B     10
#> 3       Lewis      Int     A     20
#> 4   Johnathan      Fac     A     20
#> 5   Johnathan      Fac     B     10

^{Created on 2024-07-22 with reprex v2.0.2}

TPDeRamus · August 2, 2024, 5:27pm

Hi all.

Just wanted to follow up with a discovery.

If you don't want to, or are in a position in which you are unable to use the joins or binds, I have found that you can create filter variables on the fly with a direct call to the column of interest as part of your filtering critera.

It's considerably uglier than the solutions proposed here, but would look something like this:

df |> filter(((Category == (filtercrit |> pull(Category))[1] & Group == (filtercrit |> pull(Group))[1])) | (Category == (filtercrit |> pull(Category))[2] & Group == (filtercrit |> pull(Group))[2]))

But you can actually adapt this to do additional filtering criteria or pull entire columns.

dromano · August 3, 2024, 12:22pm

TPDeRamus:

df |> filter(((Category == (filtercrit |> pull(Category))[1] & Group == (filtercrit |> pull(Group))[1])) | (Category == (filtercrit |> pull(Category))[2] & Group == (filtercrit |> pull(Group))[2]))

Reorganizing the code for legibility:

df |> 
  filter(
    (
      Category == (filtercrit |> pull(Category))[1] & 
        Group == (filtercrit |> pull(Group))[1]
    ) | 
      (
        Category == (filtercrit |> pull(Category))[2] & 
          Group == (filtercrit |> pull(Group))[2]
      )
  )

so it's clear that it says "keep rows where Catergoy and Group match the first row of filtercrit or the second row of filtercrit". Perfectly reasonable on the fly, although since you're using the [ operator, I'd probably suggest this to avoid extra typing:

df |> 
  filter(
    Category == filtercrit$Category[1] &  Group == filtercrit$Group[1] | 
      Category == filtercrit$Category[2] &  Group == filtercrit$Group[2]
  )

TPDeRamus · September 17, 2024, 6:01pm

Fair for when you're using base R, tibbles, and dataframes.

However, things like arrow and duckdb don't seem to care for the $ operator and doing pull() is the best workaround I could manage when working with datasets so large I need to employ either.

system · December 16, 2024, 6:01pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.