Filter based on multiple conditions

clarktar · May 7, 2024, 6:53pm

I have the following data frame

library(dplyr)
dat3=data.frame(reach=c('T','T', 'T', 'T','C','C','C','C'), metric=c("% cover", "% rock", "% cover", "% rock", "% cover", "% rock", "% cover", "% rock"), 
                Value=c(0,1,1,0,1,0,1,0),site=c(1,1,2,2,1,1,2,2) )
dat3 <- dat3 |> 
  group_by(site,metric)
dat3
#> # A tibble: 8 × 4
#> # Groups:   site, metric [4]
#>   reach metric  Value  site
#>   <chr> <chr>   <dbl> <dbl>
#> 1 T     % cover     0     1
#> 2 T     % rock      1     1
#> 3 T     % cover     1     2
#> 4 T     % rock      0     2
#> 5 C     % cover     1     1
#> 6 C     % rock      0     1
#> 7 C     % cover     1     2
#> 8 C     % rock      0     2

^{Created on 2024-05-07 with reprex v2.1.0}

I have been trying to use a tidyverse approach to achieve the desired result. What I would like is for any case where Site has a 0 value for BOTH reaches for ANY metric it is filtered OUT of the data frame. In this example above I would end up with this data frame:

#> # A tibble: 6 × 4
#> # Groups:   site, metric [4]
#>   reach metric  Value  site
#>   <chr> <chr>   <dbl> <dbl>
#> 1 T     % cover     0     1
#> 2 T     % rock      1     1
#> 3 T     % cover     1     2
#> 4 C     % cover     1     1
#> 5 C     % rock      0     1
#> 6 C     % cover     1     2

I have tried group_by (Site, Metric) then have tried various ways to select for cases where value = 0 for BOTH Reach values(T and C) but have not been successful. This is an example data.frame. In reality there are 100's of unique Sites and each have a T and C value or Reach and there are 50 different metrics for each Site (a metric value for every T and C of a site). Hopefully that makes some sense!

Does anyone have any ideas on how to achieve the desired result using a tidyverse approach?

^{Created on 2024-05-07 with reprex v2.1.0}

prubin · May 7, 2024, 8:03pm

Assuming that Value is nonnegative, does dat3 |> group_by(site, metric) |> filter(max(Value) > 0) do what you want on the real data frame?

clarktar · May 7, 2024, 8:29pm

Yep!! That works great. Group is such a great function! Just to make sure I understand, after grouping by Site (which will include both T and C records) and Metric (so we will have a record for each Metric and Site (T and C) grouped and as long as that group has atleast one record with a max value >0 then the entire group is retained.

Correct?

prubin · May 7, 2024, 8:51pm

I think you've got it, but just to be safe this is what happens. The data is split into groups based on site and metric. So you have one group for site 1/% cover, one group for site 1/% rock, one group for site 2/% cover, etc. With 50 metrics, you will get 50 groups for each site. Now the filter operation is applied group by group. If any one record in a group has a positive value, the entire group survives. If all records in a group have value 0, the entire group is removed.

Depending on what you plan to do with the remaining data, you might want to apply the ungroup() function to it, so that subsequent operations are based on individual records and not groups. Grouping is "sticky" -- once a data frame is grouped, it stays grouped until you say otherwise.

dromano · May 8, 2024, 12:54pm

I don't know if you can edit your initial post still, but it seems your question is really:

What I would like is for any combination of site and metric, if both reaches have a value 0, then all rows for that combination are filtered out.

Which also suggests that your goal is more faithfully achieved (in a cognitive sense) with a different table:

library(dplyr)
library(tidyr)
dat3=data.frame(reach=c('T','T', 'T', 'T','C','C','C','C'), metric=c("% cover", "% rock", "% cover", "% rock", "% cover", "% rock", "% cover", "% rock"), 
                Value=c(0,1,1,0,1,0,1,0),site=c(1,1,2,2,1,1,2,2) )
dat3 |> 
  pivot_wider(names_from = reach, values_from = Value)
#> # A tibble: 4 × 4
#>   metric   site     T     C
#>   <chr>   <dbl> <dbl> <dbl>
#> 1 % cover     1     0     1
#> 2 % rock      1     1     0
#> 3 % cover     2     1     1
#> 4 % rock      2     0     0

dat3 |> 
  pivot_wider(names_from = reach, values_from = Value) |> 
  filter(T > 0 | C  >  0 )
#> # A tibble: 3 × 4
#>   metric   site     T     C
#>   <chr>   <dbl> <dbl> <dbl>
#> 1 % cover     1     0     1
#> 2 % rock      1     1     0
#> 3 % cover     2     1     1

^{Created on 2024-05-08 with reprex v2.0.2}

clarktar · May 8, 2024, 5:35pm

Yes, that approach works as well and your rewording of my statement also makes it potentially more clear. I do like the approach where I do not have to pivot to a wide format table in this case.

dromano · May 8, 2024, 5:48pm

I guess what I was wondering was whether some pivoting was already applied to get your original table from the raw data, or whether the raw data already comes in the form of your original table. Of course this is all beside the point relative to your original question.

system · May 15, 2024, 5:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.