How do I filter rows across a large data set to remove unwanted values?

LukeColbeck · June 7, 2022, 1:24pm

Hi All,

I'm aiming to make a script that can be used to filter and tidy faunal abundance data to prepare it for multivariate analyses (as would be required for analysis in something like Primer) that is faster than sorting it manually through Excel.

I am dealing with a large faunal abundance data set which contains a mix of numeric values, and "P" (Present) values for where taxa were observed but could not be accurately counted. I'd like to learn how I can purge rows containing any "P" values without having to go column by column.

Here's a summary of the data I'm looking at. Only one "P" is visible in the first row, but a fair few more are scattered throughout the data:

# A tibble: 383 × 111
   Code  Phylum     `Taxa ID`   Qualifiers S001_F1 S001_F2 S001_F3 S002_F1 S002_F2 S002_F3
   <chr> <chr>      <chr>       <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 A5025 Ciliophora Ciliophora  NA         -       -       -       P       -       -      
 2 A5050 Ciliophora Folliculin… NA         -       -       -       -       -       -      
 3 C0001 Porifera   Porifera    NA         -       -       -       -       -       -      
 4 C0053 Porifera   Leucosolen… NA         -       -       -       -       -       -      
 5 C0133 Porifera   Sycon       NA         -       -       -       -       -       -      
 6 D0158 Cnidaria   Tubulariid… NA         -       -       -       -       -       -      
 7 D0218 Cnidaria   Eudendrium  NA         -       -       -       -       -       -      
 8 D0246 Cnidaria   Bougainvil… NA         -       -       -       -       -       -      
 9 D0335 Cnidaria   Lovenella … NA         -       -       -       -       -       -      
10 D0343 Cnidaria   Phialella … NA         -       -       -       -       -       -      
# … with 373 more rows, and 101 more variables:

And the script I've managed so far (not much unfortunately!). My attempt so far works for one column to remove rows, but ideally I don't want to have to go column by column!

#1. Keep only rows that contain "aggregate" or "female" or "NA" qualifiers
AbundanceFilteredQualifiers2 <- SeaLinkFauna11_PRIMER[SeaLinkFauna11_PRIMER$Qualifiers %in% c('aggregate', 'female', NA),]

#2. Drop rows with empty phylum
AbundanceFilteredPhylum <- SeaLinkFauna11_PRIMER %>% filter_at(vars(Phylum), all_vars(!is.na(.)))

#3. Remove rows containing "P"
Presence <- subset(SeaLinkFauna11_PRIMER, S002_F1 != 'P')

Thanks in advance!

FJCC · June 7, 2022, 2:22pm

Here is a simplified example using a data set I manually constructed. In the future, please try to provide a data set that can be easily copied by others. See the link at the end of this post.

library(dplyr)
DF <- data.frame(Code = c("A1", "A2", "B1"),
                 Phylum = c("A", "A", "B"),
                 S001_F1 = c("-", "-", "-"),
                 S001_F2 = c("-", "P", "-"),
                 S002_F1 = c("-", "-", "-"))
DF
#>   Code Phylum S001_F1 S001_F2 S002_F1
#> 1   A1      A       -       -       -
#> 2   A2      A       -       P       -
#> 3   B1      B       -       -       -
DFnew <- DF |> rowwise() |> 
  mutate(FindP = any(c_across(cols = S001_F1:S002_F1) == "P"))
DFnew
#> # A tibble: 3 x 6
#> # Rowwise: 
#>   Code  Phylum S001_F1 S001_F2 S002_F1 FindP
#>   <chr> <chr>  <chr>   <chr>   <chr>   <lgl>
#> 1 A1    A      -       -       -       FALSE
#> 2 A2    A      -       P       -       TRUE 
#> 3 B1    B      -       -       -       FALSE
DFnew <- DFnew |> filter(!FindP)
DFnew
#> # A tibble: 2 x 6
#> # Rowwise: 
#>   Code  Phylum S001_F1 S001_F2 S002_F1 FindP
#>   <chr> <chr>  <chr>   <chr>   <chr>   <lgl>
#> 1 A1    A      -       -       -       FALSE
#> 2 B1    B      -       -       -       FALSE

^{Created on 2022-06-07 by the reprex package (v2.0.1)}

FAQ: How to do a minimal reproducible example ( reprex ) for beginners - meta / Guides & FAQs - RStudio Community

LukeColbeck · June 7, 2022, 2:47pm

Thank you, that worked for my data and will hopefully save a lot of time in the future.

Shall make sure to provide a simplified data set for the inevitable questions I ask later. Thanks for taking the time to make a simplified data set yourself and providing the solution.

system · June 14, 2022, 2:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.