str_detect() + filter() slightly working bamboozle... - reprex included

Hi there,

I was playing around with a dataset in the rrcov library and I decided to do something very simple / mundane which was to sub-set the dataset being used.

library("rrcov")
library("dplyr")
library("stringr")
# Package with the Fish and many other datasets
data(fish)
# The fish dataset requires a little data wrangling.
fish <- fish %>% mutate(
  `Species` = case_when(
    `Species` == "1" ~ "Bream",
    `Species` == "2" ~ "Whitewish",
    `Species` == "3" ~ "Roach",
    `Species` == "4" ~ "Parkki",
    `Species` == "5" ~ "Smelt",
    `Species` == "6" ~ "Pike",
    `Species` == "7" ~ "Perch"
  )
)

# Species before filter:
Summary1 <- fish %>% count(`Species`, sort = TRUE)

# Decided to filter out a few selected species
unique(fish$Species)

# These are the ones to be removed:
rmv_fish <- c("Parkki","Whitewish", "Smelt")
# Initially thought of creating a vector containing the undesired species
# Once issue identified, tried typing it out...

With rmv_fish my intention was to remove those specific fish using str_detect() within a filter() to create the desired sub-set.

Create the sub-set for the desired fish:

# Create a sub-set: 
fish2 <- fish %>% rename(`mass_g` = Weight,
                         `length_cm` = Length3) %>% 
  select(mass_g, length_cm, Height, Width, Species) %>% 
  filter(
    !str_detect(`Species`, pattern = c("Parkki","Whitewish","Smelt"))
  )

# Checking 
Summary2 <- fish2 %>% count(`Species`, sort = TRUE)

# Trying to see if spelling was off or something..
fish2 %>% filter(
  `Species` == c("Parkki",
                 "Whitewish",
                 "Smelt")
)

# Bamboozled, not all fish are being removed, dunno why...
Summary1
Summary2

Summary1
    Species  n
1     Perch 56
2     Bream 35
3     Roach 20
4      Pike 17
5     Smelt 14
6    Parkki 11
7 Whitewish  6

Summary2
    Species  n
1     Perch 56
2     Bream 35
3     Roach 20
4      Pike 17
5     Smelt 10
6    Parkki  8
7 Whitewish  4

I noticed that the unwanted fish were still in the sub-set, so I ran a reprex on that bit of code; the output is below. The number of observations does drop, from 159 to 150, and no errors are reported when I run the code in my own session. The error below only appears when creating the reprex.

fish2 <- fish %>% rename(`mass_g` = Weight,
                         `length_cm` = Length3) %>% 
  select(mass_g, length_cm, Height, Width, Species) %>% 
  filter(
    !str_detect(`Species`, pattern = c("Parkki","Whitewish","Smelt"))
  )
#> Error in fish %>% rename(mass_g = Weight, length_cm = Length3) %>% select(mass_g, : could not find function "%>%"

Created on 2021-06-17 by the reprex package (v2.0.0)

Any insight on what I'm not seeing?

It must be something very simple, but I can't seem to identify what is causing the issue..
Species don't appear to be written any differently.

Thanks for the time.

Hi there,

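The issue you are having is that the pattern for str_detect() should be a single regular expression, not a vector of possible values. When you pass a vector of patterns, str_detect() pairs the patterns with the strings element by element (recycling the shorter vector) instead of testing every string against every pattern, so only the rows that happen to line up with their own species name are removed.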
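A minimal sketch of that element-wise pairing, using made-up toy data rather than the real fish dataset (the `species` vector here is an assumption for illustration). The recycling is written out with rep_len() so the snippet behaves the same on any stringr version; as far as I know, recent stringr releases refuse to recycle a length-3 pattern against a longer vector and raise an error instead.

```r
library(stringr)

# Hypothetical toy data: six fish, all of which we would like to drop.
species  <- c("Smelt", "Parkki", "Whitewish", "Parkki", "Whitewish", "Smelt")
rmv_fish <- c("Parkki", "Whitewish", "Smelt")

# Spell out the recycling: each string is paired with exactly one pattern.
recycled <- rep_len(rmv_fish, length(species))
paired   <- str_detect(species, pattern = recycled)

# Only positions 4 to 6 line up with their own name, so only those are flagged.
paired

# Set membership flags every element, which is what the filter needed.
species %in% rmv_fish
```

Only the strings that happen to sit opposite their own name are detected, which would explain why some, but not all, of the unwanted species disappeared.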
Here is an example of how to filter using either a regex or a vector of values:

library(dplyr)
library(stringr)

#Get some data to play with
myData = iris %>% slice(1,2,51,52,101,102)
myData
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1          5.1         3.5          1.4         0.2     setosa
#> 2          4.9         3.0          1.4         0.2     setosa
#> 3          7.0         3.2          4.7         1.4 versicolor
#> 4          6.4         3.2          4.5         1.5 versicolor
#> 5          6.3         3.3          6.0         2.5  virginica
#> 6          5.8         2.7          5.1         1.9  virginica

# Filter with RegEx
myData %>% filter(
  !str_detect(Species, pattern = "virginica|versicolor")
)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa

# Filter with list
myData %>% filter(
  !Species %in% c("virginica","versicolor")
)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa

# Regex is more powerful (example ignore all that start with a 'v')
myData %>% filter(
  !str_detect(Species, pattern = "^v")
)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa

Created on 2021-06-17 by the reprex package (v2.0.0)
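Applied to the rmv_fish vector from the question, the same two approaches might look like the sketch below. The `fish_toy` data frame and its values are a made-up stand-in for the real fish data, and `rmv_pattern` is just an illustrative name.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the fish data from the question.
fish_toy <- data.frame(
  Species = c("Perch", "Parkki", "Bream", "Smelt", "Whitewish", "Pike"),
  Weight  = c(100, 55, 340, 12, 270, 430)
)
rmv_fish <- c("Parkki", "Whitewish", "Smelt")

# Option 1: collapse the vector into a single alternation pattern.
rmv_pattern <- str_c(rmv_fish, collapse = "|")
kept_regex  <- fish_toy %>% filter(!str_detect(Species, rmv_pattern))

# Option 2: exact matching against the vector, no regex involved.
kept_in <- fish_toy %>% filter(!Species %in% rmv_fish)

kept_regex$Species
kept_in$Species
```

One difference to keep in mind: the regex version matches substrings, so a pattern like "Smelt" would also hit a longer name that contains it, whereas %in% only matches whole values.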

If you'd like to learn RegEx patterns, check out this great tutorial: https://regexone.com/

Hope this helps,
PJ

Yeah, str_detect() only checks each string against one pattern, so a vector of patterns isn't treated as a set of alternatives.

I usually use map() to loop over the patterns and reduce() to combine the results with "and" or "or", which returns the strings that match all of the patterns or any of them.

See example below.

I have stubbornly refused to learn regex and I think I'm happier for it! Personally I wish that stringr would have punted regex altogether.

library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

iris$Species %>%
  list() %>%
  map2(c("setosa", "versicolor"), str_detect) %>%
  reduce(or)
#>   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [97]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [145] FALSE FALSE FALSE FALSE FALSE FALSE

Created on 2021-06-17 by the reprex package (v1.0.0)
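A variation on the same map/reduce idea, sketched with purrr and stringr only and with fixed() so that each pattern is treated as a literal string rather than a regex. The `species` vector is made-up toy data, and `hits`, `match_any` and `match_all` are illustrative names.

```r
library(purrr)
library(stringr)

# Hypothetical toy data and the species to remove.
species  <- c("Perch", "Parkki", "Bream", "Smelt", "Whitewish", "Pike")
rmv_fish <- c("Parkki", "Whitewish", "Smelt")

# One logical vector per pattern; fixed() switches off regex interpretation.
hits <- map(rmv_fish, function(p) str_detect(species, fixed(p)))

# "Any pattern matches": combine the vectors element-wise with |
match_any <- reduce(hits, `|`)

# "All patterns match": combine the vectors element-wise with &
match_all <- reduce(hits, `&`)

# Keep only the species that match none of the patterns.
species[!match_any]
```

Inside a filter() call, the negated `match_any` vector would play the same role as !str_detect() with a single combined pattern.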
