[dplyr + stringr] Filter by list of starts_with wildcards

Hi Posit Community.

Hopefully quick question here.

Trying to filter a dataframe by a list of strings found at the beginning of each value in a column, but I'm not sure what the most efficient way to do this would be or how to deploy it across a list.

As an example, let's say I have the following:

Participant  Category  Rating
Greg         F0        21
Greg         C0.1      NA
Donna        1         17
Donna        01        NA

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, NA))

But I only want to retain rows whose Category begins with one of the following strings:

filterlist <- c("F","C","1")

Like so:

Participant  Category  Rating
Greg         F0        21
Greg         C0.1      NA
Donna        1         17

I'm unfortunately drawing a blank on how best to implement this.

The following attempts either return zero rows or throw an error:

> df|> filter(Category %in% paste0("^",filterlist))
[1] Participant Category    Rating     
<0 rows> (or 0-length row.names)

> df |> filter(Category == paste(paste0("^",filterlist), collapse="|"))
[1] Participant Category    Rating     
<0 rows> (or 0-length row.names)

> df |> filter(Category == stringr::str_starts(Category, filterlist))
Error in `filter()`:
ℹ In argument: `Category == stringr::str_starts(Category, filterlist)`.
Caused by error in `stringr::str_starts()`:
! Can't recycle `string` (size 4) to match `pattern` (size 3).
Run `rlang::last_trace()` to see where the error occurred.

Would anyone happen to know what I'm missing here?

Thank you in advance!

Hi @TPDeRamus, try this:

library(dplyr)
library(stringr)
library(purrr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

filtered_df <- df %>%
  filter(map_lgl(Category, ~ any(str_starts(.x, filterlist))))

#  Participant Category Rating
#1        Greg       F0     21
#2        Greg     C0.1     NA
#3       Donna        1     17
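
For anyone wondering why the `map_lgl()` wrapper is needed: `str_starts()` pairs up `string` and `pattern` element by element, so a length-4 column against a length-3 vector of prefixes triggers the recycling error from the question. Checking one value at a time against all the prefixes sidesteps that. A small sketch of what happens for a single row (the names `one_value` and `hits` are only for illustration):

```r
library(stringr)

filterlist <- c("F", "C", "1")

# A single string is recycled against all three prefixes,
# giving one TRUE/FALSE per prefix.
one_value <- "F0"
hits <- str_starts(one_value, filterlist)
hits
#> [1]  TRUE FALSE FALSE

# any() collapses that into a single keep/drop decision for the row.
any(hits)
#> [1] TRUE

# "01" starts with "0", so no prefix matches and the row is dropped.
any(str_starts("01", filterlist))
#> [1] FALSE
```

`map_lgl()` simply repeats this check for every value in `Category` and returns one logical per row, which is what `filter()` expects.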


Here's another approach that uses regular expressions instead of literal strings:

filterlist <- c("F","C","1")

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, NA))

library(stringr)
# create regular expression for match criteria
fltr_str <-  
  filterlist |> 
  # use magrittr pipe to allow period replacement
  str_c( collapse = '|') %>% # insert OR operator
  str_c('^(', ., ')') # prepend START operator and enclose OR term

fltr_str
#> [1] "^(F|C|1)"

library(dplyr)
df |> filter(str_detect(Category, fltr_str))
#>   Participant Category Rating
#> 1        Greg       F0     21
#> 2        Greg     C0.1     NA
#> 3       Donna        1     17

Created on 2024-07-17 with reprex v2.0.2
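
One caveat when building a regex from the list: a prefix containing a regex metacharacter (a `.`, for example) would be treated as a pattern rather than as literal text. If the prefixes are meant literally, a base R sketch with `startsWith()` avoids regular expressions altogether. The name `keep` is only for illustration:

```r
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, NA),
  stringsAsFactors = FALSE)  # startsWith() needs character input

filterlist <- c("F", "C", "1")

# Build one logical vector per prefix (literal match, no regex),
# then combine them element-wise with OR.
keep <- Reduce(`|`, lapply(filterlist, function(p) startsWith(df$Category, p)))
keep
#> [1]  TRUE  TRUE  TRUE FALSE

df[keep, ]
```

This trades the compactness of a single regex for not having to think about escaping.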

library(dplyr)
library(stringr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

df |>
  filter(str_starts(Category, paste(filterlist, collapse = "|")))
##   Participant Category Rating
## 1        Greg       F0     21
## 2        Greg     C0.1     NA
## 3       Donna        1     17
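
This works without an explicit `^` because, as far as I can tell from the stringr source, `str_starts()` wraps a regex pattern as `^(pattern)` before matching, so the anchor applies to every alternative. The sketch below contrasts that with `str_detect()`, where the anchoring has to be done by hand (the vector `category` is only for illustration):

```r
library(stringr)

category <- c("F0", "C0.1", "1", "01")
filterlist <- c("F", "C", "1")
pattern <- paste(filterlist, collapse = "|")  # "F|C|1"

# str_starts() anchors the whole alternation at the start of the string.
str_starts(category, pattern)
#> [1]  TRUE  TRUE  TRUE FALSE

# str_detect() with the same pattern matches anywhere, so "01" slips through.
str_detect(category, pattern)
#> [1] TRUE TRUE TRUE TRUE

# "^F|C|1" anchors only the first alternative; "01" still matches on "1".
str_detect(category, paste0("^", pattern))
#> [1] TRUE TRUE TRUE TRUE

# Anchoring each alternative ("^F|^C|^1"), as in the question's second
# attempt, works once it is passed to str_detect() instead of ==.
str_detect(category, paste0("^", filterlist, collapse = "|"))
#> [1]  TRUE  TRUE  TRUE FALSE
```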

I wasn't aware of str_starts() — very handy!


Unrelated, but it's a shame this doesn't have an arrow mapping.

Does that mean that you can't use stringr functions when filtering an arrow table?

So here's the thing.

You can, but it behaves oddly based on the syntax.

So if you make it into an arrow table:

library(arrow)  # provides as_arrow_table()

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21)) |> as_arrow_table()

filterlist <- c("F", "C", "1")

and then run a call like this one, it either fails or pulls the data into R:

df |>
  filter(str_starts(Category, paste(filterlist, collapse = "|")))

But these two work:

df |>
  filter(str_starts(Category, "F|C|1"))

df |>
  filter(str_starts(Category, filtervar))
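
`filtervar` isn't defined in the snippet above; presumably it holds the collapsed pattern, built before the pipeline so that no `paste()` call appears inside `filter()`. A minimal sketch under that assumption, shown on a plain data frame (not verified here against an Arrow Table; `result` is just an illustrative name):

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

# Assumption: filtervar is the collapsed pattern, computed up front
# so that filter() only ever sees a plain string.
filtervar <- paste(filterlist, collapse = "|")
filtervar
#> [1] "F|C|1"

result <- df |>
  filter(str_starts(Category, filtervar))

result$Category
#> [1] "F0"   "C0.1" "1"
```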

Think it's a bug. Reporting it to the arrow devs.

Does anything change if you use to_duckdb() and to_arrow() to wrap the filter() call?

df |>
  to_duckdb() |> 
  filter(...) |> 
  to_arrow()

If the arrow team intended to implement the stringr functions, but not necessarily all of the base functions related to string manipulation, that might explain why paste() spoils your party. It also makes one think that perhaps stringr::str_c() might work.

That's on my to-do list but the duckdb package currently won't install on the renv I have configured for some reason.

Trying to figure out why.
