[dplyr + stringr] Filter by list of starts_with wildcards

TPDeRamus · July 17, 2024, 9:11pm

Hi Posit Community.

Hopefully quick question here.

Trying to filter a dataframe by a list of strings found at the beginning of each value in a column, but I'm not sure what the most efficient way to do this would be or how to deploy it across a list.

As an example, let's say I have the following:

Participant	Category	Rating
Greg	F0	21
Greg	C0.1	NA
Donna	1	17
Donna	01	NA

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, NA))

But I only want to retain rows that begin with the following strings:

filterlist <- c("F","C","1")

Like so:

Participant	Category	Rating
Greg	F0	21
Greg	C0.1	NA
Donna	1	17

I'm unfortunately drawing a blank on how best to implement this.

The following attempts either return zero rows or throw an error:

> df|> filter(Category %in% paste0("^",filterlist))
[1] Participant Category    Rating     
<0 rows> (or 0-length row.names)

> df |> filter(Category == paste(paste0("^",filterlist), collapse="|"))
[1] Participant Category    Rating     
<0 rows> (or 0-length row.names)

> df |> filter(Category == stringr::str_starts(Category, filterlist))
Error in `filter()`:
ℹ In argument: `Category == stringr::str_starts(Category, filterlist)`.
Caused by error in `stringr::str_starts()`:
! Can't recycle `string` (size 4) to match `pattern` (size 3).
Run `rlang::last_trace()` to see where the error occurred.

Would anyone happen to know what I'm missing here?

Thank you in advance!
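
For readers following along, here is a minimal base R sketch of why the first two attempts above come back empty. It assumes nothing beyond the example data; the vector names (`category`, `in_result`, `eq_result`, `regex_result`) are made up for illustration. Both `%in%` and `==` compare strings literally, so the caret is treated as an ordinary character rather than a start-of-string anchor.

```r
category <- c("F0", "C0.1", "1", "01")
filterlist <- c("F", "C", "1")

# %in% checks exact equality against "^F", "^C" and "^1"; no value equals those
in_result <- category %in% paste0("^", filterlist)

# == compares each value to the single literal string "^F|^C|^1"
eq_result <- category == paste(paste0("^", filterlist), collapse = "|")

# A regex-aware function is needed for "^" and "|" to act as operators
regex_result <- grepl(paste(paste0("^", filterlist), collapse = "|"), category)

print(in_result)
print(eq_result)
print(regex_result)
```

The third attempt fails for a different reason: `str_starts()` is vectorised over `pattern`, so a length-3 pattern cannot be recycled against a length-4 column.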

M_AcostaCH · July 17, 2024, 9:23pm

Hi @TPDeRamus , try with this:

library(dplyr)
library(stringr)
library(purrr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

filtered_df <- df %>%
  filter(map_lgl(Category, ~ any(str_starts(.x, filterlist))))

#  Participant Category Rating
#1        Greg       F0     21
#2        Greg     C0.1     NA
#3       Donna        1     17
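
A rough base R equivalent of the same "any prefix matches" idea, in case purrr is not available. This is only a sketch: it swaps in `startsWith()`, which matches literal prefixes rather than regular expressions, and the name `keep` is made up for illustration.

```r
df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21),
  stringsAsFactors = FALSE)

filterlist <- c("F", "C", "1")

# One logical vector per prefix, combined element-wise with OR
keep <- Reduce(`|`, lapply(filterlist, function(p) startsWith(df$Category, p)))

print(df[keep, ])
```

Because the comparison is literal, prefixes containing characters such as "." or "|" need no escaping here.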

dromano · July 17, 2024, 10:10pm

Here's another approach that uses regular expressions instead of literal strings:

filterlist <- c("F","C","1")

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, NA))

library(stringr)
# create regular expression for match criteria
fltr_str <-  
  filterlist |> 
  # use magrittr pipe to allow period replacement
  str_c( collapse = '|') %>% # insert OR operator
  str_c('^(', ., ')') # prepend START operator and enclose OR term

fltr_str
#> [1] "^(F|C|1)"

library(dplyr)
df |> filter(str_detect(Category, fltr_str))
#>   Participant Category Rating
#> 1        Greg       F0     21
#> 2        Greg     C0.1     NA
#> 3       Donna        1     17

Created on 2024-07-17 with reprex v2.0.2
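
One caveat with building a regular expression from a list: any prefix containing a regex metacharacter is interpreted as an operator. The sketch below is an illustration only, assuming stringr 1.5.0 or later for `str_escape()`. The prefixes and categories are hypothetical and were chosen to show the difference; they are not taken from the original data.

```r
library(stringr)

# Hypothetical prefixes; "C0.1" contains a dot, which is a regex wildcard
prefixes <- c("C0.1", "F")
categories <- c("C0.1", "C0x1", "F0", "01")

# Pattern built as-is: the dot matches any character
raw_pattern <- str_c("^(", str_c(prefixes, collapse = "|"), ")")

# Pattern built from escaped prefixes: the dot matches only a literal dot
escaped_pattern <- str_c("^(", str_c(str_escape(prefixes), collapse = "|"), ")")

raw_hits <- str_detect(categories, raw_pattern)
escaped_hits <- str_detect(categories, escaped_pattern)

print(raw_hits)
print(escaped_hits)
```

With the prefixes from this thread ("F", "C", "1") there is nothing to escape, so both patterns would behave the same.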

keithn · July 18, 2024, 11:10am

library(dplyr)
library(stringr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

df |>
  filter(str_starts(Category, paste(filterlist, collapse = "|")))
##   Participant Category Rating
## 1        Greg       F0     21
## 2        Greg     C0.1     NA
## 3       Donna        1     17

dromano · July 18, 2024, 11:20am

I wasn't aware of str_starts() — very handy!

TPDeRamus · July 18, 2024, 8:36pm

Unrelated, but it's a shame this doesn't have an arrow mapping.

dromano · July 19, 2024, 12:18am

Does that mean that you can't use stringr functions when filtering an arrow table?

TPDeRamus · July 19, 2024, 2:01pm

So here's the thing.

You can, but it behaves oddly based on the syntax.

So if you make it into an arrow table:

library(arrow)  # provides as_arrow_table()
library(dplyr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21)) |> as_arrow_table()

filterlist <- c("F", "C", "1")

And run a call like this one, it either fails or pulls it into R:

df |>
  filter(str_starts(Category, paste(filterlist, collapse = "|")))

But these two will work:

df |>
  filter(str_starts(Category, "F|C|1"))

df |>
  filter(str_starts(Category, filtervar))

Think it's a bug. Reporting it to the arrow devs.
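
What the two working calls have in common is that the pattern is already a plain string by the time `filter()` sees it. A minimal sketch of that precompute step follows, assuming `filtervar` (which is not defined in the post above) simply holds the collapsed pattern. It runs on an ordinary data frame so it does not need arrow; how an Arrow table handles it is as reported above and is not checked here.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Participant = c('Greg', 'Greg', 'Donna', 'Donna'),
  Category = c('F0', 'C0.1', '1', '01'),
  Rating = c(21, NA, 17, 21))

filterlist <- c("F", "C", "1")

# Assumed definition: build the pattern once, outside the filter() call
filtervar <- str_c(filterlist, collapse = "|")

result <- df |>
  filter(str_starts(Category, filtervar))

print(result)
```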

dromano · July 19, 2024, 2:18pm

Does anything change if you use to_duckdb() and to_arrow() to wrap the filter() call?

df |>
  to_duckdb() |> 
  filter(...) |> 
  to_arrow()

nirgrahamuk · July 19, 2024, 2:19pm

If the arrow team intended to implement the stringr functions, but not necessarily every base function for string manipulation, that might explain why paste() spoils your party. It also makes one think that stringr::str_c() might work.

TPDeRamus · July 19, 2024, 2:35pm

That's on my to-do list but the duckdb package currently won't install on the renv I have configured for some reason.

Trying to figure out why.
