filter out tweets from a dataset in R

mirrorballcrying · June 3, 2025, 6:31pm

It's working, because I can definitely see a new df forming, but I suspect it doesn't really search for all of the keywords I put in, based on a few brief trials. I'm pulling the data into R with the help of arrow because otherwise R wouldn't be able to handle it. Do you know if I did something wrong?

AlexisW · June 4, 2025, 8:33pm

I can see at least 2 problems. First, the dot . has a special meaning in regex (it matches any character), so you need to escape it:

library(tidyverse)

tibble(txt = c("roe va wade", "roe v. wade")) |>
  filter(str_detect(txt, "roe v. wade"))
#> # A tibble: 2 × 1
#>   txt        
#>   <chr>      
#> 1 roe va wade
#> 2 roe v. wade

tibble(txt = c("roe va wade", "roe v. wade")) |>
  filter(str_detect(txt, "roe v\\. wade"))
#> # A tibble: 1 × 1
#>   txt        
#>   <chr>      
#> 1 roe v. wade
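
If a keyword needs no regex features at all, an alternative to escaping by hand is to wrap it in stringr's fixed(), which compares the pattern literally. This is only a minimal sketch with the same toy strings as above, not the code from the original question. Note that with fixed() the | is literal too, so it suits a single keyword rather than a combined pattern.

```r
library(stringr)

txt <- c("roe va wade", "roe v. wade")

# fixed() compares the pattern literally, so "." only matches a real dot
literal_hits <- str_detect(txt, fixed("roe v. wade"))
literal_hits
#> [1] FALSE  TRUE
```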

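Second, the newlines inside the regex are a problem: the line break and the indentation become part of the pattern, so the second alternative only matches text that contains them: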

library(tidyverse)

tibble(txt = c("roevwade", "roe v. wade")) |>
  filter(str_detect(txt, "roevwade|
                    roe v\\. wade"))
#> # A tibble: 1 × 1
#>   txt     
#>   <chr>   
#> 1 roevwade

tibble(txt = c("roevwade", "roe v. wade")) |>
  filter(str_detect(txt, "roevwade|roe v\\. wade"))
#> # A tibble: 2 × 1
#>   txt        
#>   <chr>      
#> 1 roevwade   
#> 2 roe v. wade
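
To see what the regex engine actually receives, it can help to split the pattern on | and inspect each alternative. The sketch below writes the line break explicitly as \n so the hidden characters are visible; splitting on | is only a diagnostic here and would not be safe for patterns that use | inside groups.

```r
library(stringr)

txt <- c("roevwade", "roe v. wade")

# Equivalent to the multi-line pattern, with the newline spelled out
pattern <- "roevwade|\n                    roe v\\. wade"

# Split on the literal "|" to look at each alternative separately
alternatives <- str_split(pattern, fixed("|"))[[1]]

# The second alternative begins with a newline followed by spaces
starts_with_newline <- startsWith(alternatives[2], "\n")
starts_with_newline
#> [1] TRUE

# Trimming each alternative and re-joining gives the intended pattern
repaired <- str_c(str_trim(alternatives), collapse = "|")
repaired_hits <- str_detect(txt, repaired)
repaired_hits
#> [1] TRUE TRUE
```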

The most convenient approach might be to make a vector of patterns separately and assemble them into the final pattern:

library(tidyverse)

dat <- tibble(txt = c("roevwade", "roe v. wade", "roe va wade"))

match_terms <- c("roevwade",
                 "roe v\\. wade")
pattern <- str_c(match_terms, collapse = "|")

str_detect(dat$txt, pattern)
#> [1]  TRUE  TRUE FALSE
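
A possible extension of the same idea, sketched under a few assumptions: the keywords are kept as plain text and escaped automatically with str_escape() (available in stringr 1.5.0 and later), and matching is made case-insensitive with regex(ignore_case = TRUE), since tweets vary in capitalisation. The keywords vector and the mixed-case example strings are hypothetical, and whether these helpers are accepted inside an arrow query is not verified here; the pattern itself is built outside the query as an ordinary string.

```r
library(stringr)

txt <- c("RoeVWade", "Roe v. Wade", "roe va wade")

# Hypothetical keyword list, written as plain text (no manual escaping)
keywords <- c("roevwade", "roe v. wade")

# str_escape() escapes regex metacharacters such as "."
pattern <- str_c(str_escape(keywords), collapse = "|")

# regex(ignore_case = TRUE) makes the matching case-insensitive
hits <- str_detect(txt, regex(pattern, ignore_case = TRUE))
hits
#> [1]  TRUE  TRUE FALSE
```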