Issue with regex phrase using stringr

jpcronin · May 4, 2020, 7:01pm

I'm currently cleaning a dataset of website meta titles, and many follow a format similar to this: headline - Publication name. E.g.: Theater - The New York Times. I want to remove everything past the -, but am running into some issues.

This was my initial code, but it resulted in removing too much info, for example all content after a hyphen or sometimes entire titles even when there wasn't a hyphen:

``` r
hl3 <- hl3 %>% mutate(title3 = str_replace(title2, regex("(?:.(?!-))+$"), "#"))
```

Given that I only want to remove text that comes after a pattern of space, hyphen, space, capital letter, I then tried:

``` r
hl3 <- hl3 %>% mutate(title3 = str_replace(title2, regex("(?:.(?![:space:]-[:space:][:upper:]))+$"), "#"))
```

This gives mixed results. In two instances it works as desired:
"Theater - The New York Times" ends up as Theater# (the hash isn't an issue, as I will change it later; it just makes it easier to see what is being removed during the process)
"‘Freestyle Love Supreme’ Review: Hip-Hop Saves the..." stays the same (i.e. nothing is deleted after the hyphen in Hip-Hop)
But some headlines are still mysteriously removed:
"The Latest: Powell signals Fed to forgo future rate cuts" ends up as just # (it doesn't matter whether or not there is a colon in the headline).
Can anyone see how I could tweak this code to get the desired result and stop it removing entire headlines that don't contain the " - A" format?
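
One likely explanation for the disappearing headlines, offered as a sketch rather than a definitive diagnosis: `(?:.(?!X))+$` matches a run of characters ending at the end of the string in which no character is immediately followed by `X`. When a title contains no " - A" separator at all, every character qualifies, so the run can begin at the first character and the whole title is replaced. The example below reuses the second pattern and two of the titles from this post; the `titles` vector and the `separator` pattern are illustrative assumptions, not part of the original code.

``` r
library(stringr)

titles <- c(
  "Theater - The New York Times",
  "The Latest: Powell signals Fed to forgo future rate cuts"
)

# Pattern from the post: a run of characters, none of which is
# immediately followed by " - A", anchored to the end of the string.
tempered <- "(?:.(?![:space:]-[:space:][:upper:]))+$"
tempered_result <- str_replace(titles, tempered, "#")
# With no separator present, the run can start at the first character,
# so the second title is matched (and replaced) in full.

# Hypothetical alternative: match the separator itself plus everything
# after it. A title without " - A" has no match and is left unchanged.
separator <- "\\s-\\s[A-Z].*$"
separator_result <- str_replace(titles, separator, "#")

print(tempered_result)
print(separator_result)
```

Anchoring the pattern on the separator, rather than on "characters not followed by the separator", means a missing separator produces no match instead of a match on the whole string.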

woodward · May 4, 2020, 7:12pm

Some of those characters are special characters, so you need to escape them with a double backslash.

``` r
library(stringr)
text <- "Theater - The New York Times"
str_replace(text, "(?<=\\s\\-\\s[A-Z]).+", "#")
#> [1] "Theater - T#"
```

Created on 2020-05-05 by the reprex package (v0.3.0)

Leon · May 4, 2020, 7:18pm

Depending on the consistency of your variable, an alternative could be something like this:

``` r
library("tidyverse")
tibble(title2 = c("Theater - The New York Times",
                  "Theater - Washington Post")) %>% 
  mutate(title3 = title2 %>%
           str_split(pattern = " - ") %>%
           map(1) %>%
           unlist)
```
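
To make the consistency caveat concrete, here is a small sketch of how splitting on " - " behaves on less regular titles. It uses `str_split_fixed()` instead of `str_split()` plus `map(1)`, which is a swapped-in variant rather than the exact code above, and the third title is made up for illustration.

``` r
library(stringr)

titles <- c(
  "Theater - The New York Times",
  "The Latest: Powell signals Fed to forgo future rate cuts",
  "Opinion - Why it matters - Washington Post"  # made-up title with two separators
)

# Split each title into at most two pieces at the first " - " and keep
# the first piece (the first column of the resulting character matrix).
first_piece <- str_split_fixed(titles, " - ", n = 2)[, 1]
print(first_piece)
```

A title with no separator comes back whole, while a headline that contains its own " - " is cut at the first one, so this approach only suits data where " - " never appears inside the headline itself.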

Hope it helps

jpcronin · May 4, 2020, 8:56pm

Thank you for this. It didn't quite give me the result I needed, but it pointed me in the right direction and helped me work out a version that does:

``` r
hl3 <- hl3 %>% mutate(title3 = str_replace(title2, "(?=\\s\\-\\s[A-Z]).+", "#"))
```

woodward · May 4, 2020, 8:59pm

I think (?<= ...) is for "preceded by". (?=...) is for "followed by".
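
A minimal side-by-side sketch of the two lookarounds, assuming the same " - A" separator discussed in this thread. The second title is a shortened, ASCII-only form of the hyphenated headline from the question, and the variable names are illustrative.

``` r
library(stringr)

titles <- c(
  "Theater - The New York Times",
  "Review: Hip-Hop Saves the...",
  "The Latest: Powell signals Fed to forgo future rate cuts"
)

# Lookbehind: the match starts AFTER " - T", so the separator and the
# first capital letter are kept in the result.
lookbehind_result <- str_replace(titles, "(?<=\\s-\\s[A-Z]).+", "#")

# Lookahead: the match starts AT the position followed by " - T", so
# `.+` consumes the separator too and it is removed.
lookahead_result <- str_replace(titles, "(?=\\s-\\s[A-Z]).+", "#")

print(lookbehind_result)
print(lookahead_result)
```

In both cases the titles without a space-hyphen-space-capital sequence have no match and are returned unchanged, including the one with the unspaced hyphen in Hip-Hop.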
