I'm currently cleaning a dataset of website meta titles, and many follow a format similar to this: headline - Publication name. E.g.: Theater - The New York Times. I want to remove everything past the -, but am running into some issues.
This was my initial code, but it removed too much: for example, everything after any hyphen, or sometimes an entire title even when it contained no hyphen:
hl3 <- hl3 %>% mutate(title3 = str_replace(title2, regex("(?:.(?!-))+$"), "#"))
Given that I want to remove the text that comes after a pattern of space, hyphen, space, capital letter, I then tried:
hl3 <- hl3 %>% mutate(title3 = str_replace(title2, regex("(?:.(?![:space:]-[:space:][:upper:]))+$"), "#"))
In two instances this works as desired:
"Theater - The New York Times" ends up as Theater# (the hash isn't an issue, as I will change it later; it just makes it easier to see what is being removed during the process)
"‘Freestyle Love Supreme’ Review: Hip-Hop Saves the..." stays the same (i.e. nothing is deleted after the - in hip-hop)
But some headlines are still mysteriously removed:
"The Latest: Powell signals Fed to forgo future rate cuts" ends up as just # (it doesn't matter whether or not there is a colon in the headline).
Can anyone see how I could tweak this code to get the desired result and stop it from removing entire headlines that don't contain the " - A" format?
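
One possible direction, offered as a sketch rather than a definitive fix: the negative-lookahead pattern matches from the start of the string whenever no separator is present, because every character then passes the lookahead, so the whole title gets replaced. Matching the separator itself avoids that. The helper name strip_publication and the vector titles below are hypothetical, introduced only for illustration, and the sketch assumes each title is a single line and that the first " - " followed by a capital letter marks the start of the publication name.

```r
library(stringr)

# Example titles taken from the question above (illustrative data only).
titles <- c(
  "Theater - The New York Times",
  "The Latest: Powell signals Fed to forgo future rate cuts",
  "‘Freestyle Love Supreme’ Review: Hip-Hop Saves the..."
)

# Hypothetical helper: match space, hyphen, space, capital letter, and
# everything after it. A title without that separator has no match,
# so str_replace() returns it unchanged.
strip_publication <- function(x) {
  str_replace(x, " - [[:upper:]].*$", "#")
}

cleaned <- strip_publication(titles)
print(cleaned)
```

With the data frame from the question, the same pattern could be used as hl3 %>% mutate(title3 = str_replace(title2, " - [[:upper:]].*$", "#")). If a headline itself contains " - " followed by a capital letter, everything from that first occurrence onward would be removed, so that case would need a different rule.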