I'm using the readtext package and want to delete the reference list from each document as part of the text cleaning process. (Each document is a PDF of a research article.) I tried the package's functions, including readtext::stri_replace_all_fixed(), but these replace specific chunks of text. How can I remove all text after the final instance of a word, such as after the word 'References'?
Here's an example of my original and desired text. I wish to delete all text after the final occurrence of the word References in that document:
text_orig <- c("Some text written about a topic, with the word Reference in it",
               "Some more text, with the word References followed by text I wish to delete",
               "More pages with stuff to be deleted")
text_desired <- c("Some text written about a topic, with the word Reference in it",
                  "Some more text, with the word")
library(stringr)
text_orig<- c("Some text written about a topic, with the word Reference in it",
"Some more text, with the word References followed by text I wish to delete",
"More pages with stuff to be deleted")
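# Match from the first "Reference" in an element through the end of that element
pat <- "Reference.*$"
# Text that replaces the match, so the word itself is kept
retain <- "Reference"
# str_replace() is vectorised: each element of text_orig is handled separately
str_replace(text_orig, pat, retain)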
#> [1] "Some text written about a topic, with the word Reference"
#> [2] "Some more text, with the word Reference"
#> [3] "More pages with stuff to be deleted"
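As the output shows, the pattern is applied to each string separately, so the first sentence gets trimmed as well. To treat the sentences differently, you'll need to split the string and work on each sentence separately, which avoids getting tied up in knots writing a single regex. A single regex is possible, and if you are an adept, go for it. Otherwise, don't torture yourself.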
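For the original goal (dropping everything after the final occurrence of "References" in a document), one possible sketch is to locate the last literal match and keep only the text before it. Here `remove_after_last` is a hypothetical helper name, not part of stringr or readtext, and the code assumes that the pages of one document can be joined into a single string with newlines.

```r
library(stringr)

# Hypothetical helper: drop everything from the last occurrence of `marker` onward.
remove_after_last <- function(text, marker = "References") {
  # fixed() treats the marker as a literal string rather than a regex
  hits <- str_locate_all(text, fixed(marker))[[1]]
  if (nrow(hits) == 0) {
    return(text)  # marker not found: leave the text unchanged
  }
  last_start <- hits[nrow(hits), "start"]
  # Keep the text before the last match and trim trailing whitespace
  str_trim(str_sub(text, 1, last_start - 1), side = "right")
}

text_orig <- c("Some text written about a topic, with the word Reference in it",
               "Some more text, with the word References followed by text I wish to delete",
               "More pages with stuff to be deleted")

# Assumption: the vector elements are pages of one document
doc <- str_c(text_orig, collapse = "\n")
cleaned <- remove_after_last(doc)

# Split back into pages if a vector is wanted
pages <- str_split(cleaned, fixed("\n"))[[1]]
pages
```

Because the marker is matched literally, the singular "Reference" in the first sentence is left alone. With readtext output, which is a data frame with one row per document and a text column, the same helper could presumably be applied to each entry of that column, for example with vapply().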
It's not ironclad; much of the time a reprex is needed to reproduce a problem well enough to understand it. Your question provided all of the necessary pieces.