Remove all text after the last instance of a specific word, in text analysis

geoharmony · March 20, 2020, 4:02am

I'm using the readtext package and want to delete the reference list from each document as part of the text-cleaning process. (Each document is a PDF of a research article.) I tried the package's functions, including readtext::stri_replace_all_fixed(), but these replace specific chunks of text. How can I remove all text after the final instance of a word, such as after the word 'References'?

Here's an example of my original and desired text. I wish to delete all text after the word References appears a final time in that document:

text_orig <- c("Some text written about a topic, with the word Reference in it",
               "Some more text, with the word References followed by text I wish to delete",
               "More pages with stuff to be deleted")

text_desired <- c("Some text written about a topic, with the word Reference in it",
                  "Some more text, with the word")

Created on 2020-03-19 by the reprex package (v0.3.0)

technocrat · March 20, 2020, 6:07am

Hi, and welcome!

Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract quicker and more numerous answers. This question doesn't require one, though.

The general approach

library(stringr)
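text_orig <- c("Some text written about a topic, with the word Reference in it",
               "Some more text, with the word References followed by text I wish to delete",
               "More pages with stuff to be deleted")
# "Reference" plus everything after it, up to the end of the string
pat <- "Reference.*$"
# text put back in place of the match
retain <- "Reference"
# str_replace() is vectorised: each element is handled separately
str_replace(text_orig, pat, retain)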
#> [1] "Some text written about a topic, with the word Reference"
#> [2] "Some more text, with the word Reference"                 
#> [3] "More pages with stuff to be deleted"

Created on 2020-03-19 by the reprex package (v0.3.0)

If a single string holds more than one sentence and you need this to work on each of them, split the string into separate sentences first; that avoids getting tied up in knots writing the regex. A single regex is possible, and if you are adept, go for it. Otherwise, don't torture yourself.
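For the "last instance only" part of the question, here is a minimal base-R sketch rather than a definitive solution. It assumes the document is a character vector with one element per page, as in text_orig. The helper name drop_after_last and its arguments are made up for illustration; they are not part of readtext or stringr. The idea is to find the last element that contains the word, cut that element at the word's last position, and drop every element after it.

```r
# Hypothetical helper (not from any package): drop the last occurrence of
# `word` and everything that follows it.
# `pages` is assumed to be a character vector, e.g. one element per page.
drop_after_last <- function(pages, word = "References") {
  # indices of the elements that contain the word (literal match, no regex)
  hits <- grep(word, pages, fixed = TRUE)
  if (length(hits) == 0) {
    return(pages)  # word never appears: return the text unchanged
  }
  last <- max(hits)
  # start positions of every occurrence within that last matching element
  starts <- gregexpr(word, pages[last], fixed = TRUE)[[1]]
  cut <- max(starts)
  # keep only the text before the final occurrence, minus trailing whitespace
  pages[last] <- trimws(substr(pages[last], 1, cut - 1), which = "right")
  # drop every element after the one that held the final occurrence
  pages[seq_len(last)]
}

text_orig <- c("Some text written about a topic, with the word Reference in it",
               "Some more text, with the word References followed by text I wish to delete",
               "More pages with stuff to be deleted")

result <- drop_after_last(text_orig)
```

With the example data, result should match text_desired from the question: the first element is untouched because it contains "Reference" but not "References", the second is cut just before "References", and the third is dropped. If a document is instead one long string, a greedy pattern such as sub("(?s)^(.*)References.*$", "\\1", doc, perl = TRUE) should do the same job, since the greedy .* runs up to the last match.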

geoharmony · March 20, 2020, 6:30am

So sorry about not reading the FAQ as carefully as possible.

technocrat · March 20, 2020, 4:26pm

It's not ironclad; much of the time a reprex is needed to reproduce a problem well enough to understand it. Your question provided all of the necessary pieces.
