Saving the Text of a News Article in R?

Is there some way in R to save only the text from a news article on the internet?

library(htm2txt)

# gettxt() downloads each page and converts the whole HTML document to plain text
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text_1 <- gettxt(url_1)

url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
text_2 <- gettxt(url_2)

All the text from the article comes through, but so does a lot of "extra text" that is not part of the article itself. For example:

p. 40/03B\n• ^ a or identifiers\n• Articles with GND identifiers\n• Articles with ICCU identifiers\n•

  • Is there some standard way to keep only the actual text from these articles? Or does this depend too much on the individual structure of each website, so that no "one size fits all" solution exists for such a problem?
  • Perhaps there is some method of doing this in R that recognizes only the "actual text"? (A rough sketch of the kind of thing I have in mind is below.)
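Something along these lines, perhaps, using the rvest package instead of htm2txt? The "article p" selector is just a guess on my part and would presumably need to change from site to site.

library(rvest)

url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
page_2 <- read_html(url_2)

# Keep only the paragraph nodes inside the article body;
# the "article p" selector is a guess and may differ per site
paragraphs <- html_elements(page_2, "article p")
text_2 <- paste(html_text2(paragraphs), collapse = "\n\n")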

Thank you!

Hi @omario,
This looks like an encoding issue. Try:

text_1 <- gettxt(url_1, encoding = "UTF-8")
