Is there a way in R to save only the text of a news article from the internet?
I tried the following with the htm2txt package:
library(htm2txt)
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text_1 <- gettxt(url_1)
url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
text_2 <- gettxt(url_2)
All of the article text appears, but so does a lot of "extra text" that is not part of the article itself, such as footnote markers and category listings. For example:
p. 40/03B\n• ^ a
or identifiers\n• Articles with GND identifiers\n• Articles with ICCU identifiers\n•
- Is there a standard way to keep only the actual text of these articles? Or does this depend so much on the structure of each individual website that no "one size fits all" solution exists?
- Is there perhaps a method in R that recognizes only the "actual text"?
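To illustrate what I mean by "actual text", here is a minimal sketch, assuming the rvest package (an assumption on my part, I have not tried it on the pages above). It keeps only the text inside <p> elements of a small made-up HTML snippet. The snippet and the helper name extract_paragraph_text are hypothetical, and real sites would probably need a more specific selector:

```r
library(rvest)

# Hypothetical helper: keep only the text found inside <p> elements,
# which is often (but not always) where the article body lives.
extract_paragraph_text <- function(page) {
  paragraphs <- html_elements(page, "p")
  text <- html_text2(paragraphs)
  # Drop paragraphs that are empty or contain only whitespace
  text[nzchar(trimws(text))]
}

# Made-up page standing in for a downloaded article (assumption, not real data)
html <- '<html><body>
  <nav><ul><li>Home</li><li>News</li></ul></nav>
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
  <p> </p>
  <footer><ul><li>Articles with GND identifiers</li></ul></footer>
</body></html>'

page <- read_html(html)
article_text <- extract_paragraph_text(page)
print(article_text)
```

For a real page, read_html(url_2) would presumably replace the made-up snippet, but whether the <p> selector is enough will depend on how each site is built.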
Thank you!