rvest/xml2 replace nodes before scraping

dpprdan · January 22, 2021, 2:53pm

I am trying to scrape a web page which contains the following structure:

<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br> 
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br> 
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br> 
    somemoretext3
</p>

Basically, I would like to split up the second <p> node by replacing the two adjacent <br> tags with </p><p> or similar before I select all <p> nodes for further processing (with html_nodes("p")). So every <p> node should contain only one link plus "somemoretext", just like the first <p> node.

In the end I want to scrape all link-URLs, all "sometext"s, and all "somemoretext"s.
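One way to reach that end result without modifying the document at all might be to start from each link and take the first non-blank text node that follows it. This is only a sketch against the sample markup above: the variable names (`html`, `links`, `following`, `result`) are illustrative, and it assumes every "somemoretext" is a bare text node that is a later sibling of its link inside the same <p>.

```r
library(xml2)

# Sample markup copied from the question, used as a stand-in for the real page
html <- '<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br>
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br>
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br>
    somemoretext3
</p>'

doc <- read_html(html)

# Every link that is a direct child of a <p>
links <- xml_find_all(doc, "//p/a")

# For each link, the nearest following sibling text node that is not just whitespace
following <- xml_find_first(
  links,
  "following-sibling::text()[normalize-space()][1]"
)

result <- data.frame(
  url  = xml_attr(links, "href"),
  text = xml_text(links, trim = TRUE),
  more = xml_text(following, trim = TRUE),
  stringsAsFactors = FALSE
)

print(result)
```

Because the lookup is done per link, it should not matter how many links share a <p> node. A link with no text after it would get NA in the `more` column.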

I assume that xml2::xml_replace() could be part of a solution, but I haven't figured out how to get it to work yet, even after reading the modification vignette.

(Note that the document contains many more <p> nodes, sometimes with multiple adjacent <br> tags, so I might have to split up one <p> node into more than two.)
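If the split itself is what is needed, a possible workaround is to do the replacement on the raw HTML string before parsing. xml_replace() swaps whole nodes, so it cannot insert an unbalanced </p><p> pair. The sketch below assumes the page source is available as a single character string (here the sample markup, in a variable called `html`) and that a run of two or more <br> tags only ever occurs directly inside a <p>. The regular expression and all variable names are illustrative, and regex edits on HTML can be fragile on messier pages.

```r
library(xml2)

# Sample markup copied from the question, used as a stand-in for the real page
html <- '<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br>
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br>
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br>
    somemoretext3
</p>'

# Treat any run of two or more <br> tags (separated only by whitespace)
# as a paragraph boundary; a single <br> is left untouched
split_html <- gsub(
  "(?i)(?:<br\\s*/?>\\s*){2,}",
  "</p>\n<p>",
  html,
  perl = TRUE
)

doc <- read_html(split_html)
paragraphs <- xml_find_all(doc, "//p")

# Each <p> should now hold one link and one piece of trailing text
links <- xml_find_first(paragraphs, "./a")
more  <- xml_find_first(paragraphs, "./text()[normalize-space()][1]")

result <- data.frame(
  url  = xml_attr(links, "href"),
  text = xml_text(links, trim = TRUE),
  more = xml_text(more, trim = TRUE),
  stringsAsFactors = FALSE
)

print(result)
```

Since the pattern matches every such run, a <p> node with several of them would be split into as many pieces as needed.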
