I am trying to scrape a web page which contains the following structure:
<p>
<a href="https://somewhere1.com">sometext1</a>
<br>
somemoretext1
</p>
<p>
<a href="https://somewhere2.com">sometext2</a>
<br>
somemoretext2
<br>
<br>
<a href="https://somewhere3.com">sometext3</a>
<br>
somemoretext3
</p>
Basically, I would like to split up the second <p> node by replacing the two adjacent <br> tags with </p><p> or similar before I select all <p> nodes for further processing (with html_nodes("p")). So every <p> node should contain only one link plus "somemoretext", just like the first <p> node.
In the end I want to scrape all link-URLs, all "sometext"s, and all "somemoretext"s.
I assume that xml2::xml_replace() could be part of a solution, but I haven't figured out how to get it to work yet, even after reading the modification vignette.
(Note that the document contains many more <p> nodes with sometimes multiple adjacent <br> tags, so I might have to split up one <p> node into more than two.)
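
To make the goal concrete, here is a minimal sketch of one possible workaround. It sidesteps xml2::xml_replace() entirely and instead rewrites the raw HTML text before parsing, which is an assumption about what is acceptable here, not a confirmed solution. The variable names (html, split_html, result) are made up for illustration, and the regular expression assumes that adjacent <br> tags are separated by whitespace only.

```r
library(xml2)
library(rvest)

# Sample markup from above; for a real page the raw text would have to be
# fetched first (e.g. with readLines() on the URL), which is not shown here.
html <- '<p>
<a href="https://somewhere1.com">sometext1</a>
<br>
somemoretext1
</p>
<p>
<a href="https://somewhere2.com">sometext2</a>
<br>
somemoretext2
<br>
<br>
<a href="https://somewhere3.com">sometext3</a>
<br>
somemoretext3
</p>'

# Replace every run of two or more <br> tags (separated only by whitespace)
# with a paragraph break. A single <br> is left untouched.
split_html <- gsub("(<br\\s*/?>\\s*){2,}", "</p><p>", html,
                   perl = TRUE, ignore.case = TRUE)

doc   <- read_html(split_html)
paras <- html_nodes(doc, "p")

# One row per <p>: the link URL, the link text, and the first
# non-blank text node that sits directly inside the <p>.
result <- data.frame(
  url       = html_attr(html_node(paras, "a"), "href"),
  link_text = html_text(html_node(paras, "a")),
  more_text = trimws(html_text(
    html_node(paras, xpath = "./text()[normalize-space()]")
  )),
  stringsAsFactors = FALSE
)

print(result)
```

Because the pattern matches any run of two or more <br> tags, a <p> node with several such runs would be split into more than two pieces. Editing HTML with a regular expression is fragile, though, so a node-based approach with xml2 would still be preferable if one exists.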