What is the correct way to explore "xml_nodes" contents?

DiegoJ · July 13, 2022, 1:49pm

Hi.
html_nodes() returns an xml_nodeset object (I'll call it xml_nodes here), which is normally processed afterwards with html_table() to convert tables to data frames.

As you know, sometimes you get spurious data, and you need to understand where it comes from.

So the question is, how do you explore xml_nodes contents?

In the following web-scraping example, the author wants to extract the 3rd table, but in the xml_nodes result it turns out to be the 6th, which I found by trial and error.
My guess is that some HTML contained in JavaScript in the header is being processed as tables. However, it took me a while to get to this, and I would like to be able to inspect the xml_nodes faster next time.

library(rvest)

population_html <-
  read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")

population_nodes <- 
  html_nodes(population_html, "table")

population_nodes
View(population_nodes)
str(population_nodes)

View(), print(), and str() don't really show much about the information contained in each node.
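
One possible way to get a quicker overview is to build a small summary table with one row per node. This is only a sketch, assuming the xml2 accessors xml_name(), xml_attr() and xml_path() together with rvest's html_text(). The inline HTML and the names toy_html, toy_nodes and node_summary are made up for illustration so the example runs without fetching the Wikipedia page.

```r
library(rvest)
library(xml2)

# Toy page standing in for the real article (hypothetical content).
toy_html <- read_html('<html><body>
  <table class="infobox"><tr><td>Sidebar</td></tr></table>
  <div id="main-content">
    <table class="wikitable sortable">
      <tr><th>Country</th><th>Population</th></tr>
      <tr><td>A</td><td>1</td></tr>
    </table>
  </div>
</body></html>')

toy_nodes <- html_nodes(toy_html, "table")

# One row per node: position in the nodeset, tag name, class attribute,
# XPath location in the document, and a short preview of the text.
node_summary <- data.frame(
  index   = seq_along(toy_nodes),
  tag     = xml_name(toy_nodes),
  class   = xml_attr(toy_nodes, "class"),
  path    = xml_path(toy_nodes),
  preview = substr(gsub("\\s+", " ", html_text(toy_nodes, trim = TRUE)), 1, 40),
  stringsAsFactors = FALSE
)
print(node_summary)

# Raw markup of a single node, for when the summary is not enough.
cat(as.character(toy_nodes[[2]]), "\n")
```

The class and path columns are usually enough to tell a content table from a navigation or sidebar table, and the index column gives the position to pass on to html_table().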

The problem gets worse if I try to filter the tables below the main-content div, as the str() output is too large and View() shows a tree with no information.

html_nodes(population_html, "div")
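
Rather than collecting every div, the search could be narrowed with a descendant selector so that only tables nested inside one container are returned. A minimal sketch, assuming the container can be identified by an id: `main-content` below is a placeholder, and the actual id or class on the Wikipedia page would have to be checked in the browser's inspector. The toy HTML and the names content_tables, content_tables_xp and population_df are illustrative only.

```r
library(rvest)

# Toy page (hypothetical content) with one table outside and one inside the div.
toy_html <- read_html('<html><body>
  <table class="infobox"><tr><td>Sidebar</td></tr></table>
  <div id="main-content">
    <table class="wikitable">
      <tr><th>Country</th><th>Population</th></tr>
      <tr><td>A</td><td>1</td></tr>
    </table>
  </div>
</body></html>')

# CSS descendant selector: only <table> elements nested inside the chosen div.
content_tables <- html_nodes(toy_html, "div#main-content table")
length(content_tables)

# The same restriction expressed as XPath.
content_tables_xp <- html_nodes(toy_html, xpath = "//div[@id='main-content']//table")

# Convert the first matching table to a data frame.
population_df <- html_table(content_tables[[1]])
population_df
```

Scoping the selector this way keeps the nodeset small, so its positions depend less on unrelated tables elsewhere on the page.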
