Can't extract hrefs from table

tippytappy · September 12, 2023, 6:54am

I'm trying to get the links for the 2 tables in this webpage:
https://www.ipdb.org/lists.cgi?puid=43799&browser=1694266618&list=top300

For some reason, neither a CSS selector nor an XPath expression seems to pull the tables directly, so I first pull the tables into a list and then extract each one. That gets me the tables, but I can't get the links. I've tried various combinations of selectors. If I include the 'a' part of the selector I get nothing; it's as if rvest can't see the links on this page. Or, more likely, I'm doing something wrong.

My code is below. Any help will be greatly appreciated because I've run out of ideas.

library(rvest)  # read_html(), html_nodes(), html_table(), html_attr()
library(dplyr)  # mutate()

url <- "https://www.ipdb.org/lists.cgi?anonymously=true&list=top300"
download.file(url, destfile = 'machines_top_300.html')
page <- read_html("machines_top_300.html")

# put the 3 tables into a list
top_300_tables <- page %>%
  html_nodes(xpath = "//table[.//th[contains(., 'Rank')]]")

# Get the table for the electronic machines (list item 2)
machines_top_300_electronic <- 
  top_300_tables[2] %>%
  html_table(fill = TRUE) %>% 
  as.data.frame() %>% 
  mutate(Category = "Electronic")

# Get the hrefs for the electronic machines
machines_top_300_electronic_links <- 
  top_300_tables[2] %>%
  html_nodes('tr > td:nth-child(3) > a') %>% 
  html_attr('href')

nirgrahamuk · September 12, 2023, 9:59am

The URL in your code is different from the page in your screenshot.
There are no links when I look at https://www.ipdb.org/lists.cgi?anonymously=true&list=top300
Perhaps you see a different view when you are logged in rather than browsing anonymously?
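
One way to check this from R, rather than from the browser's inspector, is to count the anchors inside the tables of the parsed document. The following is only a minimal sketch: the two inline HTML fragments stand in for a saved page with and without links, and the names `with_links`, `without_links`, `count_table_links` and the `example-*.html` hrefs are made up for illustration.

```r
library(rvest)

# Hypothetical stand-ins for the downloaded file: one where the third
# column holds links, one where it holds plain text only.
with_links <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Rating</th><th>Game</th></tr>
    <tr><td>1</td><td>9.1</td><td><a href="example-1.html">Game A</a></td></tr>
    <tr><td>2</td><td>9.0</td><td><a href="example-2.html">Game B</a></td></tr>
  </table>')

without_links <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Rating</th><th>Game</th></tr>
    <tr><td>1</td><td>9.1</td><td>Game A</td></tr>
    <tr><td>2</td><td>9.0</td><td>Game B</td></tr>
  </table>')

# Count the <a> elements inside any table that has a "Rank" header cell.
count_table_links <- function(page) {
  page %>%
    html_elements(xpath = "//table[.//th[contains(., 'Rank')]]//a") %>%
    length()
}

n_with <- count_table_links(with_links)
n_without <- count_table_links(without_links)
```

If the count for the file you actually saved is zero, the selector is probably not the problem; the links are simply not in the HTML that R received.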

tippytappy · September 12, 2023, 11:48am

Oh my goodness thank you! I was inspecting the online content and seeing links, and had assumed that's what I'd downloaded. But the downloaded page didn't have links. So of course R wasn't finding any. I downloaded the logged-in version and now have the links I needed.
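
Once the saved page does contain the links, it may help to pull them row by row so they stay aligned with the table even if some rows have no link. This is a sketch under assumptions, not the code I ended up with: the inline fragment, the `example-*.html` hrefs and the names `rows`, `links` and `games` are invented for illustration.

```r
library(rvest)

# Hypothetical fragment: the second data row has no link in column 3.
page <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Rating</th><th>Game</th></tr>
    <tr><td>1</td><td>9.1</td><td><a href="example-1.html">Game A</a></td></tr>
    <tr><td>2</td><td>9.0</td><td>Game B</td></tr>
    <tr><td>3</td><td>8.9</td><td><a href="example-3.html">Game C</a></td></tr>
  </table>')

# Keep only the data rows (rows that contain <td> cells).
rows <- page %>% html_elements(xpath = "//table//tr[td]")

# html_element() (singular) returns one result per row, with a missing
# node where nothing matches, so html_attr() yields NA for that row.
links <- rows %>%
  html_element("td:nth-child(3) > a") %>%
  html_attr("href")

# Parse the table itself and attach the hrefs as an extra column.
games <- page %>%
  html_element("table") %>%
  html_table()
games$href <- links
```

Using `html_elements()` (plural) with the same selector would instead return only the anchors that exist, so the result could be shorter than the table and could not be attached as a column directly.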

I feel very dopey for not spotting my error.
