I'm trying to extract the "href" from this xml_nodeset, but html_attr("href"), which usually works, won't work here. Any idea how I can extract the "href" part of this? Thanks!
library(rvest)
library(splashr)
sp <- start_splash()
page <- splashr::render_html(url = "https://www.nhl.com/gamecenter/phi-vs-bos/1974/05/07/1973030311#game=1973030311,game_state=final")
stop_splash(sp)
page %>% html_nodes('[class="name"]')
# {xml_nodeset (5)}
# [1] <div class="name"><strong><a href="https://www.nhl.com/player/wayne-cashman-8446002" data-player-link="8446002" ...
# [2] <div class="name"><strong><a href="https://www.nhl.com/player/gregg-sheppard-8451335" data-player-link="8451335 ...
# [3] <div class="name"><strong><a href="https://www.nhl.com/player/orest-kindrachuk-8448495" data-player-link="84484 ...
# [4] <div class="name"><strong><a href="https://www.nhl.com/player/bobby-clarke-8446098" data-player-link="8446098"> ...
# [5] <div class="name"><strong><a href="https://www.nhl.com/player/bobby-orr-8450070" data-player-link="8450070">Bob ...
page %>% html_nodes('[class="name"]') %>% html_attr("href")
# [1] NA NA NA NA NA
cderv
August 10, 2018, 1:40pm
Judging from your example, you need to go two levels below the current node to reach the <a> node, which is the one that carries the href attribute. Right now you are asking for href on the <div> of class "name", and that node does not have an href, hence the NA values.
You should use XPath or CSS selectors to get to these nodes, or navigate the XML structure using xml_children and friends.
I can't make an example because I do not have my computer right now. Hope it is clear enough.
cderv
August 10, 2018, 8:33pm
I can now.
With the selector "div.name > strong > a" it works. It selects every <a> that is a direct child of a <strong> that is itself a direct child of a <div> of class "name".
library(splashr)
#> Warning: package 'splashr' was built under R version 3.4.4
sp <- splash("192.168.99.100")
page <- render_html(sp, url = "https://www.nhl.com/gamecenter/phi-vs-bos/1974/05/07/1973030311#game=1973030311,game_state=final")
library(rvest)
#> Loading required package: xml2
page %>%
html_nodes("div.name > strong > a") %>%
html_attr("href")
#> [1] "https://www.nhl.com/player/wayne-cashman-8446002"
#> [2] "https://www.nhl.com/player/gregg-sheppard-8451335"
#> [3] "https://www.nhl.com/player/orest-kindrachuk-8448495"
#> [4] "https://www.nhl.com/player/bobby-clarke-8446098"
#> [5] "https://www.nhl.com/player/bobby-orr-8450070"
Created on 2018-08-10 by the reprex package (v0.2.0).
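If you do not have a Splash instance at hand, the same idea can be tried on a small inline snippet. This is only a sketch: the HTML below is hand-written to mimic the div > strong > a structure from the question (the link text and the reduced attribute set are my assumptions, not the real page). It compares the CSS selector with the two alternatives mentioned earlier, XPath and xml_children().

```r
library(xml2)
library(rvest)

# Hand-written stand-in for the rendered page (assumption: same nesting as the real one)
html <- '
<div class="name"><strong><a href="https://www.nhl.com/player/bobby-clarke-8446098">Bobby Clarke</a></strong></div>
<div class="name"><strong><a href="https://www.nhl.com/player/bobby-orr-8450070">Bobby Orr</a></strong></div>'
page <- read_html(html)

# The <div> nodes themselves carry no href, so this yields only NA
divs <- page %>% html_nodes("div.name")
on_div <- html_attr(divs, "href")

# 1. CSS selector: <a> directly under <strong> directly under div.name
via_css <- page %>%
  html_nodes("div.name > strong > a") %>%
  html_attr("href")

# 2. Equivalent XPath expression
via_xpath <- page %>%
  html_nodes(xpath = "//div[@class='name']/strong/a") %>%
  html_attr("href")

# 3. Walk down two levels from the <div> nodes: div -> strong -> a
via_children <- divs %>%
  xml_children() %>%
  xml_children() %>%
  xml_attr("href")

print(via_css)
```

All three approaches should return the same character vector of URLs on this snippet. The selector versions are usually easier to read; xml_children() is handy when the nesting is known but the tags are awkward to target.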
Goddamn, you're always so helpful :). Thanks, @cderv.