Morning,
I am trying to scrape some data from SoFifa.com, and I have run into a problem parsing a button that contains a list of hyperlinks.
My goal is to capture the values from this button with a drop-down menu and then parse some objects for each hyperlink. I have no problems with single items, but I can't find any way to get the values I am interested in. Does anyone have ideas?
If I try a CSS selector or XPath on the individual values of the button list, I only get the button label; for the values I am interested in, R gives me:
{xml_missing}
Here is some simple code to test:
library(rvest)
# insert URL
url <- "https://sofifa.com/player/230621"
# parse the HTML
html <- xml2::read_html(url)
# test history button label
test <- html %>% html_node("#version-jump > option:nth-child(1)") %>%
  html_text()
# test history button values
test <- html %>% html_node("#version-jump > option:nth-child(2)")
I tried inspecting the object, but I don't understand how to extract the individual values so that I can write a function that collects all the hyperlinks.
The page you are trying to scrape is loaded dynamically by JavaScript.
You can see this because the HTML you get back contains only one option node under #version-jump, so you get nothing when you ask for the second one:
library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- xml2::read_html(url)
html %>% html_nodes("#version-jump")
#> {xml_nodeset (1)}
#> [1] <select id="version-jump" class="form-select redirect"><option value ...
html %>% html_nodes("#version-jump > option")
#> {xml_nodeset (1)}
#> [1] <option value="">History Version</option>
Created on 2019-05-01 by the reprex package (v0.2.1.9000)
You need to use a package that can scrape JavaScript-rendered websites. There are several options.
Not all of these options will necessarily work, but some will.
Example with decapitated:
library(decapitated)
library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- chrome_read_html(url)
html %>%
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295
html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"
Created on 2019-05-01 by the reprex package (v0.2.1.9000)
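Once the rendered HTML is available, collecting every entry of the drop-down is a matter of reading the text and the value attribute of each option. The following is only a minimal sketch: the inline HTML is a made-up stand-in for the rendered page (so it runs without Chrome or network access), the option values are hypothetical, and it assumes that each value holds a path relative to the site root, which the truncated output above does not confirm.

```r
library(rvest)

# Hypothetical stand-in for the JS-rendered page: same id as on the site,
# but the option values below are invented for illustration only.
rendered <- xml2::read_html('
<select id="version-jump">
  <option value="">History Version</option>
  <option value="/player/230621?v=1">Apr 25, 2019</option>
  <option value="/player/230621?v=2">Apr 18, 2019</option>
</select>')

# Select every option of the drop-down at once
opts <- html_nodes(rendered, "#version-jump > option")

# One row per option: visible label and the value attribute
versions <- data.frame(
  label = html_text(opts),
  path  = html_attr(opts, "value"),
  stringsAsFactors = FALSE
)

# Drop the placeholder entry, whose value is empty
versions <- versions[versions$path != "", ]

# Assumption: values are paths relative to the site root
versions$url <- xml2::url_absolute(versions$path, "https://sofifa.com/")
versions
```

With the real page, `rendered` would be replaced by the object returned by chrome_read_html(url), and the url column could then be looped over to parse each version.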
Example with crrri:
It is low level for now and still in development, so it may change quickly, but it lets you control the Chrome browser directly from R.
A dump_DOM function needs to be created to get the JavaScript-rendered HTML so that it can then be read with rvest. A new package should contain those functions soon.