Web scraping: rvest output is a List of 0

martinocrippa · April 29, 2019, 8:05am

Morning,
I'm trying to scrape some data from SoFifa.com, and I ran into a problem parsing a button that contains a list of hyperlinks.
My goal is to capture the values from this button's drop-down menu and then, for each hyperlink, parse some objects. I have no problems with single items on the menu, but I can't find any way to get the values I am interested in. Does anyone have ideas?

EXAMPLE and TEST:
Webpage is

button circled in red.

If I try a CSS selector or XPath on the individual values of the button's list, I get a value only for the button label; for the values I am interested in, R gives me:
{xml_missing}

Here is some simple code to test:

library(rvest)  # provides html_node(), html_text() and the %>% pipe

# insert URL
url <- "https://sofifa.com/player/230621"

#parsing html
html <- xml2::read_html(url)

#test history button label
test <- html %>% html_node("#version-jump > option:nth-child(1)") %>% 
                 html_text()

#test history button values
test <- html %>% html_node("#version-jump > option:nth-child(2)")

I tried to inspect the object, but I don't understand how to pick out the individual values so that I can write a function that collects all the hyperlinks.

Thank you so much for any help

MC

cderv · May 1, 2019, 8:12am

The page you are trying to scrape is loaded dynamically by a JavaScript script.
You can see this because the HTML you get contains only one option node under #version-jump, so you get nothing when asking for the second one.

library(rvest)
#> Loading required package: xml2
url <- paste0("https://sofifa.com//player/230621")
html <- xml2::read_html(url)

html %>% html_nodes("#version-jump")
#> {xml_nodeset (1)}
#> [1] <select id="version-jump" class="form-select redirect"><option value ...
html %>% html_nodes("#version-jump > option")
#> {xml_nodeset (1)}
#> [1] <option value="">History Version</option>

Created on 2019-05-01 by the reprex package (v0.2.1.9000)
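To make the link with the {xml_missing} result from the question explicit, here is a minimal offline sketch. It assumes a hand-written HTML string (the name `static` is made up for illustration) that mimics what read_html() receives before any JavaScript has run: a select with only the placeholder option.

```r
library(rvest)

# Hypothetical stand-in for the HTML served before JavaScript fills the menu
static <- read_html(
  '<select id="version-jump"><option value="">History Version</option></select>'
)

# html_nodes() returns every match: only the placeholder option is present
opts <- html_nodes(static, "#version-jump > option")
length(opts)

# html_node() returns a single match, or an xml_missing object when nothing matches
second <- html_node(static, "#version-jump > option:nth-child(2)")
inherits(second, "xml_missing")

# Extracting text from a missing node gives NA instead of an error
html_text(second)
```

So {xml_missing} here means the selector itself is fine, but the node does not exist in the HTML that was downloaded.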

You need to use a package that can scrape JS-rendered websites. There are several options:

Using Selenium: RSelenium, https://ropensci.github.io/RSelenium/
Using the Splash JS rendering service: hrbrmstr/splashr on GitHub (Tools to Work with the 'Splash' JavaScript Rendering Service in R)
Using the Chrome DevTools Protocol at the command line: hrbrmstr/decapitated on GitHub (Headless 'Chrome' Orchestration in R)
Using the Chrome DevTools Protocol from R directly: RLesur/crrri on GitHub (A Chrome Remote Interface written in R; in development, not yet stable)
Using the Java library HtmlUnit: hrbrmstr/htmlunit on GitHub (Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library)

Not all of these options will necessarily work, but some will.

Example with decapitated:

library(decapitated)
library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- chrome_read_html(url)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"

Created on 2019-05-01 by the reprex package (v0.2.1.9000)

Example with crrri:

It is low-level for now and still in development, so it may change quickly, but it lets you control the Chrome browser directly from R.
A dump_DOM function needs to be created to get the HTML rendered by JS, which can then be read with rvest. Such helper functions should become available soon.

library(crrri)

dump_DOM <- function(url) {
  # requires crrri to be configured so that it can find Chrome
  chrome <- Chrome$new()
  on.exit(chrome$close())
  client <- hold(chrome$connect())
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime
  Page$enable() %...>% {
    Network$enable()
  } %...>% {
    Network$setCacheDisabled(cacheDisabled = TRUE)
  } %...>% {
    Page$navigate(url)
  } %...>% {
    Page$loadEventFired()
  } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %>% {
    hold(.)$result$value
  }
}

dom <- dump_DOM(url = "https://sofifa.com/player/230621")
#> Running "C:/Users/chris/Documents/Chrome/chrome-win32/chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-rouneflg" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox
library(rvest)
#> Loading required package: xml2
html <- read_html(dom)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"

Created on 2019-05-01 by the reprex package (v0.2.1.9000)
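Once the rendered HTML is available, collecting every entry of the menu could look roughly like the sketch below. It is only an illustration under assumptions: `rendered` is a hand-written stand-in for the JS-rendered page, and the option values are made up, so the real value attributes on sofifa.com may have a different shape.

```r
library(rvest)

# Hypothetical stand-in for the rendered page; the option values are invented
rendered <- read_html('
<select id="version-jump">
  <option value="">History Version</option>
  <option value="/player/230621?v=a">Apr 25, 2019</option>
  <option value="/player/230621?v=b">Apr 22, 2019</option>
</select>')

# Select all options at once instead of one nth-child at a time
options <- html_nodes(rendered, "#version-jump > option")

versions <- data.frame(
  label = html_text(options),           # visible text, e.g. the date
  href  = html_attr(options, "value"),  # link target stored in the value attribute
  stringsAsFactors = FALSE
)

# Drop the placeholder entry, which has an empty value
versions <- versions[versions$href != "", ]

# Resolve relative paths against the site root
versions$url <- xml2::url_absolute(versions$href, "https://sofifa.com/")

versions
```

With the real page, `rendered` would be replaced by the `html` object from one of the examples above, and each entry of `versions$url` could then be passed to the same rendering function to parse the individual version pages.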

billyi · May 2, 2019, 1:53am

Another package you may want to try is webdriver: https://cran.r-project.org/web/packages/webdriver/

cderv · May 4, 2019, 9:48am

webdriver is a great package. It works well with PhantomJS, but the problem is that development of the PhantomJS project has been suspended:

github.com/ariya/phantomjs, issue "Archiving the project: suspending the development" (opened 05:16PM - 03 Mar 18 UTC by ariya, labeled meta)

Due to the lack of active contribution, I am going to [archive](https://help.git…hub.com/articles/about-archiving-repositories/) this project soon. At some point in the future, if we pick up the development again (such as #15341, #15342, #15343), the project will be unarchived.

With that, all the earlier plans regarding PhantomJS 2.5 (from @Vitallium) or 2.1.x (from @pixiuPL) will be **abandoned** effective immediately. Consequently, the source and binary packages for the above abandoned version will be removed to avoid any confusions. PhantomJS version **2.1.1** will remain the last known stable release until further notice.

To keep the source repository in a sane situation:

* the master branch will be preserved under the new `bleeding-edge` branch.
* after that, the master branch will be restored back to the approximate state of v2.1.1

That way, if anyone wants to improve PhantomJS for their own usages (internal fork, in-house patches), the master branch can still serve as a good baseline. It can still be built from source and it will still work on macOS, Linux, and Windows.

Once the project is archived, no new issue can be filed. However, feel free to use the [mailing-list](https://groups.google.com/forum/#!forum/phantomjs) to post questions and discuss any relevant topics.

Thank you for your understanding!

system · May 25, 2019, 10:00am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.