I'm trying to scrape a table (I think it's HTML?), and I can't seem to find the right CSS selector to pull out the table of goals scored -- I just get {xml_nodeset (0)}
Any ideas? (also, please let me know if this is the type of question that I shouldn't be asking here)
Oh, ok— I couldn't see that far into the table and didn't know there was more (was thinking maybe you'd inadvertently been trying to select a column that wasn't there).
The data is being loaded with JavaScript. If you try to select tables in the scraped HTML, there aren't any:
library(rvest)
#> Loading required package: xml2
h <- read_html('http://www.uscho.com/recaplink.php?gid=1_970_20172018')
h %>% html_nodes('table')
#> {xml_nodeset (0)}
If you load it in a browser, depending on how fast your connection is, you'll also see a brief "Loading" message for each table, which also tells you the data isn't baked into the HTML originally. On the R side, you can scan through h %>% html_structure(), and you'll see that it looks different from the live page rendered in a browser and doesn't contain the information you need.
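To see that concretely with the h from above (the output is long, so it's not shown here):

# walk the document tree rvest actually received; there are no <table>
# nodes in it, because they get injected later by JavaScript
h %>% html_structure()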
The most direct way to get the data is to run the JavaScript just like your browser would, e.g. by scraping with RSelenium or splashr, and then grab the HTML. (After you scrape the source, you can still parse the HTML with rvest.)
There are sometimes clever ways around such an approach (RSelenium and splashr are decidedly heavier than rvest), but they require looking deeper into how the data is loaded.
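For completeness, the usual trick is to open the browser's developer tools, watch the Network tab while the page loads, and see whether the tables come from a JSON/XHR request you can hit directly. The endpoint below is a made-up placeholder just to show the shape of that approach, not the real one for this page:

library(httr)
library(jsonlite)

# hypothetical endpoint you'd copy out of the browser's Network tab
resp <- GET('http://www.uscho.com/made-up/endpoint.json')
goals <- fromJSON(content(resp, as = 'text', encoding = 'UTF-8'))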
Yeah, it's a bit of a bear. The examples in the docs are helpful, though; you can often adapt them to what you need. The package is object-oriented in a way that most R packages aren't; a lot of the functions you need are methods of the remote driver object. What works for me (but, annoyingly, may or may not work for you):
library(RSelenium)
library(rvest)

# start a Selenium server plus a browser session to drive
rd <- rsDriver()
rd$client$navigate('http://www.uscho.com/recaplink.php?gid=1_970_20172018')

# grab the rendered page source (a list of length 1) and parse it
h <- rd$client$getPageSource()
h <- h[[1]] %>% read_html()

# shut down the browser and the server once you have the HTML
rd$client$close()
rd$server$stop()
rm(rd)
boxgoals <- h %>%
  html_node('#boxgoals') %>%
  html_table()
boxgoals
#> Per Team Scorer Assist 1 Assist 2 Goal Type Time
#> 1 1 Boston College-1 Connor Moore Mike Booth Casey Carreau 15:30
#> 2 2 Providence-1 Erik Foley Spenser Young 4x4 08:22
#> 3 2 Providence-2 Ben Mirageas Scott Conway Spenser Young GWG PPG 5x4 19:14
This works, but it's sort of a pain. splashr is a newer alternative built to contain a lot of the messiness inside Docker. Also nicely, its render_html function returns an xml2 object like rvest uses, so it integrates directly. Note that you'll need to install and start Docker before the following will work.
library(splashr)
library(rvest)
# install_splash() # run this once to install the docker image
sp <- start_splash()
pg <- render_html(url = 'http://www.uscho.com/recaplink.php?gid=1_970_20172018')
stop_splash(sp)
boxgoals <- pg %>%
  html_node('#boxgoals') %>%
  html_table()
boxgoals
#> Per Team Scorer Assist 1 Assist 2 Goal Type Time
#> 1 1 Boston College-1 Connor Moore Mike Booth Casey Carreau 15:30
#> 2 2 Providence-1 Erik Foley Spenser Young 4x4 08:22
#> 3 2 Providence-2 Ben Mirageas Scott Conway Spenser Young GWG PPG 5x4 19:14
There's much more to using Docker fully, of course. Here's a nice tutorial to get you started. In this case, you don't really need to know much, but it is important to realize that install_splash will download a 1.2 GB Docker image to your machine. The above tutorial explains how to delete it afterwards if you want your disk space back.
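If you just want the cleanup command without reading the whole tutorial, it's roughly this (the image name is the one Splash normally ships under; yours may differ by version):

# from a shell, or from R via system(): list the images, then remove the Splash one
system("docker images")
system("docker rmi scrapinghub/splash")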
For future reference, I just spent hours going insane, trying to figure out why I couldn't use Docker. I'm on Windows 10 and I needed to enable virtualization. So if you need to do that, google "enable virtualization windows 10" and it should help you.
You might have a look at PhantomJS. It's a headless browser that should allow you to render and then save pages, then scrape the saved page, with tables now in HTML.
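If you go that route, the flow would be roughly: write a small PhantomJS script that loads the URL, waits for the JavaScript to finish, and writes page.content to a file, then read that file back into R. A sketch, assuming phantomjs is on your PATH and that save_page.js and rendered_page.html are placeholder names for your script and its output:

library(rvest)

# run the PhantomJS script that renders the page and saves the result to disk
system("phantomjs save_page.js")

# then parse the saved, fully rendered HTML as usual
h <- read_html("rendered_page.html")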
Take a look at decapitated via gitlab.com/hrbrmstr/decapitated (or github for legacy code sharing service users). It's much less complex than splashr and may get you what you need.
PhantomJS is also in "perhaps the community will keep it going" mode ever since headless Chrome (what decapitated uses) came on the scene.
I'd strongly suggest (for a number of reasons) using the decapitated::download_chromium() function. After doing so, it will tell you the environment variable setting you need to add to ~/.Renviron. That way the browser automation ops are kept separate from your main Chrome binary, so there's no possible corruption of your own Chrome profile and no chance it will ever not be "headless" (and it also means you can ditch the Google-spying Chrome and use the far superior Firefox Developer Edition).
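Setup is basically one call plus one line in ~/.Renviron (the exact path and variable it reports will depend on your system):

# one-time: fetch a standalone Chromium binary for decapitated to drive
decapitated::download_chromium()
# then copy the setting it prints (something like HEADLESS_CHROME=/path/to/chromium)
# into ~/.Renviron and restart R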
At this point I'd probably recommend using hrbrmstr's decapitated package he linked above, which is less of a pain than the other options. Install the package, configure it (meaning probably: use the helper to install Chromium, set the environment variable in ~/.Renviron, and restart R), and then you can use chrome_read_html to grab an xml2 object you can parse normally with rvest.
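Once it's configured, the whole thing collapses to something like this (same selector as the earlier examples):

library(decapitated)
library(rvest)

# headless Chrome renders the page and hands back an xml2 document
pg <- chrome_read_html('http://www.uscho.com/recaplink.php?gid=1_970_20172018')

pg %>%
  html_node('#boxgoals') %>%
  html_table()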