Scraping past html comments with rvest

dcruvolo · July 13, 2018, 11:56am

Glad it worked out! I used Inspect Element and typed out the CSS path by reading the page source. Sorry, it looks like the previous version dropped some elements (I'm not sure what I was thinking by using span:nth-child(2)). I like your changes to the CSS path, and the data is in a better format too.

Where a broadcaster is shown as an image instead of text, you can extract the value of the bc attribute on the <stats-broadcaster-logo> element. The path to that element is defined below.

# set paths: for <span> and for <stats-broadcaster-logo>
path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster"
img.path <- paste0(path," > stats-broadcaster-logo")

Then, use the getElementAttribute method on each element to extract the value of the bc attribute.

# scrape elements
logo <- rsc$findElements(using = "css",value = img.path)

# extract text
imgs <- sapply(logo, function(x){ x$getElementAttribute("bc") })
imgs <- data.matrix(imgs)
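
If you'd rather end up with a plain character vector than a matrix, here is a minimal sketch. It assumes getElementAttribute returns a list of length one per element (or an empty list when the attribute is missing), which is how it has behaved for me. The helper first_or_na and the sample values are made up for illustration, so the snippet runs without a browser.

```r
# Hypothetical helper: take the first value of a returned list,
# or NA if nothing came back for that element.
first_or_na <- function(x) {
  if (length(x) == 0 || is.null(x[[1]])) NA_character_ else as.character(x[[1]])
}

# Simulated return values (made up), standing in for
# lapply(logo, function(x) x$getElementAttribute("bc"))
attr_values <- list(list("TNT"), list(), list("ESPN"))

# Flatten to a character vector, one entry per element
broadcasters <- vapply(attr_values, first_or_na, character(1))
broadcasters
```

With a live session, the same idea would be vapply(logo, function(x) first_or_na(x$getElementAttribute("bc")), character(1)).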

Here's the full R code.

# set up: start a Selenium server and open a Chrome session
library(RSelenium)
rsd <- RSelenium::rsDriver(browser = "chrome")
rsc <- rsd$client

# navigate to page
rsc$navigate("https://stats.nba.com/scores/04/11/2018")

# set paths: for <span> and for <stats-broadcaster-logo>
path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster"
img.path <- paste0(path," > stats-broadcaster-logo")

# scrape elements
el <- rsc$findElements(using = "css",value=path)
logo <- rsc$findElements(using = "css",value = img.path)

# extract text
out <- sapply(el, function(x){x$getElementText()})
channels <- data.matrix(out)

# extract attributes
imgs <- sapply(logo, function(x){ x$getElementAttribute("bc") })
imgs <- data.matrix(imgs)

# view
channels
imgs

# continue with transformations

# close all connections: end the browser session, then stop the Selenium server
rsc$close()
rsd$server$stop()
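
For the "continue with transformations" step, one option is to merge the text and the logo values into a single column. This is only a sketch: it assumes you have one text value and one bc value per game, in the same order, which the code above does not guarantee (logo only contains the cells that actually have an image). The vectors cell_text and cell_logo are made-up stand-ins.

```r
# Hypothetical inputs, one entry per game: the visible text (empty when the
# broadcaster is shown as an image) and the bc attribute (NA when absent).
cell_text <- c("TNT", "", "NBA TV", "")
cell_logo <- c(NA, "espn", NA, NA)

# Prefer the visible text; fall back to the logo attribute when it is empty
text_clean <- trimws(cell_text)
broadcaster <- ifelse(nzchar(text_clean), text_clean, cell_logo)
broadcaster
```

Games with neither text nor a logo stay NA, so they are easy to spot afterwards.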

Hope that helps!