I'm looking for an interesting example to illustrate rvest as a domain specific language for web scraping, and I'd love your help!
Shockingly, there's one that I liked, by Kevin Wong:
http://kevinfw.com/post/web-scraping-with-r/
n.b. added for preview purposes
Also Maëlle Salmon's Notable Wikipedia Deaths
And Alex Bresler's NBA Draft Scraping, the R Way is pretty phenomenal! It's older, but broken into stages and steps, and is just an out and out good example of dealing with messy web data…
I did this one, also a sports one but an example nonetheless...
- Ben.
I'm working on a few right now, but they are pretty boring... I.e. scraping the web... I.e. exactly what the package is intended to do.
There is a blog post by Colin Fay from ThinkR about discovering rvest, with several examples. It is clear and well explained, but right now it is only in French.
That post includes an interesting example of using rvest as a simple web crawler.
You may want a more complex use case that is less focused on discovering rvest, but this one could be useful, and it could easily be translated.
I just recently wrote a script to scrape the Michigan Department of Corrections. Not really sure what types of scripts/projects you're looking to see, but you can find the code here. Hope it helps.
Straight from my notes: scrape links from a specified column in a well-formatted HTML table.
library(rvest)
library(xml2)
# `url` (the page address) and `column_pattern` (a regex matching the
# header of the wanted column) must be defined before running this
# Get table
html_table <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[1]]
# Get column names in table
header <- html_table %>%
  html_nodes("thead") %>%
  html_nodes("th") %>%
  html_text()
# Find position of specific column
pos <- grep(column_pattern, header)
# Get table rows
rows <- html_table %>%
  html_nodes("tbody") %>%
  html_nodes("tr")
# Loop through rows and extract `href` attribute from `a` at position `pos`
links <- sapply(rows, function(x) {
  children <- xml2::xml_children(x)
  children[pos] %>% html_nodes("a") %>% html_attr("href")
})
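For anyone who wants to try the idea without a live page, here is a minimal sketch on an inline table. The HTML, the `Report` column name, and the paths are made up for illustration. It also swaps the row loop for a CSS `:nth-child()` selector, which is a different technique from the notes above and assumes every row has one cell per header column (no `colspan`).

```r
library(rvest)

# Hypothetical inline page standing in for `url`
page <- read_html('
<table>
  <thead><tr><th>Name</th><th>Report</th></tr></thead>
  <tbody>
    <tr><td>Alpha</td><td><a href="/reports/alpha.pdf">PDF</a></td></tr>
    <tr><td>Beta</td><td><a href="/reports/beta.pdf">PDF</a></td></tr>
  </tbody>
</table>')

# Hypothetical pattern for the column we want
column_pattern <- "Report"

tbl <- html_nodes(page, "table")[[1]]

# Header text tells us which column position to target
header <- tbl %>% html_nodes("thead th") %>% html_text()
pos <- grep(column_pattern, header)

# :nth-child() is 1-based, so `pos` can be used directly
links <- tbl %>%
  html_nodes(paste0("tbody tr td:nth-child(", pos, ") a")) %>%
  html_attr("href")

print(links)
```

If a cell can hold several links, this returns all of them in one flat vector, whereas the `sapply` version keeps them grouped by row.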
These two are my favorites:
I've been collecting international xc skiing results for many years (since before rvest or tidyverse) but adapted my minimal package for doing it to use rvest.
"Interesting" might be a bit of a stretch, but I used it on a project to detect changes on a webpage. It was basically watching for new versions of a piece of software.
I wrote this a few years back, when the function was just called html, without the read_ prefix. It is also about graph databases and modeling head-to-head competitions. It is also interesting to see the places where dplyr would make things much cleaner.
http://darrkj.github.io/blog/2014/nov182014/
Here is a small example that uses the rvest DSL to build a set of functions that keep the whole collection process minimal. It also touches on a few other items like spread/gather and map. I was also thinking it would be a good data set to use as a clustering example: how do calculated clusters differ from the AKC groups?
library(stringr)
library(tidyr)
library(purrr)
library(dplyr)
library(rvest)
# Function to take a url and return all of its links.
. %>% read_html %>% html_nodes('a') %>% html_attr('href') -> url_2_links
# get the tags for each
. %>% html_nodes('.characteristic') %>%
  html_text %>% grep(':', ., value = TRUE, invert = TRUE) -> get_tag
# and the number of stars
. %>% html_nodes('.star') %>% html_text -> get_score
# and the type of dog (breed group)
. %>% html_nodes('.inside-box') %>% .[2] %>%
  html_text %>%
  str_split('Dog Breed Group: ') %>% { .[[1]][2] } %>%
  str_split('Height: ') %>% { .[[1]][1] } -> get_type
# turn the breed page into a tibble
. %>% read_html %>% { tibble(tag   = get_tag(.),
                             score = get_score(.),
                             type  = get_type(.)) } -> url_2_data
# collect and filter out pages to turn to data
'http://dogtime.com/dog-breeds/groups/' %>%
  url_2_links %>%
  grep('breeds/', ., value = TRUE) %>%
  grep('dog-breeds/groups/', ., value = TRUE, invert = TRUE) -> breeds
# now apply the function to grab data to each url
breeds %>% map(url_2_data) -> df
breed <- breeds %>% str_split('/') %>% map_chr(tail, 1)
for (i in seq_along(df)) df[[i]]$breed <- breed[i]
# turn list of dfs to one df, and widen from key value
df %>% bind_rows %>%
  filter(score != ' ') %>%
  mutate(score = as.numeric(score)) %>%
  spread(key = tag, value = score) -> df
# the names are rough
name <- names(df)
names(df)[-c(1:2)] <- paste0('x', 1:(ncol(df)-2))
# Remove mutts and NA's
df %>% filter(breed != 'mutt') %>% na.omit -> df
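In case the `. %>% ... -> name` lines look odd: a pipe that starts with `.` is a magrittr functional sequence, so it defines a function rather than running anything. Here is a minimal sketch on made-up inline HTML (the `get_links` name and the markup are just for illustration):

```r
library(rvest)

# A pipe starting with `.` builds a function; nothing is scraped yet
get_links <- . %>% html_nodes("a") %>% html_attr("href")

# Hypothetical inline page, so no network access is needed
page <- read_html('<p><a href="/a">A</a> <a href="/b">B</a></p>')

# Call it like any other function
links <- get_links(page)
print(links)
```

This is equivalent to writing `get_links <- function(x) x %>% html_nodes("a") %>% html_attr("href")`, just shorter.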
I think the third one (Alex Bresler's) is very well explained.
Nevertheless, where I struggle most with this package is identifying the CSS nodes using SelectorGadget (check one example here).
Is there any way to make this process easier?
I just recently wrote a post about scientific journal article titles using rvest. It would be interesting to take this further and see what else can be done with it.
https://palfalvi.org/post/how-to-give-title-to-your-top-journal-science-article/
I used rvest to scrape the price of oil in the Czech Republic, and then make a geographic visualization (using tmap) to show the most expensive pumps (the line of red dots happens to copy the main Czech motorway)
# Initialization ----
library(rvest)
library(tmap)
library(tmaptools)
library(raster)
library(RCzechia) # set of shapefiles for the Czech Republic - devtools::install_github("jlacko/RCzechia")
library(stringr)
library(dplyr)
library(RColorBrewer)
url <- "http://benzin.impuls.cz/benzin.aspx?strana=" # url without page no.
frmBenzin <- data.frame() # empty data frame for data
bbox <- extent(republika) # a little more space around - enough for title and legend
bbox@ymax <- bbox@ymax + 0.35
bbox@ymin <- bbox@ymin - 0.15
# Scraping data ----
for (i in 1:56) { # Scrape data, translate and append to results
  impuls <- read_html(paste(url, i, sep = ''), encoding = "windows-1250")
  asdf <- impuls %>%
    html_table()
  frmBenzin <- rbind(frmBenzin, asdf[[1]])
}
# Cleaning data ----
frmBenzin$X1 <- NULL
colnames(frmBenzin) <- c("nazev", "obec", "okres","smes", "datum", "cena")
frmBenzin$cena <- gsub("(*UCP)\\s*Kč", "", frmBenzin$cena, perl = T) # regex is tricky - perl is safer
frmBenzin$cena <- as.double(frmBenzin$cena)
frmBenzin$datum <- as.Date(frmBenzin$datum, "%d. %m. %Y")
frmBenzin$okres <- gsub("Hlavní město\\s","",frmBenzin$okres)
frmBenzin$obec <- str_split(frmBenzin$obec, ",", simplify = T)[,1]
frmBenzin$key <- paste(frmBenzin$obec, frmBenzin$okres, sep = "/")
# Data wrangling ----
frmBenzinKey <- frmBenzin %>%
  select(key, cena, smes) %>%
  filter(smes == "natural95") %>% # only gasoline - no diesel
  group_by(key) %>%
  summarise(cena = mean(cena)) # average price in town
obce <- obce_body # from package RCzechia
obce$key <- paste(obce$Obec, obce$Okres, sep = "/") # shapefile: preparing a key to bind on
vObce <- c("Praha", "Brno", "Plzeň", "Ostrava") # big cities - these will be displayed by a polygon, not a point
obce <- obce %>%
  append_data(frmBenzinKey, key.shp = "key", key.data = "key") # binding by key
obce <- subset(obce, !is.na(obce$cena)) # throwing out towns with no known oil price
obce <- subset(obce, !obce$Obec %in% vObce) # throwing out the big cities
wrkObce <- obce_polygony[obce_polygony$Obec %in% vObce, ]
# Visualization at last... ----
nadpis <- "Oil price in the Czech Republic" # Chart title
leyenda <- "Natural 95" # Legend title
endCredits <- paste("data source: Rádio Impuls (http://benzin.impuls.cz), scraped as of", format(max(frmBenzin$datum), "%d.%m.%Y"), sep = " ")
tmBenzin <- tm_shape(obce, bbox = bbox) + tm_bubbles(size = 1/15, col = "cena", alpha = 0.85, border.alpha = 0, showNA = F, pal = "YlOrRd", title.col = leyenda) +
tm_shape(republika, bbox = bbox) + tm_borders("grey30", lwd = 1) +
tm_shape(wrkObce) + tm_borders("grey30", lwd = 0.5) +
tm_style_white(nadpis, frame = F, fontfamily = "Roboto", title.size = 2, legend.text.size = 0.6, legend.title.size = 1.2, legend.format = list(text.separator = "-", fun = function(x) paste0(formatC(x, digits = 0, format = "f"), " Kč"))) +
tm_credits(endCredits, position = c("RIGHT", "BOTTOM"), size = 0.6, col = "grey30")
print(tmBenzin)
You can use the developer tools inside the web browser. For example, if you hit F12 in Chrome, the HTML for the page will be displayed. You can then navigate through the HTML and locate the object you are looking for.
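Once you have spotted the element in the developer tools, its tag and class names translate fairly directly into a selector. Here is a rough sketch on made-up inline HTML (the `results`, `sidebar`, and `price` class names are assumptions for illustration, not taken from any real site), showing the same selection as CSS and as XPath:

```r
library(rvest)

# Hypothetical page: prices appear in the results list and in a sidebar
page <- read_html('
<div class="results">
  <div class="item"><span class="price">10</span></div>
  <div class="item"><span class="price">12</span></div>
</div>
<div class="sidebar"><span class="price">99</span></div>')

# CSS: only the price spans nested inside div.results
prices_css <- page %>%
  html_nodes("div.results span.price") %>%
  html_text()

# XPath: the same selection, useful when CSS cannot express the condition
prices_xpath <- page %>%
  html_nodes(xpath = "//div[@class='results']//span[@class='price']") %>%
  html_text()

print(prices_css)
print(prices_xpath)
```

Scoping the selector to a parent container (here `div.results`) is usually what keeps out stray matches like the sidebar price.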
I'd also recommend checking out Karthik Ram, @garrett, and Scott Chamberlain's UseR2016 tutorial Extracting data from the web APIs and beyond
I'm not sure it qualifies, but I bought my summer house with rvest. Or rather, I used rvest to figure out what I wanted to pay for the house at the auction. Because the house was auctioned off due to a default, and I couldn't inspect the inside before I bought it, I wanted to make sure that I didn't pay too much. I geocoded the results and tried to model the prices, but the main result was the distribution of the square meter prices. The distribution surprised me quite a bit, as the square meter prices were more spread out than I expected.
I wrapped the functions in a package and uploaded it to github.
Now I just wish Denmark had more days
Thanks for the ideas everyone!
Pulling NOAA weather buoy information
We have also used it in tandem with RSelenium to scrape public pricing info from an online platform. Unfortunately, I don't have the code handy, though.