I just had to modify my scraping code again since yet another website migrated its COVID-19 data to the ubiquitous ArcGIS template. I hate those dashboards, especially the map with bubbles on it, which has to be the worst possible way to illustrate this data. But anyway, as a service to the community, I thought I would document how I scrape these sites, to help anyone else who may be trying to figure it out. I don't claim this is the best way to do it; all I can claim is that it works for me. This particular data happens to be for the Texas prison system.
To discover the magical XPath string that selects the desired part of the page, I use the built-in inspect tool in Chrome or Firefox: highlight the relevant section of the page, then right-click and choose Copy -> XPath. There are numerous online references covering how to do this.
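Once the page source is in hand (it gets captured by the code below), it's worth a quick sanity check that the copied XPath matches exactly one node, with something like:

xml2::read_html(parsed_pagesource) %>%
  rvest::html_nodes(xpath='//*[@id="ember194"]') %>%
  length() # should print 1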
I run several of these every evening on a cron job.
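For reference, the crontab entry looks something like this (the time and path here are just placeholders, adjust to taste):

17 20 * * * Rscript /path/to/scrape_prisons.R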
library(tidyverse)
library(stringr) # attached by tidyverse anyway, but loaded explicitly here
library(xfun) # because RSelenium needs it internally
url <- "https://txdps.maps.arcgis.com/apps/opsdashboard/index.html#/dce4d7da662945178ad5fbf3981fa35c"
# start the server and browser in headless mode
rD <- RSelenium::rsDriver(
  browser = "firefox",
  extraCapabilities = list("moz:firefoxOptions" = list(
    args = list("--headless")))
)
driver <- rD$client
# navigate to the URL
driver$navigate(url)
# give the javascript-heavy dashboard time to render before grabbing the source
Sys.sleep(9)
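# An optional, more defensive alternative to the fixed sleep is to poll until
# the target node actually appears (a sketch, with an arbitrary 30-second cap):
waited <- 0
while (length(driver$findElements(using = "xpath", '//*[@id="ember194"]')) == 0 &&
       waited < 30) {
  Sys.sleep(1)
  waited <- waited + 1
}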
# get parsed page source
parsed_pagesource <- driver$getPageSource()[[1]]
# close the browser
driver$close()
# stop the selenium server
rD$server$stop()
# Save in case the rest of the code crashes, like when they update the page on you
saveRDS(parsed_pagesource,
        paste0("/home/ajackson/Dropbox/Rprojects/Covid/DailyBackups/",
               lubridate::today(), "_ParsedPagePrisons.rds"))
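# (If the code below does crash, the backup can be reloaded later with readRDS
# and the rest of the script re-run from this point, e.g.
# parsed_pagesource <- readRDS("<the file saved above>") )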
#---------------------------------------------------------------------
# Extract prison info
#---------------------------------------------------------------------
result <- xml2::read_html(parsed_pagesource) %>%
  # select out the part of the page you want to capture
  rvest::html_nodes(xpath='//*[@id="ember194"]') %>%
  # convert it to a really long string, getting rid of the html
  rvest::html_text() %>%
  # there are a lot of newlines in there, let's clean them out
  str_replace_all("\n", " ") %>%
  # split the string on runs of two or more spaces, returning a list
  str_split("  +")
# get rid of the title and the extra line at the end
result <- result[[1]][3:(length(result[[1]]) - 1)]
# every other element of the list is a Unit, so let's combine each Unit name
# with the table it used to head, to get the first iteration of a data frame
res <- cbind.data.frame(split(result, rep(1:2, times = length(result)/2)),
                        stringsAsFactors = FALSE)
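# (To see what the split/rep trick does, try a toy example:
#   split(c("A", "1", "B", "2"), rep(1:2, times = 2))
# returns list(`1` = c("A", "B"), `2` = c("1", "2")), i.e. the odd elements
# become the first column and the even elements the second.)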
# assign some better names
names(res) <- c("Unit", "foo")
res <- res %>%
  # add a dash after each number for later splitting
  mutate(foo = str_replace_all(foo, "(\\d) ", "\\1 -")) %>%
  # remove all whitespace, some of which is tabs
  mutate(foo = str_remove_all(foo, "\\s+")) %>%
  # remove commas from the numbers
  mutate(foo = str_remove_all(foo, ",")) %>%
  # split the field into 12 pieces at the dashes
  separate(foo, letters[1:12], sep = "-") %>%
  # select out the numeric fields
  select(Unit, b, d, f, h, j, l) %>%
  # make them numeric
  mutate_at(c("b", "d", "f", "h", "j", "l"), as.numeric)
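# (Worked through on a made-up string, since the real ones come off the
# dashboard: "Offender Active Cases - 12 Offender Recovered - 34" becomes
# "OffenderActiveCases-12-OffenderRecovered-34" after the dash and whitespace
# steps, so separate() drops each count neatly into b, d, and so on.)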
# give every field a bright, shiny new name
names(res) <- c("Unit",
"Offender Active Cases",
"Offender Recovered",
"Employee Active Cases",
"Employee Recovered",
"Medical Restriction",
"Medical Isolation")
# add a field with today's date
res <- res %>% mutate(Date=lubridate::today())
# let's see what it looks like - this is for QC
res
# now save or do whatever.....
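# for example, append today's table to a running CSV to build a time series
# (the path here is just a placeholder):
# readr::write_csv(res, "path/to/prison_timeseries.csv", append = TRUE)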