Best practices (security-wise) for html scraping with rvest?

matthieu · June 23, 2020, 10:32pm

Hi, this is my first post on RStudio Community, and I hope this is the right place to ask my question. Please do let me know if I should move this to a more suitable location! I have a very general and naive question about best security practices to follow when scraping html pages from R.

Context:

I am currently working on an R package project to search and download works from the Archive of Our Own (AO3), which is a volunteer-run archive of transformative fanworks such as fanfiction. I was inspired by the gutenbergr package by David Robinson, which is a tool I really like! This is a pet project of mine, but I think it might be useful for e.g. text mining of the fanfiction works produced by various fandom communities, to study how they relate to or depart from the original works. I am still discovering and learning more about AO3, but it seems to me that such a large corpus of community-made works could be a great resource for many natural language processing analyses.

The goal of my R package would be to provide simple tools to search the AO3, to parse some AO3 pages to get exhaustive metadata associated with a given fandom or tag, and to allow the user to download works of interest once those have been identified.

As a toy example, here is a short piece of code that retrieves and parses some html from AO3 using rvest:

# This code downloads and parses the list of fandoms in the "Books &
# Literature" section of the Archive of Our Own.

# Packages
library(curl)
library(rvest)
library(tidyverse)

# Helper function to parse fandom data stored in a <li></li> node
parse_fandom <- function(li_node) {
    fandom <- rvest::html_children(li_node)
    stopifnot(length(fandom) == 1)
    fandom <- fandom[[1]]
    count <- strsplit(rvest::html_text(li_node), "\n")[[1]]
    count <- count[grepl("^[ ]*[(][0-9]*[)]$", count)]
    stopifnot(length(count) == 1)
    count <- gsub("[()]", "", gsub(" ", "", count))
    count <- as.numeric(count)
    href <- rvest::html_attr(fandom, "href")
    fandom <- rvest::html_text(fandom)
    return(tibble::tibble(fandom = fandom, count = count, href = href))
}

# Download the html content of the "Books & Literature" index page of AO3
example_url <- "https://archiveofourown.org/media/Books%20*a*%20Literature/fandoms"
page <- curl_fetch_memory(example_url)
content <- rawToChar(page$content)
html <- read_html(content)

# Extract data for each fandom listed in the html
# (takes a few seconds)
fandoms <- html_nodes(html, xpath = "//li[a[@class='tag']]")
z <- bind_rows(lapply(fandoms, parse_fandom))

# z is a tibble with our metadata of interest
z

# Plotting the work counts per fandom versus their rank (just for fun)
ggplot(z %>% arrange(desc(count)),
       aes(x = seq_len(nrow(z)), y = count)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10")

My main question:

Should I take any precaution when parsing html pages from a third-party website, especially a website like AO3 where pages are generated from material posted by users?

I know that one should generally be careful when parsing user data and use some sanitization code to avoid things such as SQL injection when the user-provided text is used for some action. I am not aware of any mechanism by which malicious code could be put into a served html page and have bad effects when the html is parsed as text, but I have no expertise whatsoever in this respect, and that's why I was hoping to hear from more knowledgeable people in this forum.
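To make the concern concrete, here is a minimal sketch of what parsing does with embedded script content. The `untrusted` string is made up for illustration and does not come from AO3. As far as I understand, `read_html()` only builds a document tree and does not run any JavaScript, so the script body comes back as plain text:

```r
library(rvest)

# Made-up example markup, standing in for a page with user-submitted content.
untrusted <- "<html><body><p>Hello</p><script>alert('x')</script></body></html>"

doc <- read_html(untrusted)

# The <script> element is just another node in the parsed tree; nothing runs it.
script_text <- html_text(html_nodes(doc, xpath = "//script"))
para_text <- html_text(html_nodes(doc, xpath = "//p"))

# Both results are ordinary character vectors.
is.character(script_text)
para_text
```

The extracted strings would only become risky if they were later passed to something that interprets them, for example `eval(parse(text = ...))`, `system()`, or an SQL query built with `paste()`.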

What I have searched and found:

Some basic Google searches about security and web-scraping didn't return anything useful to me (most of the returned results were actually about how to make web scraping of your own website difficult or impossible).
I have looked into the vignettes for the curl and rvest R packages, but I haven't found any warning about precautions to take during web scraping.
I checked in the source code of the gutenbergr package, and for example the function gutenberg_download() calls read_url_file(), which itself is simply a wrapper around utils::download.file() and readr::read_lines(), but I didn't see any extra steps taken to validate/sanitize the returned text.
There are several Python projects out there providing AO3 scraping capabilities, including one called ao3 on PyPI. I have (quickly) browsed through some of its code, but I didn't find any indication of any special precaution taken when parsing html content (but again my knowledge is limited, and maybe the Beautiful Soup library they are using might take care of this behind the scenes?).

What I would like to do safely from AO3 html pages:

Parse metadata listed in tag section pages e.g. Sherlock Holmes - Arthur Conan Doyle - Works | Archive of Our Own
Using metadata from the pages above, build urls pointing to each work to allow for downloading the actual work text. Those urls would be built by concatenating the url root https://archiveofourown.org/works/ and the work ID parsed from the metadata.
Ultimately, store the parsed data and/or downloaded text into tibbles, and return those tibbles to the R user.
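The url-building step above can be sketched as follows. This is only a sketch under one assumption that I have not verified: that valid AO3 work IDs consist of digits only. The helper name `build_work_url()` is hypothetical.

```r
# Hypothetical helper: build a work url from an ID parsed out of scraped metadata.
# Assumption (not verified): valid AO3 work IDs consist of digits only.
build_work_url <- function(work_id) {
  work_id <- as.character(work_id)
  # Reject anything that is not purely numeric, so that scraped text cannot
  # add extra path segments or query strings to the url.
  ok <- grepl("^[0-9]+$", work_id)
  if (!all(ok)) {
    stop("Invalid work ID(s): ", paste(work_id[!ok], collapse = ", "))
  }
  paste0("https://archiveofourown.org/works/", work_id)
}

build_work_url("12345")
```

Validating the ID against a strict pattern before concatenation means that a malformed or unexpected value parsed from a page stops with an error instead of silently producing a request to an unintended address.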

Any advice about precautions I should take when crunching html pages, or pointer towards useful resources, would be very welcome!

My apologies for the lengthy post! I realize I might be overly cautious in asking about this, but given my lack of expertise in this domain I would prefer to ensure that my R code does not create any security risk for a potential R user in the future.

Thank you!
