Anyone have a good way to retry a function?

I'm looking for a way to retry some web-scraping functions in a package I'm writing. Right now, the simplest method I've found is warrenr::persistently(), which works fine, but I'm trying to reduce my package's dependencies.

Any ideas?

If you want to see a reprex for whatever reason, here's a function that sometimes poses issues:

library(tidyverse)
library(rvest)     # read_html(), html_nodes(), html_attr(), html_text()
library(progress)

get_teams <- function(.league, .season, .progress = FALSE, ...) {
  
  # Normalise league names: spaces become hyphens to match the site's URLs
  leagues <- .league %>% 
    as_tibble() %>% 
    set_names(".league") %>% 
    mutate(.league = str_replace_all(.league, " ", "-"))
  
  seasons <- .season %>%
    as_tibble() %>%
    set_names(".season")
  
  # One row per league/season combination to scrape
  mydata <- tidyr::crossing(leagues, seasons)
  
  if (.progress) {
    pb <- progress::progress_bar$new(
      format = ":what [:bar] :percent eta: :eta",
      clear = FALSE,
      total = nrow(mydata),
      width = 60
    )
  }
  
  league_team_data <- map2_dfr(mydata[[".league"]], mydata[[".season"]], function(.league, .season, ...) {
    
    if (.progress) {pb$tick(tokens = list(what = "get_teams()"))}
    
    # Random 5-10 second delay between requests so the server isn't hammered
    seq(5, 10, by = 0.001) %>%
      sample(1) %>%
      Sys.sleep()
    
    page <- str_c("https://www.eliteprospects.com/league/", .league, "/", .season) %>% 
      read_html()
    
    team_url <- page %>% 
      html_nodes("#standings .team a") %>% 
      html_attr("href") %>%
      str_c(., "?tab=stats") %>%
      as_tibble() %>%
      set_names("team_url")
    
    team <- page %>%
      html_nodes("#standings .team a") %>%
      html_text() %>%
      str_trim(side = "both") %>%
      as_tibble() %>%
      set_names("team")
    
    league <- page %>%
      html_nodes("small") %>%
      html_text() %>%
      str_trim(side = "both")
    
    # Convert e.g. "2017-2018" to "2017-18"
    season <- str_split(.season, "-", simplify = TRUE, n = 2)[,2] %>%
      str_sub(3, 4) %>%
      str_c(str_split(.season, "-", simplify = TRUE, n = 2)[,1], ., sep = "-")
    
    all_data <- team %>%
      bind_cols(team_url) %>% 
      mutate(league = league) %>%
      mutate(season = season)
    
    return(all_data)
    
  })
  
  return(league_team_data)
  
}

You can try purrr::possibly(). I wrote a blog post that details this approach: http://www.brodrigues.co/blog/2018-03-12-keep_trying/

I'm not sure it's a better solution than using warrenr::persistently() though (it does reduce the number of dependencies since you're already using the tidyverse).
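
For what it's worth, the rough shape of that approach looks something like this (the wrapper name, attempt count, and delay below are placeholders, not the exact code from the post):

library(purrr)
library(rvest)

# Hypothetical helper: retry read_html() up to max_tries times,
# returning NULL if every attempt fails
read_html_persistently <- function(url, max_tries = 5, sleep = 30) {
  safe_read_html <- possibly(read_html, otherwise = NULL)
  result <- NULL
  try_number <- 1
  while (is.null(result) && try_number <= max_tries) {
    result <- safe_read_html(url)
    if (is.null(result)) {
      Sys.sleep(sleep)  # wait before the next attempt
      try_number <- try_number + 1
    }
  }
  result
}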

However, keep in mind that you should not overload their servers with calls. Also take a look at {polite} for scraping politely: https://github.com/dmi3kno/polite
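
For reference, a minimal {polite} sketch, assuming the package's bow()/scrape() workflow (the URL is just the one from your reprex):

library(polite)
library(rvest)

# bow() reads the site's robots.txt and records its crawl-delay;
# scrape() then honours those rules on each request
session <- bow("https://www.eliteprospects.com/league/nhl/2017-2018")
page    <- scrape(session)

page %>% 
  html_nodes("#standings .team a") %>% 
  html_text()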

That site has a crawl-delay of 30s, so set parameters accordingly if scraping multiple pages.

httr::RETRY may also be useful for intermittently functional pages. httr is a dependency of rvest, so it won't add to your dependency tree.
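
Something along these lines, for example (the times/pause settings are illustrative, picked to line up with the 30s crawl-delay mentioned above):

library(httr)

resp <- RETRY(
  "GET",
  "https://www.eliteprospects.com/league/nhl/2017-2018",
  times = 5,        # give up after five attempts
  pause_base = 30,  # exponential backoff starting around the crawl-delay
  pause_cap = 120
)
stop_for_status(resp)
page <- content(resp)  # text/html responses are parsed into an xml2 document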

Thanks for the reply. I actually wasn't familiar with robots.txt before this. Does that crawl-delay mean there will be a forced delay of 30 seconds? Or does it mean that if I don't set a manual 30-second delay, my request won't be fulfilled?

Awesome. I'm gonna look into all of that. Nice blog post, by the way. I really enjoyed reading it.

Neither, necessarily. robots.txt is purely advisory—a standardized way for sites to set suggested limits on scraping. That said, site admins can absolutely block your IP if you cause undue stress on their website. Scraping a few dozen pages is unlikely to catch anyone's notice, but scraping thousands of pages in parallel is much more likely to cause a problem. Obeying robots.txt lets scrapers get what they need without causing problems.

More info on robots.txt:

Some principles of scraping responsibly: https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01

Example in R:

Package for checking robots.txt from R: {robotstxt} (https://github.com/ropensci/robotstxt)
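
As a quick illustration of that last one, checking a path and the crawl-delay with {robotstxt} might look roughly like this (assuming its paths_allowed()/robotstxt() interface):

library(robotstxt)

# Is the league page allowed for generic bots?
paths_allowed("https://www.eliteprospects.com/league/nhl/2017-2018")

# Inspect the crawl-delay the site asks for
rt <- robotstxt(domain = "www.eliteprospects.com")
rt$crawl_delay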

Thanks for the info! Definitely good to know while making a web-scraping package :)