I'm looking for a way to retry some web-scraping functions in a package I'm writing. Right now, the simplest method I've found is warrenr::persistently(), which works fine, but I'm trying to reduce my package's dependencies.
Any ideas?
If you want to see a reprex for whatever reason, here's a function that sometimes poses issues:
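Roughly this shape (the URL and selector below are placeholders, and assume the issue is an intermittent timeout or connection reset from the remote host):

```r
library(rvest)

# Placeholder URL and selector; the real function hits a remote page that
# intermittently fails with a timeout or connection reset.
scrape_page <- function(url = "https://example.com/table") {
  page <- read_html(url)                      # the call that sometimes fails
  html_table(html_element(page, "table"))
}
```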
I'm not sure it's a better solution than using warrenr::persistently(), though it does reduce the number of dependencies, since you're already using the tidyverse.
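For reference, purrr ships its own retry wrapper, insistently(), so a tidyverse-only version would look roughly like this (the rate settings and URL are just examples):

```r
library(purrr)
library(rvest)

# Wrap the flaky call so it retries with exponential backoff before giving up.
read_html_insist <- insistently(
  read_html,
  rate = rate_backoff(pause_base = 2, max_times = 5),
  quiet = FALSE
)

page <- read_html_insist("https://example.com/table")   # placeholder URL
```

If you want to drop purrr as well, the same idea is a short repeat loop around tryCatch() in base R.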
However, keep in mind that you should not overload their servers with requests. Also take a look at {polite} for scraping politely: https://github.com/dmi3kno/polite
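A minimal {polite} session looks roughly like this (the host and user-agent string are placeholders): bow() checks the site's robots.txt, and scrape() then fetches pages while respecting those rules.

```r
library(polite)
library(rvest)

# bow() introduces your scraper to the host and reads its robots.txt rules;
# scrape() then retrieves the page while respecting those rules and the delay.
session <- bow(
  "https://example.com",                                  # placeholder host
  user_agent = "my-package (maintainer@example.com)"
)

page  <- scrape(session)
links <- html_attr(html_elements(page, "a"), "href")
```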
Thanks for the reply. I actually wasn't familiar with robots.txt before this. Does that crawl-delay mean there will be a forced delay of 30 seconds? Or does it mean that if I don't set a manual 30-second delay, my request won't be fulfilled?
Neither, necessarily. robots.txt is purely advisory—a standardized way for sites to set suggested limits on scraping. That said, site admins can absolutely block your IP if you cause undue stress on their website. Scraping a few dozen pages is unlikely to catch anyone's notice, but scraping thousands of pages in parallel is much more likely to cause a problem. Obeying robots.txt lets scrapers get what they need without causing problems.
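And if you'd rather honor a Crawl-delay: 30 line yourself instead of pulling in {polite}, it's just a pause your own code adds between requests, since nothing on the server side enforces it:

```r
# Placeholder URLs for the pages being scraped.
urls <- c("https://example.com/p1", "https://example.com/p2")

pages <- lapply(urls, function(u) {
  Sys.sleep(30)                 # the 30-second crawl delay; your code supplies it
  xml2::read_html(u)
})
```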