Rvest, how do you use it to scrape data regularly?

peacefultom · September 27, 2017, 1:18pm

One of the main use cases for web scraping, at least for me, would be to scrape some data from a website and then keep the dataset updated when the website updates. How do you deal with:

Running the script in the cloud and scheduling? I wish there was something like AWS lambda for R. Keeping an instance running 24/7 when I only need to run the script once a day seems overkill.

Updating the dataset with only new entries? Ideally one would only download new entries but it seems pretty hard. If you get duplicates you might need to merge them with the old data somehow.

martj42 · September 27, 2017, 4:29pm

I use taskscheduleR (https://github.com/bnosac/taskscheduleR) to run a few things a few times a day on my laptop.

JidduAlexander · September 27, 2017, 4:33pm

I have some similar things running.

Updating the dataset with only new entries shouldn't be a problem. If the duplicated rows are exactly the same in both datasets, then a dplyr::full_join(old_data, new_data) should take care of that.
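To make that concrete, here is a minimal sketch of the full_join idea. The two tibbles and their columns (id, title) are made up for illustration; the only assumption is that overlapping rows are identical in every column.

```r
library(dplyr)

# Hypothetical data: an earlier scrape and a fresh one that overlaps it
old_data <- tibble(id = c(1, 2), title = c("first", "second"))
new_data <- tibble(id = c(2, 3), title = c("second", "third"))

# Joining on every shared column (the default when `by` is omitted;
# spelled out here to avoid the message) keeps the identical row only once
combined <- full_join(old_data, new_data, by = c("id", "title"))
```

Note that if a row changed between scrapes (same id, different title), both versions end up in the result, so this only works for exact duplicates.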

For scheduling I haven't found a solution, simply because the two minutes it takes to run the script manually haven't been enough of a hassle to go looking for one. But I might try the taskscheduleR package that martj42 just mentioned.

Best,
Jiddu

Ranae · September 27, 2017, 6:27pm

Probably overkill if your result is just one dataset, but I have seen people use Travis CI to automate regular web scraping.

jlacko · September 27, 2017, 9:01pm

I have a script that needs to be run hourly - technically not using rvest, but twitteR; in principle rather similar to what you describe.

To achieve this I have set up an AWS instance (the Free Tier is rather accommodating, so my expense is about $2 per month) running RStudio Server, and I use the cronR package to set up regular cron jobs.

The cronR package integrates nicely with RStudio as an add-in and does exactly what I require: it executes an R script at regular intervals. Nothing more, nothing less. Highly recommended!
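For anyone who prefers code to the add-in, a rough sketch of the same thing from the console might look like the following. It assumes cronR is installed on a Linux or macOS machine with cron available; the script path, job id and description are placeholders.

```r
# Assumes the cronR package is installed and cron is available.
# The script path and job id below are hypothetical.
schedule_scraper <- function(script_path, job_id = "hourly_scrape") {
  # Build the shell command that runs the script with Rscript and logs output
  cmd <- cronR::cron_rscript(script_path)
  # Register the command in the current user's crontab, once per hour
  cronR::cron_add(
    command = cmd,
    frequency = "hourly",
    id = job_id,
    description = "Scrape the site and update the dataset"
  )
}

# Not run here, since it edits the crontab:
# schedule_scraper("/home/user/scrape.R")
```

The job can later be removed with cronR::cron_rm() using the same id.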

As for updating the database with new entries only: this of course depends on the application. Personally I found it easier to handle duplicates on the database side and not in R, since doing it in R would mean (anti-)joining an in-memory data frame with a remote tibble.

You would either need to pull your full dataset into R, find only the new values and insert them into your database, or insert the new values into a temporary table, find the new values in the database, and insert only these into the full dataset.

The second option is often easier: in Postgres (which I use) it only involves calling insert into XYZ select * from ABC on conflict (id) do nothing, which is easier (much easier with big datasets) than getting the data into memory and finding duplicates via R.
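A small sketch of the temporary-table approach, driven from R through DBI. To keep it runnable without a server it uses an in-memory SQLite database as a stand-in for Postgres, which is an assumption on my part; the table names (entries, staging) and columns are made up. The extra WHERE true is something SQLite asks for when ON CONFLICT follows a SELECT, and Postgres does not need it.

```r
library(DBI)

# In-memory SQLite as a stand-in for Postgres (assumption for this sketch)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical main table with a primary key, holding the earlier scrape
dbExecute(con, "CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT)")
dbWriteTable(con, "entries",
             data.frame(id = 1:2, title = c("first", "second")),
             append = TRUE)

# The fresh scrape goes into a temporary staging table
dbWriteTable(con, "staging",
             data.frame(id = 2:3, title = c("second", "third")),
             temporary = TRUE)

# Let the database skip rows whose id already exists
dbExecute(con, "
  INSERT INTO entries
  SELECT * FROM staging WHERE true
  ON CONFLICT (id) DO NOTHING
")

result <- dbGetQuery(con, "SELECT id, title FROM entries ORDER BY id")
dbDisconnect(con)
```

The point is that R never has to hold the full dataset; it only ships the new scrape to the staging table.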

tyler · September 27, 2017, 9:29pm

Assuming you want minimal oversight over your web scraper, I would also suggest making extensive use of the try(...) & tryCatch(...) functions wrapping the functions you use to scrape. Depending on the source, you might encounter malformed HTML and this can often cause errors and a failure to record your data.
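As a rough illustration of the tryCatch(...) wrapping, something like the sketch below would do. safe_scrape and fake_scraper are hypothetical names; fake_scraper just stands in for whatever rvest-based function you actually use, so the example runs without network access.

```r
# Wrap a scraping function so that one bad page does not stop the whole run.
# `scrape_fun` stands in for whatever rvest-based function you actually use.
safe_scrape <- function(url, scrape_fun) {
  tryCatch(
    scrape_fun(url),
    error = function(e) {
      # Log the failure and return NULL so the caller can skip this page
      message("Failed to scrape ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical scraper that fails on one URL, to show the behaviour
fake_scraper <- function(url) {
  if (grepl("broken", url)) stop("malformed HTML")
  data.frame(url = url, value = 1)
}

urls <- c("https://example.com/ok", "https://example.com/broken")
results <- lapply(urls, safe_scrape, scrape_fun = fake_scraper)

# Drop the failed pages and bind the rest together
scraped <- do.call(rbind, Filter(Negate(is.null), results))
```

The failed URL is reported through message() and the rest of the run carries on, which is what you want for an unattended job.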

raybuhr · September 29, 2017, 6:26am

Excellent advice. I scrape a few sites for deals and push them into a Slack channel every few minutes for a side hustle. The websites often change URLs, HTML tags, div names, etc. Trying to keep up with it all is a pain, but good error handling and logging make it way more manageable.

For managing my web scrapers, I run them on a VM on Google Cloud. $300 credit for a year for each email, and Gmail accounts are free... Plus the free tier options mean you can start out using a lot and scale down to free pretty easily.

I use Jenkins pipelines for building, executing and logging the code that runs. Jenkins is usually used in web development for testing and deployment based on changes to source control, but it is actually just a super deluxe crontab app. It's a lot better than plain cron or Task Scheduler in Windows because it stores the results of multiple builds (i.e. schedules, runs, executions, etc.) and automatically passes anything sent to stdout to the log file, so you don't need a bunch of extra logging statements in your code. Best of all, it's super easy to install and use on both Linux and Windows and has a boatload of documentation online.

For comparing old data versus the current data from the most recent scrape, I store results in a MySQL database and call an INSERT ... ON DUPLICATE KEY UPDATE ... statement.
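For readers doing this from R, the statement could be built roughly as below. This is only a sketch: the deals table and its columns are invented, id is assumed to be the primary key, and DBI's ANSI() dummy connection is used so the statement is built without a running MySQL server.

```r
library(DBI)

# Hypothetical table and columns; `id` is assumed to be the primary key
sql <- "INSERT INTO deals (id, title, price)
        VALUES (?id, ?title, ?price)
        ON DUPLICATE KEY UPDATE title = VALUES(title), price = VALUES(price)"

# ANSI() is a dummy connection, so this only builds the statement
# (with safely quoted values) and does not talk to a database
stmt <- sqlInterpolate(ANSI(), sql, id = 42L, title = "Widget", price = 9.99)

# With a real MySQL connection `con`, it would then be run with:
# dbExecute(con, stmt)
```

Recent MySQL 8 releases deprecate VALUES() in this position in favour of a row alias, so check the documentation for your server version.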