Disclaimer: this question does not contain any reproducible code
I wrote a script to scrape this website: https://gasprices.aaa.com/ with the {targets} package. Since the data on the website is updated every day, I would like to schedule the script to run once a day with GitHub Actions. The issue I think I will face, however, is that because the script itself will not change between daily runs, no scraping will happen: since I am using {targets}, I expect to just get the usual "skip pipeline" messages in the log.
Is my line of thinking correct? And if so, how can I solve this issue?
Here is a small reproducible example of my _targets.R file:
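Something along these lines, where scrape_gas_prices() and the output path are illustrative stand-ins rather than the exact code:

# _targets.R
library(targets)

tar_option_set(packages = c("rvest", "dplyr"))

# Illustrative scraping helper: pull every HTML table from the page
scrape_gas_prices <- function(url) {
  page <- rvest::read_html(url)
  tables <- rvest::html_table(page)
  dplyr::bind_rows(tables)
}

list(
  # Scrape the page
  tar_target(
    gas_prices,
    scrape_gas_prices("https://gasprices.aaa.com/")
  ),
  # Write the result to a local CSV tracked as a file target
  tar_target(
    gas_prices_csv,
    {
      dir.create("data", showWarnings = FALSE)
      path <- file.path("data", "gas_prices.csv")
      write.csv(gas_prices, path, row.names = FALSE)
      path
    },
    format = "file"
  )
)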
I know that one solution could be to delete the _targets/ directory in the project. This would force the whole pipeline to rerun every day; however, I see this solution as a hack.
Maybe try tar_target(…, "your_url.com/file.csv", format = "url"). That target will check the file at the URL using the last-modified timestamp and ETag (if available) and invalidate the target automatically if either changes. For running {targets} on GitHub Actions, the tar_github_actions() function generates a workflow file, and github.com/wlandau/targets-minimal is an example.
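A sketch of that pattern, with a placeholder URL standing in for the real file:

# _targets.R
library(targets)

list(
  # The target's value is the URL string itself; {targets} re-checks the
  # ETag / last-modified header at every tar_make() and invalidates the
  # target when either changes.
  tar_target(
    raw_file_url,
    "https://your_url.com/file.csv",  # placeholder URL
    format = "url"
  ),
  # Downstream targets rerun only when the URL target is invalidated.
  tar_target(
    raw_data,
    read.csv(raw_file_url)
  )
)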
Thanks for your response, @wlandau, and especially for a great package.
Actually, the script is not downloading a file; it is scraping data from a webpage and storing the data locally (which is later pushed to a GitHub repo). So I am not quite sure how to use your suggestion, tar_target(…, "your_url.com/file.csv", format = "url"), in this case. Here is an example webpage that is scraped: AAA Gas Prices
Someone in a Slack channel I belong to suggested using a cue, which I had never heard about. Would this be a possible solution to explore?
Thank you.
Also, thank you for letting me know about tar_github_actions().
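(For reference, that helper is a one-liner that writes the workflow file into the project:)

# Writes .github/workflows/targets.yaml into the current project
targets::tar_github_actions()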
EDIT
I added the cue = tar_cue(mode = "always") argument to the main scraping target; however, it does not seem to run on GitHub Actions. It runs perfectly on my computer, though.
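The relevant target looks roughly like this (target and helper names are illustrative, not the exact code):

library(targets)

list(
  tar_target(
    gas_prices,
    scrape_gas_prices("https://gasprices.aaa.com/"),
    # Force this target to rerun at every tar_make(), even if nothing
    # upstream changed; downstream targets still skip if the hash of its
    # value is unchanged.
    cue = tar_cue(mode = "always")
  )
)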
EDIT 2
I modified the targets.yaml workflow file a bit and now it works as intended. The YAML file produced by tar_github_actions() contains several lines that I do not understand, so I rewrote it with simpler tasks (e.g. installing the packages manually). This is what it looks like now:
# Hourly scraping
name: us_gas_prices_scraper

# Controls when the action will run.
on:
  push:
    branches:
      - main
      - master

jobs:
  autoscrape:
    # The type of runner that the job will run on
    runs-on: macos-latest

    # Load repo and install R
    steps:
      - uses: actions/checkout@master
      - uses: r-lib/actions/setup-r@master

      # Set-up R
      - name: Install packages
        run: |
          R -e 'install.packages(c("targets", "rvest", "dplyr", "stringr", "purrr", "here", "glue"))'

      - name: Run scraper
        run: |
          Rscript _targets.R
          R -e 'targets::tar_make()'

      # Add new files in data folder, commit along with other modified files, push
      - name: Commit files
        run: |
          git config --local user.name github-actions
          git config --local user.email "actions@github.com"
          git add .
          git commit -am "US gas price data scraped on $(date)"
          git push origin master
        env:
          REPO_KEY: ${{secrets.GITHUB_TOKEN}}
          username: github-actions
Great, sounds like you solved the issue. And yes, a cue is a good workaround: tar_cue(mode = "always") is great if you always want to scrape the data. Then, if the hash of that data did not change since the last run, the downstream targets may be skipped. tarchetypes::tar_change() is another way to go about this if you have some way of checking the modification time (or similar) of the website you are scraping, but that may not be necessary in your case if the actual scraping step is computationally cheap.
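A sketch of the tar_change() approach, assuming the page returns a Last-Modified header (it may not; any cheap fingerprint of the page would work as the change value), with scrape_gas_prices() again as an illustrative helper:

# _targets.R
library(targets)
library(tarchetypes)

list(
  # Rerun the scraping target only when the `change` value differs from
  # the value recorded at the previous run.
  tar_change(
    gas_prices,
    scrape_gas_prices("https://gasprices.aaa.com/"),
    change = httr::HEAD("https://gasprices.aaa.com/")$headers[["last-modified"]]
  )
)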