I need help automating an R script that pulls data from an API. Right now I'm just looking for a minimum viable solution. I know I can do this with some combination of GitHub Actions and Docker, but I'm new to both technologies, and my attempts to adapt solutions from videos and tutorials to my use case haven't worked.
Here is a tree of my current project -
├── README.md
├── data
├── dev
│   └── R
│       ├── extract_shelter_data.R
│       └── required_packages.R
└── my_current_project.Rproj
Here are the contents of extract_shelter_data.R -
setwd(here::here("dev", "R"))

library(tidyverse)
library(opendatatoronto)
library(janitor)

get_shelter_data <- function(year = 2024) {

  # Open Data API info
  info <- opendatatoronto::show_package("21c83b32-d5a8-4106-a54f-010dbe49f6f2") %>%
    list_package_resources() %>%
    filter(str_to_lower(format) %in% c("csv", "geojson")) %>%
    filter(!is.na(last_modified)) %>%
    arrange(desc(last_modified)) %>%
    mutate(last_modified_year = lubridate::year(last_modified))

  info_2 <- info %>%
    filter(last_modified_year == year) %>%
    arrange(desc(last_modified)) %>%
    head(1)

  # Info check (nrow() rather than length(), since a tibble with zero rows
  # still has columns; check info_2, which is what gets used below)
  if (is.null(info_2) || nrow(info_2) == 0) {
    stop("No API info extracted! Check API info code chunk", call. = FALSE)
  }

  # Data extract (Open Data API)
  data <- info_2 %>%
    get_resource() %>%
    janitor::clean_names() %>%
    mutate(occupancy_date = lubridate::ymd(occupancy_date)) %>%
    head(5)

  # Data check
  if (is.null(data) || nrow(data) == 0) {
    stop("No data extracted! Check data chunk", call. = FALSE)
  }

  # Stamp the extraction time and return
  ret <- data %>% mutate(time = Sys.time())
  return(ret)
}

shelter_raw_tbl <- get_shelter_data()

# Format the timestamp so the file name has no spaces or colons
save_path <- str_glue("../../data/shelter_raw_tbl_{format(Sys.time(), '%Y%m%d_%H%M%S')}.csv")
shelter_raw_tbl %>% write_csv(save_path)
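
I haven't shown required_packages.R above; it's essentially just install.packages() calls for the libraries the script loads. A sketch of it (not the exact file, which may differ slightly) looks like this -

# required_packages.R (sketch -- the real file may list slightly different packages)
# Install everything extract_shelter_data.R needs on a fresh machine/runner
install.packages(c("tidyverse", "opendatatoronto", "janitor", "here"))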
My goal is to automate running the extract_shelter_data.R script to pull data from the API and save it in the data folder. Eventually I'll be saving the data to a BigQuery database, but for now I just want to save locally while I get it working.
I also have a .github/workflows folder in my root directory, with a schedule.yml file in it. Here are the contents of the file, which I got from ChatGPT -
name: Schedule R Script

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:

jobs:
  run-script:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Setup R
        uses: r-lib/actions/setup-r@v2

      - name: Install R packages
        run: |
          Rscript dev/R/required_packages.R

      - name: Run R script
        run: |
          Rscript dev/R/extract_shelter_data.R

      - name: Commit and push changes
        run: |
          git config --global user.name "${{ secrets.GITHUB_ACTOR }}"
          git config --global user.email "${{ secrets.GITHUB_ACTOR }}@users.noreply.github.com"
          git add -A
          git commit -m "Automated data extraction update"
          git push
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Unfortunately, the Install R packages step just runs endlessly.
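
From reading around, my guess is that install.packages() is compiling tidyverse and all of its dependencies from source on the Linux runner, which can take a very long time and looks like a hang. One variant I've seen suggested (but haven't managed to verify myself) is to let the r-lib actions install prebuilt binaries instead, along these lines -

- name: Setup R
  uses: r-lib/actions/setup-r@v2
  with:
    use-public-rspm: true  # binary packages from Posit Public Package Manager

- name: Install R packages
  uses: r-lib/actions/setup-r-dependencies@v2
  with:
    packages: |
      any::tidyverse
      any::opendatatoronto
      any::janitor
      any::here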
- What would be the right configuration for a schedule.yml file to schedule this script using GitHub Actions?
- If I also need to use Docker, what would be an appropriate Dockerfile? (A rough sketch of what I'm imagining is below.)
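
For reference, the kind of Dockerfile I have in mind, adapted from rocker tutorials (untested, and the image tag is only a guess) -

# Sketch only -- adapted from rocker examples, not verified
FROM rocker/tidyverse:4.4.1

# Extra packages not bundled in the rocker/tidyverse image
RUN R -e "install.packages(c('opendatatoronto', 'janitor', 'here'))"

COPY . /home/project
WORKDIR /home/project

CMD ["Rscript", "dev/R/extract_shelter_data.R"]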
Please do let me know if I should be providing any additional info.