I need help automating an R script with GitHub Actions and maybe Docker

ML_Rookie_2021 · May 25, 2024, 8:14pm

I need help automating an R script that pulls data from an API. Right now I'm just looking for a minimum viable solution. I know I can do this with some combination of GitHub Actions and Docker. However, I'm new to both technologies, and although I've tried to adapt solutions from videos and tutorials to my use case, I can't get it to work.

Here is a tree of my current project -

├── README.md
├── data
├── dev
│   └── R
│       ├── extract_shelter_data.R
│       └── required_packages.R
└── my_current_project.Rproj

Here are the contents of extract_shelter_data.R -


setwd(here::here("dev", "R"))

library(tidyverse)
library(opendatatoronto)
library(janitor)

get_shelter_data <- function(year = 2024) {
       
    # Open Data API Info
    info <- opendatatoronto::show_package("21c83b32-d5a8-4106-a54f-010dbe49f6f2") %>% 
        list_package_resources() %>% 
        filter(str_to_lower(format) %in% c("csv", "geojson")) %>% 
        filter(! is.na(last_modified)) %>% 
        arrange(desc(last_modified)) %>% 
        mutate(last_modified_year = lubridate::year(last_modified))
    
    info_2 <- info %>% 
        filter(last_modified_year == year) %>% 
        arrange(desc(last_modified)) %>% 
        head(1)
    
    # Info Check
    # `info` is a data frame: length() would count its columns, so test rows
    if (is.null(info) || nrow(info) == 0) {
        stop("No API info extracted! Check API info code chunk", call. = FALSE)
    }
    
    # Data Extract (Open Data API)
    data <- info_2 %>% 
        get_resource() %>% 
        janitor::clean_names() %>% 
        mutate(occupancy_date = lubridate::ymd(occupancy_date)) %>% 
        head(5)
    
    # Data Check
    if (is.null(data) || nrow(data) == 0) {
        stop("No data extracted! Check data chunk", call. = FALSE)
    }
    
    ret <- data %>% mutate(time = Sys.time())
    
    # Return
    return(ret)
    
}

shelter_raw_tbl <- get_shelter_data()

# Format the timestamp so the file name has no spaces or colons
save_path <- str_glue("../../data/shelter_raw_tbl_{format(Sys.time(), '%Y%m%d_%H%M%S')}.csv")

shelter_raw_tbl %>% write_csv(save_path)

My goal is to automate running the extract_shelter_data.R script to pull data from an API and save it in the data folder. Eventually I would be saving the data to a BigQuery database, but right now I just want to save locally while I get it to work.

I also have a .github/workflows folder in my root directory, with a schedule.yml file in it. Here are the contents of the file, which I got using ChatGPT -

name: Schedule R Script

on:
  schedule:
    - cron: '*/5 * * * *'
  workflow_dispatch:

jobs:
  run-script:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout repository
      uses: actions/checkout@v3

    - name: Setup R
      uses: r-lib/actions/setup-r@v2

    - name: Install R packages
      run: |
        Rscript dev/R/required_packages.R

    - name: Run R script
      run: |
        Rscript dev/R/extract_shelter_data.R

    - name: Commit and push changes
      run: |
        git config --global user.name "${{ github.actor }}"
        git config --global user.email "${{ github.actor }}@users.noreply.github.com"
        git add -A
        git commit -m "Automated data extraction update"
        git push
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Unfortunately the install R packages step just runs endlessly.

What would be the right configuration for a schedule.yml file to schedule this script using GitHub Actions?
If I need to use Docker as well, what would be the appropriate configuration for a Dockerfile?

Please do let me know if I should be providing any additional info.

dougfir · May 30, 2024, 12:06am

What does required_packages.R look like?

dougfir · May 30, 2024, 12:38am

Unfortunately the install R packages step just runs endlessly.

I'm giving this a try over here. I think if you are installing tidyverse each time on a runner it will take quite a while. As of typing it's still installing tidyverse.

A way around this, like you suggested, would be to use Docker. Last I tried, GHA caches the container, so each run builds much faster than freshly downloading everything each time. If you are able to develop the image with Docker and get that working, creating a 'container action' will allow you to run it within schedule.yml.

dougfir · May 30, 2024, 2:50am

I was unable to get this running in GHA; I struggled to install the tidyverse package. See history here.

There are system dependencies that need to be installed on the host runner, which is why it's failing.

I would try using the rocker/verse Docker image, which comes with tidyverse already installed, then running it as a container action.
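
A rough sketch of the container idea, not a tested setup. It swaps the 'container action' for a simpler job-level container, which runs every step inside the image. Assumptions: the rocker/verse:latest tag, the daily cron cadence, the artifact name shelter-data, and that janitor, here and opendatatoronto are not already in the image (reinstalling them is harmless if they are).

```yaml
name: Schedule R Script (container sketch)

on:
  schedule:
    - cron: '0 6 * * *'   # assumed cadence: once a day at 06:00 UTC
  workflow_dispatch:

jobs:
  run-script:
    runs-on: ubuntu-latest
    # Every step below runs inside this image, which ships with tidyverse
    container:
      image: rocker/verse:latest
    steps:
      - uses: actions/checkout@v4

      # Assumed to be missing from the image
      - name: Install extra packages
        run: Rscript -e 'install.packages(c("janitor", "here", "opendatatoronto"))'

      - name: Run R script
        run: |
          # Git does not track empty directories, so make sure data/ exists
          mkdir -p data
          Rscript dev/R/extract_shelter_data.R

      # Keeps the CSV as a downloadable artifact instead of committing it.
      # upload-artifact is documented to reject file names containing ':',
      # so the CSV name needs a timestamp without colons.
      - uses: actions/upload-artifact@v4
        with:
          name: shelter-data
          path: data/
```

No Dockerfile is needed for this variant because the image is pulled as-is. A custom Dockerfile only becomes useful once the extra packages should be baked into the image.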

Gabor · May 30, 2024, 6:52am

You mean that it is frozen? Or it just takes a long time?

If the latter then you can cache the packages with the actions/cache action [1].

OTOH I suggest you use the r-lib/actions/setup-r-dependencies action [2] to install R packages. It has a built-in cache and it uses binary packages from Posit Package Manager, so they install pretty fast. The best approach is to create a dummy DESCRIPTION file, and then the packages are picked up automatically. See e.g. the Tidyverse design book [3] for an example.

[1] GitHub - actions/cache: Cache dependencies and build outputs in GitHub Actions
[2] actions/setup-r-dependencies at v2-branch · r-lib/actions · GitHub
[3] GitHub - tidyverse/design: Tidyverse design principles
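
A minimal sketch of how the suggestion above could look as a complete schedule.yml, under stated assumptions rather than a verified configuration. The daily cron cadence, the placeholder package name in the DESCRIPTION comment, the bot identity used for the commit, and the mkdir step are all assumptions. Note that extract_shelter_data.R calls library(tidyverse) while required_packages.R never installs tidyverse, so the DESCRIPTION sketched in the comment lists it explicitly.

```yaml
# Assumed dummy DESCRIPTION in the repository root (package name is a placeholder):
#   Package: shelterdata
#   Version: 0.0.1
#   Imports: tidyverse, janitor, here, opendatatoronto

name: Schedule R Script

on:
  schedule:
    - cron: '0 6 * * *'   # assumed cadence: once a day at 06:00 UTC
  workflow_dispatch:

permissions:
  contents: write   # allows the job to push the new CSV back to the repository

jobs:
  run-script:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      # Picks up the packages from DESCRIPTION, installs binaries from
      # Posit Package Manager, and caches them between runs
      - uses: r-lib/actions/setup-r-dependencies@v2

      - name: Run R script
        run: |
          # Git does not track empty directories, so make sure data/ exists
          mkdir -p data
          Rscript dev/R/extract_shelter_data.R

      - name: Commit and push new data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add data
          # Only commit when the script actually produced something new
          git diff --cached --quiet || git commit -m "Automated data extraction update"
          git push
```

With this layout the separate "Install R packages" step and required_packages.R are no longer needed, since the dependency action handles installation and caching.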

ML_Rookie_2021 · May 30, 2024, 8:13pm

@Gabor To answer your question, it does not freeze. The install step just runs endlessly and never completes.

To your suggestion: so I create a DESCRIPTION file and add this code, with any additional packages I need?

steps:
- uses: actions/checkout@v4
- uses: r-lib/actions/setup-r@v2
- uses: r-lib/actions/setup-r-dependencies@v2
  with:
    cache-version: 2
    extra-packages: |
      any::ggplot2
      any::rcmdcheck
    needs: |
      website
      coverage

Also, what should my schedule.yml file look like?

ML_Rookie_2021 · May 30, 2024, 8:21pm

Thanks @dougfir. To answer your question, these are the contents of the required_packages.R file -

p <- c("dplyr", "stringr", "lubridate", "readr", "janitor", "here", "opendatatoronto")

pkg <- .packages(all.available = TRUE)
for(i in p){
    if(!i %in% pkg){
      message("Package ", i, " is not installed. Installing the package:")
      install.packages(i)
    }

}

You suggested using the rocker/verse image. Is this as easy as including the image link in the schedule.yml file? I definitely need to upskill on Docker to fully understand what I'm doing lol.
