cronR job looks like it's running over and over, unsure how to debug?

dougfir · April 6, 2020, 11:02pm

I have a job that runs each morning using cronR. Our data engineer emailed me to say that for a few days over the past week there was an absurdly high number of Athena queries coming from our odbc connection that I have set up on rstudio.

Sure enough, I checked the cron log of the job and it looks like the job is trying over and over. I deduce this from the fact that there's hundreds of lines of output of just the packages loading but not the completion messages of the job itself.

I realise that this is a vague post. I would like to provide more information I am just really at a loss for what or where to look. Any pointers most welcome, any suggestions about where I should look most welcome.

Below is a sanitized copy of the cron output log. You can see the tidyverse loading up again and again whereas it should just appear once when the script runs per day.

When the script does run fine it should accumulate a data frame for 4 time horizons and it outputs a message like so:

fungame: Gathered data for cohort: 2020-03-27 horizon: 30
fungame: Gathered data for cohort: 2020-03-27 horizon: 90
fungame: Gathered data for cohort: 2020-03-27 horizon: 180
fungame: Gathered data for cohort: 2020-03-27 horizon: 365

After this the script combines these prediction vectors onto a data frame and then sends to AWS s3. However, it looks like it's spinning it's wheels since many times I see the message e.g.

fungame: Gathered data for cohort: 2020-03-27 horizon: 30
fungame: Gathered data for cohort: 2020-03-27 horizon: 90

So in this case it only managed to get 2 of the 4 predictions.

If this kind of issue sounds familiar to anyone please share your experience. I have a cron task that seems to be spinning it's wheels and retrying over and over which is a problem since we are billed on a per query basis using AWS Athena.

When I open the cronR gui and load the file my_schedule.cron I see my job:

## cronR job
## id:   fungame_revenue_predictions
## tags: predictions
## desc: Fungame daily revenue predictions
* 14 * * * /usr/lib64/R/bin/Rscript '/home/rstudio-doug/analysis/radhoc/revenue_model/cron_scripts/cron_run.R' fungame 7 >> '/home/rstudio-doug/analysis/radhoc/revenue_model/cron_scripts/cron_run_fungame.log' 2>&1

I'm not to familiar with how cronR and the crontab work or the relationship between them. Could the fact that I have pressed the 'load crontab schedule' within the gui several times duplicated the job? I don't think so because when I run crontab -e in the terminal I only see a single job. I wondered if I might have inadvertantly duplicated the scheduling of this job by reloading the schedule several times. But that might be an irrelevant diversion, I really do not know why this is happening?

Link to cron log since it's too long to post.

technocrat · April 6, 2020, 11:11pm

Just for what it's worth: does the generating code explicitly close the db connection either after success or, if it fails, reschedules itself to run several increasing levels before sending out a cry for human intervention?

dougfir · April 6, 2020, 11:34pm

Hi, no, I do not explicitly close the connection. Here's a very very top level view of the gist of the job:

# Libraries & Functions ----
library(tidyverse)
library(DBI)
library(scales)
library(dbplyr)
library(odbc)
library(rlang)
library(glue)


# Connections
con_s3 <- DBI::dbConnect(noctua::athena(), s3_staging_dir = "s3://ourco-emr/tables/revenue_predictions.db/")
con_athena <- DBI::dbConnect(odbc(), "Athena")
hive_con <- dbConnect(odbc(), dsn = "Hive", Host = "ourhost.com", Port = 10000)


# othergame has less history
if(game_name == 'othergame') {
  horizons <- c(30, 90, 180)
} else {
  horizons <- c(30, 90, 180, 365)
}

# run
## returns a df with the installs for each cohort/date
cohort_installs_query <- read_lines("/home/rstudio-doug/analysis/radhoc/revenue_model/sql_queries/cohort_game_installs.sql") %>% 
glue_collapse(sep = "\n") %>% 
glue_sql(.con = con_athena)
cohort_data <- dbGetQuery(con_athena, cohort_installs_query) # for holding the predictions on each horizon iteration


# Predictions for each time horizon
## loops over each time horizon and then gets a vector of predictions to apend to cohort_data df
for(h in horizons) {
  day_to <- h
  source("/home/rstudio-doug/analysis/radhoc/revenue_model/core/globals.R", local = T) # generate some variables for this loop iterations (has now become a poorly named file)
  source("/home/rstudio-doug/analysis/radhoc/revenue_model/core/predict.R", local = T) # will return prediction_df with predictions for the above cohort
  cohort_data <- cohort_data %>% left_join(prediction_df, by = "s") # join onto cohort's data
  print(paste0(game_name, ": Gathered data for cohort: ", cohort_date, " horizon: ", day_to))
}
  

# Send the assembled data frame with predictions to S3
source("/home/rstudio-doug/analysis/radhoc/revenue_model/core/send_to_s3.R", local = T) # send the predictions to s3
  
# Print cohort completion confirmation message
print(paste0(game_name, ": Done upload to S3 for: ", cohort_date))

That's the gist of it, not sure if that's helpful or not.

technocrat · April 6, 2020, 11:44pm

Try a toy job with an explicit close() and see if it's process runs away also? Just open, minimal command, close and test logs against another minimal without close?

dougfir · April 7, 2020, 12:04am

I will do that just now and update, thanks for the suggestion...

dougfir · April 7, 2020, 12:16am

I have never used the close() command before. Within the context of the above, is the idea I close each script I am referencing or the entire loop script?

e.g.

for (h in horizons) {
  source("somescript.R")
  source("anotherscript.R")
}

source("send_to_s3.R")

close()

Like that?

dougfir · April 7, 2020, 12:24am

close()

Error in UseMethod("close") :
no applicable method for 'close' applied to an object of class "NULL"

I am unsure how to apply close() per your suggestion?

technocrat · April 7, 2020, 12:45am

Right. After you are through with the work for which you need an open connection, if you close the file handle you don't have to worry about zombies

The error indicates that the db connection object had not yet been created.

dougfir · April 7, 2020, 1:00am

Oh you mean to close the db cons? This still results in errors like below. Note that those connections are successful, I do see the connections pane open up each time I run each con_ line.

# connections
con_s3 <- DBI::dbConnect(noctua::athena(), s3_staging_dir = "s3://glu-emr/tables/revenue_predictions.db/")
con_athena <- DBI::dbConnect(odbc(), "Athena")
hive_con <- dbConnect(odbc(), dsn = "Hive", 
                       Host = "emr-long-running.bi.aws.us.gluops.com", 
                       Port = 10000)
 close()
Error in UseMethod("close") : 
  no applicable method for 'close' applied to an object of class "NULL"
 close(hive_con)
Error in UseMethod("close") : 
  no applicable method for 'close' applied to an object of class "c('Hive', 'OdbcConnection', 'DBIConnection', 'DBIObject')"
 close(con_athena)
Error in UseMethod("close") : 
  no applicable method for 'close' applied to an object of class "c('Amazon Athena', 'OdbcConnection', 'DBIConnection', 'DBIObject')"
 close(con_s3)
Error in UseMethod("close") : 
  no applicable method for 'close' applied to an object of class "c('AthenaConnection', 'DBIConnection', 'DBIObject')"

technocrat · April 7, 2020, 1:35am

That probably reflects my state of ignorance how Athena differs from other db connections. I'm now officially out of wacky ideas

dougfir · April 7, 2020, 1:57am

@larefly is it possible to close a noctua::athena connection similar to using close()?

larefly · April 7, 2020, 9:36am

Hi All,

Generally with DBI connections you can call dbDisconnect to close DBI connections. For example:

library(DBI)

# Connect to AWS Athena from odbc
con <- dbConnect(odbc(), "Athena")

# Disconnect from AWS Athena
dbDisconnect(con)

This is similar to how close() works in base R when closing connections.

For noctua, the SDK packages (paws or boto3) don't offer a closing method. So noctua removes the R resource when dbDisconnect is called to mimic a similar affect for example:

library(DBI)

# Connect to AWS Athena using noctua
con <- dbConnect(noctua::athena())

# Get query "select 1" from AWS Athena
dbGetQuery(con, "select 1")
# Info: (Data scanned: 0 Bytes)
#   _col0
#    1


# Disconnect from AWS Athena
dbDisconnect(con)

# Attempt to call same query from AWS Athena
dbGetQuery(con, "select 1")
Error: Connection already closed.

Secondly you can reduce the number of connections to AWS Athena. I have noticed that you make 2 connections to AWS Athena. One with odbc and the other with noctua. What is odbc doing that noctua can't do when connecting to AWS Athena? For example noctua should play nice with dplyr and glue:

library(DBI)
library(glue)
library(dplyr)
library(readr)

con <- dbConnect(noctua::athena())

query <- read_lines("test.sql") %>% 
  glue_collapse(sep = "\n") %>% 
  glue_sql(.con = con)

query
# <SQL> select * from iris

dbGetQuery(con, query)

# dplyr alternative
tbl(con, sql(query))

# Or
tbl(con, "iris")

dbDisconnect(con)

NOTE: noctua speeds up tbl with a AWS Glue call: noctua/NEWS.md at master · DyfanJones/noctua · GitHub. This makes it easier to use dplyr's lazy evaluation and lazy queries (shamelessly prompting a hidden feature )

Final note: You can assume a role for a limited time. This will close the connection from AWS side of things, for example:

library(DBI)

# Connect to AWS Athena for 3 hours before connection closes
con <- dbConnect(noctua::athena(),
                 role_arn = "arn:aws:iam::demo_role",
                 duration_seconds = 10800)

I hope this helps. Please let me know if this doesn't resolve your issue.

larefly · April 7, 2020, 11:55am

Here is an example of noctua playing nice with glue:

library(DBI)
library(glue)

con <- dbConnect(noctua::athena())

# Write iris to AWS Athena
dbWriteTable(con, "iris", iris)

# As "." are converted to "_" for column name to be compliant with AWS Athena
var <- c("sepal_width", "sepal_length", "species")
tbl <- "iris"
num <- 2
val <- "setosa"
(query <- glue_sql("
        SELECT {`var`*}
        FROM {`tbl`}
        WHERE {`tbl`}.sepal_length > {num}
          AND {`tbl`}.species = {val}
        ", .con = con))

#<SQL> SELECT "sepal_width", "sepal_length", "species"
#FROM "iris"
#WHERE "iris".sepal_length > 2
#  AND "iris".species = 'setosa'

# Query AWS Athena
dbGetQuery(con, query)

# Using DBI::Id
var_ids <- lapply(var, function(x) DBI::Id(table="iris", column = x))
(query <- glue_sql("
  SELECT {`var_ids`*}
  FROM {`tbl`}
  WHERE {`tbl`}.sepal_length > {num}
    AND {`tbl`}.species = {val}
  ", .con = con))

#SELECT "iris"."sepal_width", "iris"."sepal_length", "iris"."species"
#FROM "iris"
#WHERE "iris".sepal_length > 2
#  AND "iris".species = 'setosa'

# Query AWS Athena
dbGetQuery(con, query)

dougfir · April 7, 2020, 7:58pm

Thanks! Yes, this works. @technocrat I have added this to my loop so that on each iteration the connections are closed. Will see during the next cron if the behavior changes. When I tested with a manual run right now everything seemed to be fine.

system · April 28, 2020, 7:58pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.