I am using the noctua package to send data to a Hive table stored on AWS S3.
con_s3 <- DBI::dbConnect(noctua::athena(), s3_staging_dir = "s3://ourco-emr/tables/revenue_predictions.db/")
dbWriteTable(conn = con_s3,
name = paste0("revenue_predictions.", game_name),
value = prediction_df, # my data frame
append = FALSE,
overwrite = FALSE, # don't have permissions to do this anyway
file.type = "parquet",
partition = c(year = yr, month = mt, day = d),
s3.location = paste0("s3://ourco-emr/tables/revenue_predictions.db/", game_name)
)
This creates new parquet files on S3 along with partitions. This code block runs daily as part of a cron job.
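For context, game_name, yr, mt and d above are just character values set earlier in the script; something along these lines (illustrative only, not my exact code):

# Illustrative only: how the partition values might be derived each morning
run_date <- Sys.Date() - 1              # the predictions are for the previous day
yr <- format(run_date, "%Y")
mt <- format(run_date, "%m")
d  <- format(run_date, "%d")
game_name <- "some_game"                # placeholder name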
The setup I'm working with has three distinct entities:
- An EMR cluster where Hive lives, controlled by our data engineers
- An S3 bucket which stores our Hive tables
- Our analytics EC2 server where RStudio is hosted
Each morning the code block above runs on the EC2 RStudio instance as part of a cron job. This sends the data (predictions created with R) to S3.
So far so good. The predictions appear as desired and correctly partitioned on S3. The trouble is, each morning after the job has run I need to get Hive to recognize the newly created partition for the previous day (partitions are year, month and day).
I do this by logging into EMR in a terminal, opening Hive and running MSCK REPAIR TABLE [tablename]. After I do this, the Hive table detects the newly added data and all is good.
An alternative to the MSCK command that I learned about is to run ALTER TABLE like so:
ALTER TABLE schemaname.tablename ADD IF NOT EXISTS
PARTITION (year='2020', month='03', day='24')
LOCATION 's3://ourco-emr/tables/adhoc.db/mtcars/2020/03/24';
My question is: is there some way to automate this last step as part of my cron job within R?
On the EC2 server I have ODBC connections to both S3 and Hive. However, looking at the documentation for DBI, I only see dbWriteTable and dbSendQuery, and those appear to cover writing tables and SELECT queries only.
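To make the goal concrete, this is roughly what I would like to run right after the dbWriteTable call, over my existing Hive ODBC connection (called con_hive here; the partition LOCATION path follows the layout in the ALTER TABLE example above). I don't know whether the Hive ODBC driver accepts DDL sent this way, which is essentially my question:

# Hypothetical: con_hive is my existing Hive ODBC connection on the EC2 server
repair_sql <- sprintf(
  "ALTER TABLE revenue_predictions.%s ADD IF NOT EXISTS
   PARTITION (year='%s', month='%s', day='%s')
   LOCATION 's3://ourco-emr/tables/revenue_predictions.db/%s/%s/%s/%s'",
  game_name, yr, mt, d, game_name, yr, mt, d
)
DBI::dbExecute(con_hive, repair_sql)    # or dbSendQuery(), if dbExecute() isn't supported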
Some fallback solutions:
- Create a shell script on the EMR cluster and run it via cron every e.g. 30 minutes with the Hive command MSCK REPAIR TABLE [tablename]. Highly inelegant.
- Create empty partitions in Hive up to e.g. the end of the year and run MSCK REPAIR TABLE [tablename] ahead of time so Hive recognizes all partitions until year end (a rough sketch follows this list). Then come January 1st, repeat. Also inelegant.
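For what it's worth, a rough sketch of that second fallback: generate the ADD PARTITION statements for every remaining day of the year in R, then run the resulting file once from the Hive CLI (the LOCATION layout again follows the ALTER TABLE example above; game_name as before):

# Sketch of fallback 2: pre-generate ADD PARTITION statements up to year end
days <- seq(Sys.Date(), as.Date(format(Sys.Date(), "%Y-12-31")), by = "day")
stmts <- sprintf(
  "ALTER TABLE revenue_predictions.%s ADD IF NOT EXISTS PARTITION (year='%s', month='%s', day='%s') LOCATION 's3://ourco-emr/tables/revenue_predictions.db/%s/%s/%s/%s';",
  game_name, format(days, "%Y"), format(days, "%m"), format(days, "%d"),
  game_name, format(days, "%Y"), format(days, "%m"), format(days, "%d")
)
writeLines(stmts, "add_partitions.hql")  # run once on the EMR: hive -f add_partitions.hql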
My preference would be to get Hive to recognize the partitions right after they are created. Is there a way of doing this using my existing Hive or S3 ODBC connections on the EC2 server?