Hello RStudio Community,
I'm after some general advice on how to approach a data caching problem. I have a large SQL query which takes approx. 90 seconds to run, the results of which are consumed in a Shiny app which will--eventually--be deployed via shinyapps.io.
The database I'm pinging is updated daily during the early hours of the morning, before everyone gets into work. Due to our shinyapps deployment, my understanding is that it wouldn't be possible to schedule a cron job on the server to run at, say, 6am to save a .Rdata file to Amazon S3 (or equivalent) for faster data loads.
I'm therefore envisioning a solution whereby the first user of the application on a given day creates a 'day cache' for all subsequent users of the application. The first user would have to wait the full 90 seconds to grab the data, the data are cached to S3 and then all subsequent users of the app on that day benefit from much faster application load times.
After some research, I've found several caching packages which might help, including memoise
and cachem
. While I understand the concept of caching and read Winston Chang's excellent post here, I'm not sure it fully gets at my requirement for a daily refresh of the cache. My understanding is perhaps a little wonky but, as my data-grab function doesn't change--the SQL and function calls remain the same--I'm not sure how the app would 'know' that the cache needs to change?
Any help and advice warmly received!
Cheers!
you can key on the date. If the current date is not found in the cache then you create it and make it corresponding to todays data. if it is found, then its because you created it and it corresponds to the days data.
Thanks for the reply @nirgrahamuk. When referring to keying on date, is this within the memoise
or cachem
packages? Are you aware of any tutorials that run through this particular scenario that you could perhaps reference? Thanks!
I had in mind, cachem.
The cachem vignette is clear and concise from my perspective
Thanks for the suggestion. However, my experience in developing caching code is quite limited; what's clear and concise for you isn't necessarily so for others.
It isn't particularly clear how cachem
can save to Amazon S3 (there is disk caching but it appears to be local as shown in the vignette/readme) and I didn't find reference to keying by date. I will continue to research this package, however, in the hope of finding a solution.
If the server is yours then you have access to its local file system and the disk cache would make sense. If you arent hosting but using shinyapps.io then you won't have that freedom. However it seems if you want to save to s3, maybe just separate that out from your app. And schedule a cron job or similar to set up your s3 as you need it each day.
I did consider the cron job approach, but it's another server that I'd need to maintain. It's not out of the question if that's the best way of solving the problem, but I was hoping for a shinyapps.io contained system.
In my mind, I was envisioning a short block of cache code that's executed on app load. It checks Amazon S3 to see if a cache file exists for the current date. If it does, it's loaded into the app and away you go.
If a cache file doesn't exist (i.e. a user opens the app at 8am, before anyone else), create the cache, save it to S3 and the delete the previous day's cache. When subsequent users access the app, the day's cache check will pass and a new cache file isn't generated.
If this is completely unworkable, I'll natrually have to consider cron jobs.
I dont use amazon s3. What functions does it provide for you to check existence of data ? A regular file system can be checked with base::file.exists()
Same, AWS is new for me, too. I believe you can check bucket contents with aws.s3::get_bucket()
(more information here).
A thought I've just had; perhaps it's just a case of reading and writing .Rdata files to S3, rather than using cache libs such as cachem
? As long as I name each data file in a sensible and predictable manner (e.g. today's date: 20211115.rdata) then it would just be a case of (pseudocode):
# Get bucket contents
bucket<-aws.s3::get_bucket(...)
# Check if "today's date" exists
if(paste0(Sys.Date(), ".rdata") %in% bucket) {
my_data<-aws.s3::s3load(paste0(Sys.Date(), ".rdata"))
} else {
# Get data from database
dat<-DBI::dbSendQuery(my_db, my_sql_query)
# Save data to S3 bucket
aws.s3::s3save(dat, object = paste0(Sys.Date(), ".rdata"), bucket = my_bucket)
# Delete previous day's data
delete_object(paste0(Sys.Date()-1, ".rdata"))
}
I'm sure there are some disadvantages to this approach I haven't considered, but it could work?
Thanks for continuing to pitch in and be a sounding board for my chaotic thought process.