Shiny dashboard with refreshed data from remote storage (bigQuery)

spiritus87 · December 1, 2017, 11:57pm

Hi there,

Up to now I have only played with shiny locally with small datasets that were loaded from csv directly into ram. I would like now to develop a dashboard for use within my company that displays a year of daily time-series data from the current day from a remote data source (bigQuery). This data lives in a (virtualized) view on bigquery; every time the view is queried one-year worth of data (365 days) from the current day is returned.

I guess I'm a bit (very) confused as it's the first time I do this, but I have some basic questions about best practices I couldn't find a straightforward answer to:

is it a good idea to write the shiny app to query the needed data from the database, via e.g. dplyr, at startup? I suspect this would slow down the startup of the app.
Additionally, if users A, B, C, ... were running sessions on the app at the same time, they would each send queries to the remote source for the data to be loaded into memory; wouldn't this lead to many copies of the same data to be loaded into memory, thus exhausting it? (this would be particularly painful for bigQuery as you are billed per GB queried) (i hope you understand what I mean...)
is it better practice to set up a e.g. cron job to pull the needed data on the machine hosting the shiny server (perhaps as .RData, or as monetDBlite?) and let the shiny app to load the data locally? I imagine this would speed up the startup time. Users would still duplicate the data in ram though (is there a way to load a dataset in ram and let every app that is spin up access that dataset?) and for a production environment running cron jobs like this seems a bit spooky.

Sorry for the confused question but.. well, I'm confused

Riccardo.

edgararuiz · December 2, 2017, 2:39am

Hi, the ideal setup would be for the Shiny app to only collect results from the database. The results could come from pushing the computation to the database, via dplyr, or by creating a temp table that can persist the aggregated results that will be displayed. Row level data would not be retrieved until the user executes a drill-down action in the app. Here is an example in the RStudio database site: http://db.rstudio.com/best-practices/dashboards/

MarkeD · December 2, 2017, 10:05am

I use BigQuery a lot for my Shiny dashbaords, and it sounds like your queries are small enough that there won't be an unreasonable delay before seeing data - for daily trends this year perhaps a few thousand rows which should take a couple of seconds to load up, but there are some things that can help:

If you know beforehand what data you need, make a "report view" that has the aggregated data to query
You can cache the results to disk using saveRDS() and readRDS() for the first fetch of the day into a shared disk - either on the first load of the app or a cron job that runs in the morning.
If its for multiple Shiny apps, you get a lot quicker downloads from Google cloud storage than a BigQuery fetch, so consider saving the cache .rds files to there
Running the Shiny app on google compute engine for BigQuery apps has much lower latency than using AWS or Digital Ocean etc.

spiritus87 · December 2, 2017, 10:05am

Hi Edgar,

Great link, thanks! But suppose I wanted to persist the aggregated data in a table that is queried by the app. Then I would want these queries to be as fast as possible. What would be the best database solutions to store these precomputed tables to display in a shiny app? I've heard good things about monetDB but perhaps people here have more insight.

To compare, for the main website at my company we have our data in sql server but we aggregate it into an apache solr database to display it to our users as it's faster. So I wonder if it would be useful to have a similar flow for a shiny app - e.g., precompute the table in bigquery and store it in a database solution like monetDB because it offers more performance?

Riccardo.

spiritus87 · December 3, 2017, 10:55am

Hi @MarkeD

Thanks a lot for your suggestions! I'll definitely try them out. It's true that this present app doesn't need much data but we have a lot more data in there so I want to make sure I have as scalable a flow as possible

One thing however: as far as I understand big query views are not materialised, so the query would be run by every app instance, which could get expensive in the long run (unless you cache the data somewhere locally every day). Ever had problems with this?

Ciao R.

edgararuiz · December 4, 2017, 1:56pm

Usually, persisting the results would be done inside the same database as the source if possible. In your case, can you materialize those views inside bigquery?

MarkeD · December 7, 2017, 11:34am

One thing however: as far as I understand big query views are not materialised, so the query would be run by every app instance, which could get expensive in the long run (unless you cache the data somewhere locally every day). Ever had problems with this?

Its true views are just SQL that runs underneath any SQL you do on the table, but the costs have been so minimal I've not needed to optimise this, (repeat queries are cached at BQs end too) however I can see if you are running a lot of TB level queries then this may be an issue, in which case I would create a new aggregated table rather than a view, and use a cron job or something like DataPrep to schedule a report table for your Shiny app