Summary() but on the database side of the wire

jdlong · January 14, 2018, 3:54am

Hey y'all I'm sitting here on a Saturday night thinking about summarizing data, as one does. I regularly have a big table in Amazon Redshift which I'd love to get summary data from describing each field the way summary() in base R or skim() does. But my data is big enough that sucking it into R would take hours. So I'm thinking about using dplyr to cook up some code to effectively execute summary stuff, but do it over in the cloud and return only what's needed for the output. (I was inspired by this blog post discussing all the different summary functions)

Before I start writing crappy code in anger, do any of you know of anyone or any packages already doing this sort of foolishness? I really hate reinventing the wheel if possible.

many thanks!

elben10 · January 14, 2018, 1:49pm

You could create an issue in skimr. It uses dplyr so it shouldn't be too difficult to implement something similar using dbplyr.

jdlong · January 14, 2018, 2:56pm

good idea... I didn't realize that skimr uses dplyr. I'll investigate.

I think the big challenge with pushing the logic to the DB is much of the logic would have to be re-implemented using the limited syntax of SQL. So, for example, to calculate quantiles one would have to count all records, then get the value at different points in the distribution... e.g. get the value at 1/5 n then 2/5 n ...

dickoah · January 14, 2018, 6:07pm

Nice, I woudl also like to see if there's any code or package to do this. Just for the specific case of quantile, I think you can use percentile_cont (available since Postgres 9.4) to compute quantile, not sure if it's as precise as the R version of quantile though.

library(tidyverse)
library(dbplyr)
library(RPostgres)
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "testdb")
copy_to(con, rownames_to_column(mtcars), name = "mtcars")

req <- "SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY mpg) AS q1 FROM mtcars"
tbl(con, sql(req))
##      q1
##   <dbl>
## 1  15.4

mtcars %>%
  summarise(q1 = quantile(mpg, probs = 0.25))
##       q1
## 1 15.425

On a side note, it would be great if someone can point me to mecanism to add specific aggregate function to an already existing DBI backend. Using the help from sql_variant, I came up with this

pg_quantile <-  function(x, probs, ...) {
  stopifnot(length(probs) == 1) ## 1 parameters supported
  sql(str_glue("PERCENTILE_CONT({probs}) WITHIN GROUP (ORDER BY {x})", x = x, probs = probs))
}

postgres_agg <- sql_translator(.parent = base_agg,
                              quantile = pg_quantile)

postgres_var <- sql_variant(
  base_scalar,
  postgres_agg,
  base_no_win
)

conn <- structure(
  list(),
  class = c("TestCon", "DBITestConnection", "DBIConnection")
)

sql_translate_env.TestCon <- function(x) postgres_var
translate_sql(quantile(x, probs = 0.5), con = conn, window = FALSE)
## <SQL> PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `x`)

It will be great to have a way to use this new translation directly with dplyr, any doc available ?

Thanks

jdlong · January 15, 2018, 12:42am

I love the idea of using percentile_cont because it simplifies things, and I'm lazy. But then the package would be dependant on postgres and not generally applicable to all SQL backends. I'm inclined to limit myself to the SQL functions supported in dblyr, just for compatibility. Although that will be more work to code up.

edgararuiz · January 15, 2018, 2:23pm

Morning, for the histogram you could use dbplot . It provides a quick and easy ggplot2 output, or it can return the bin+count in a data.frame, all operated within the database. It works with any database that dbplyr works with.

jdlong · January 15, 2018, 2:25pm

neat project! Thank you. Is it possible to get the bin values from the plot? Are those accessible to the user?

edgararuiz · January 15, 2018, 2:28pm

Thanks! And yes, all of the underpinning calculations are availabe in their own discrete functions: https://github.com/edgararuiz/dbplot#calculation-functions

jdlong · January 15, 2018, 3:30pm

Thanks Edgar. that's very helpful. I'll take a look at your code in the near future and see what I can cobble together.

kylelundstedt · January 20, 2018, 9:33pm

Integrating the dbplot-style calculations into skimr would be great! Getting skimr-style summaries via dbplyr against Redshift and the like would make EDA so much easier.

jdlong · January 30, 2018, 6:53pm

that's exactly the use case I had in mind, @kylelundstedt!

jdlong · December 2, 2018, 8:38pm

It's been almost a year, but I'm back to thinking about this. I was hoping I'd drag my feet and someone would just write this for me. But, alas, no luck. If anyone has ideas about porting skimr to play well with dbplyr, please let me know. I opened a discussion on the skimr github page as well: https://github.com/ropensci/skimr/issues/358