Hey y'all I'm sitting here on a Saturday night thinking about summarizing data, as one does. I regularly have a big table in Amazon Redshift which I'd love to get summary data from describing each field the way summary() in base R or skim() does. But my data is big enough that sucking it into R would take hours. So I'm thinking about using dplyr to cook up some code to effectively execute summary stuff, but do it over in the cloud and return only what's needed for the output. (I was inspired by this blog post discussing all the different summary functions)
Before I start writing crappy code in anger, do any of you know of anyone or any packages already doing this sort of foolishness? I really hate reinventing the wheel if possible.
good idea... I didn't realize that skimr uses dplyr. I'll investigate.
I think the big challenge with pushing the logic to the DB is much of the logic would have to be re-implemented using the limited syntax of SQL. So, for example, to calculate quantiles one would have to count all records, then get the value at different points in the distribution... e.g. get the value at 1/5 n then 2/5 n ...
Nice, I woudl also like to see if there's any code or package to do this. Just for the specific case of quantile, I think you can use percentile_cont (available since Postgres 9.4) to compute quantile, not sure if it's as precise as the R version of quantile though.
library(tidyverse)
library(dbplyr)
library(RPostgres)
library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "testdb")
copy_to(con, rownames_to_column(mtcars), name = "mtcars")
req <- "SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY mpg) AS q1 FROM mtcars"
tbl(con, sql(req))
## q1
## <dbl>
## 1 15.4
mtcars %>%
summarise(q1 = quantile(mpg, probs = 0.25))
## q1
## 1 15.425
On a side note, it would be great if someone can point me to mecanism to add specific aggregate function to an already existing DBI backend. Using the help from sql_variant, I came up with this
pg_quantile <- function(x, probs, ...) {
stopifnot(length(probs) == 1) ## 1 parameters supported
sql(str_glue("PERCENTILE_CONT({probs}) WITHIN GROUP (ORDER BY {x})", x = x, probs = probs))
}
postgres_agg <- sql_translator(.parent = base_agg,
quantile = pg_quantile)
postgres_var <- sql_variant(
base_scalar,
postgres_agg,
base_no_win
)
conn <- structure(
list(),
class = c("TestCon", "DBITestConnection", "DBIConnection")
)
sql_translate_env.TestCon <- function(x) postgres_var
translate_sql(quantile(x, probs = 0.5), con = conn, window = FALSE)
## <SQL> PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `x`)
It will be great to have a way to use this new translation directly with dplyr, any doc available ?
I love the idea of using percentile_cont because it simplifies things, and I'm lazy. But then the package would be dependant on postgres and not generally applicable to all SQL backends. I'm inclined to limit myself to the SQL functions supported in dblyr, just for compatibility. Although that will be more work to code up.
Morning, for the histogram you could use dbplot . It provides a quick and easy ggplot2 output, or it can return the bin+count in a data.frame, all operated within the database. It works with any database that dbplyr works with.
Integrating the dbplot-style calculations into skimr would be great! Getting skimr-style summaries via dbplyr against Redshift and the like would make EDA so much easier.
It's been almost a year, but I'm back to thinking about this. I was hoping I'd drag my feet and someone would just write this for me. But, alas, no luck. If anyone has ideas about porting skimr to play well with dbplyr, please let me know. I opened a discussion on the skimr github page as well: https://github.com/ropensci/skimr/issues/358