How to reduce dbplyr querying the database during the query-building phase?

I've been using dbplyr to build a complex query against data in Snowflake. It works well, but I notice it takes a long time to build the query, even before I collect(). It seems that at every step dbplyr needs to check with the database (Snowflake in this instance) for things like column names, column types, and column contents (in the case of pivot_wider()). Building the query takes as long as actually executing it.

Are there any tips to reduce dbplyr's communication with the database prior to collect()? I noticed on GitHub someone said you can use pivot_wider_spec() to speed up pivot_wider() a bit, but this wouldn't save me much here as I only do pivot_wider() twice (although I do know the new column names in advance).

Any tips?

Edit: I do have several local tables used for joining to tables in the database; I push them with copy_to(), and I wonder if this is what makes it slow.
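A minimal sketch of what I mean, assuming a DBI connection called snowflake and a local lookup data frame (the table and column names here are made up):

library(dplyr)

# Push the local lookup table to the server once, as a temporary table,
# and keep the lazy reference around for later joins
lookup_remote <- copy_to(snowflake, local_lookup,
                         name = "local_lookup_tmp",
                         temporary = TRUE, overwrite = TRUE)

joined <- tbl(snowflake, "SOME_DB_TABLE") %>%
  left_join(lookup_remote, by = "KEY_COL")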

Edit2: One thing I am doing is expanding on time. I have tables that include a datetime (whole hours), and I want to expand the table to the full time series. I am creating the series of hours locally, pushing it with copy_to(), and then left_joining the DB table onto it. Is there a more efficient way?
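Roughly what I do now (a sketch; the date range, timezone, and table/column names are placeholders):

library(dplyr)

# Build the full hourly series locally ...
hours_local <- tibble(
  DATETIME = seq(from = as.POSIXct("2023-01-01 00:00:00", tz = "UTC"),
                 to   = as.POSIXct("2023-12-31 23:00:00", tz = "UTC"),
                 by   = "hour")
)

# ... push it once as a temporary table ...
hours_remote <- copy_to(snowflake, hours_local, name = "hours_tmp",
                        temporary = TRUE, overwrite = TRUE)

# ... and left_join the DB table onto it so every hour is present
expanded <- hours_remote %>%
  left_join(tbl(snowflake, "WEATHER_TABLE"), by = "DATETIME")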

Edit3: My bottleneck is this code:

weathnew4 <- weathnew3 %>%
  pivot_wider(names_from = "NAME", values_from = "VALUEI")

pivot_wider() is not lazy, so it executes the (complex and slow) query behind weathnew3.

I know the values in the NAME column; they are stored in a character vector called variables. I tried using pivot_wider_spec(), which should be lazy, but I got an error.

spec <- tibble(NAME = variables, .name = variables, .value = "VALUEI")
weathnew4 <- weathnew3 %>%
  pivot_wider_spec(spec)

Error in `data[vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))]`:
! Can't subset columns with `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))`.
✖ Subscript `vec_unique(c(id_cols, names_from_cols, values_from_cols, unused_cols))` can't contain missing values.
✖ It has a missing value at location 1.
Run `rlang::last_trace()` to see where the error occurred.
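If I understand the GitHub discussion correctly, tidyr::pivot_wider_spec() is not a generic, so calling it on a lazy table falls through to the data-frame code (which may be what the error above is about). Recent dbplyr versions (2.4.0 or later, as far as I can tell) export dbplyr_pivot_wider_spec() for exactly this case. I haven't verified it against my setup, but I think the call would look something like:

spec <- tibble::tibble(.name = variables, .value = "VALUEI", NAME = variables)

# dbplyr's own spec-based pivot, which shouldn't need to query the
# distinct values of NAME before building the SQL
weathnew4 <- weathnew3 %>%
  dbplyr::dbplyr_pivot_wider_spec(spec)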

As a fallback, this generic raw-SQL approach (below) almost worked, but something gets lost along the way. It runs, but I get errors in subsequent steps.

names_str <- paste0("'", variables, "'", collapse = ", ")

# Take the SQL generated for the lazy query so far and wrap it in a native PIVOT
weathnew3_sql <- dbplyr::remote_query(weathnew3)
sql_str <- glue::glue("
  SELECT * FROM (
    {weathnew3_sql}
  )
  PIVOT (
    MAX(VALUEI)
    FOR NAME IN ({names_str})
  )")

weathnew4 <- dplyr::tbl(snowflake, dplyr::sql(sql_str)) %>%
  rename_with(~ stringr::str_remove_all(., "'"))  # Snowflake keeps the quotes in the pivoted column names
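One sanity check that helped me confirm the wrapper stays lazy: print the generated SQL without executing anything.

# Inspect the SQL that will be sent, without running the query
weathnew4 %>% dplyr::show_query()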

I'm not using a connection to a SQL database for tables much bigger than 125K x 50, so my head may be screwed on differently. I seem to have picked up the notion that dbplyr is sort of training wheels for native SQL syntax. If I were doing this, not knowing your particular approaches, and wanted to develop a list of queries without all the traffic back and forth, I'd do something like

library(DBI)

# Pull the table and column names once, into a local lookup
tables <- dbListTables(con)
schema <- lapply(tables, function(table) {
  fields <- dbListFields(con, table)
  data.frame(table = table, field = fields, stringsAsFactors = FALSE)
})
schema <- do.call(rbind, schema)

and create a local database with a handful of records to use as a proxy for the target while getting the syntax down and playing around with the table operations. Then I'd set it loose, via dbConnect(), on the target instance. A sketch of the idea is below.
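For example (just a sketch of the idea, using an in-memory SQLite database as the stand-in; SQLite won't match Snowflake's dialect exactly, so this is for working out the pipeline rather than the final SQL):

library(DBI)
library(dplyr)

# A throwaway local database as a proxy for the real target
proxy <- dbConnect(RSQLite::SQLite(), ":memory:")

# Seed it with a handful of representative rows per table
# (sample_rows here is a placeholder for however you grab a few records)
copy_to(proxy, sample_rows, name = "WEATHER_TABLE", temporary = FALSE)

# Develop the dbplyr pipeline against the proxy, then point the same
# code at the real connection once the syntax is settled
tbl(proxy, "WEATHER_TABLE") %>%
  count(NAME) %>%
  show_query()

dbDisconnect(proxy)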


That's a great tip. I should definitely do that.

There's an issue on GitHub about pivot_wider() never being lazy (linked below). The proposed solution didn't work for me. In the future I think I will avoid using pivot_wider() in dbplyr.

pivot_wider doesn't seem to be lazy · Issue #598 · tidyverse/dbplyr (github.com)
