Mimicking dbplyr's laziness

boshek · February 9, 2019, 12:18am

Hi there,

I am trying untangle what happens in dbplyr to accomplish it's great lazy behaviour. Specifically I am trying to figure out how dplyr verbs "know" how to delay evaluation and only send the SQL query to the database as late as possible. I've got down to a part that seems magical and I was hoping someone here could explain how and when what I am getting is then translated in SQL.

So first we need to create a database and load dplyr and dbplyr:

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

copy_to(con, nycflights13::flights, "flights",
        temporary = FALSE, 
        indexes = list(
          c("year", "month", "day"), 
          "carrier", 
          "tailnum",
          "dest"
        )
)

So here is the example of the filter method in dbplyr "catching" the evaluation in flight_table and adding the deptime == "517" filter. Also include here is the specification of the query (flights_table$ops) before that filter:

flights_table <- tbl(con, "flights")

flights_table$ops
#> From: flights
#> <Table: flights>

flights_table %>% 
  dbplyr:::filter.tbl_lazy(dep_time == "517") 
#> # Source:   lazy query [?? x 19]
#> # Database: sqlite 3.22.0 [:memory:]
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      517            515         2      830
#> 2  2013    10    20      517            520        -3      754
#> 3  2013    11     2      517            515         2      806
#> 4  2013    11    10      517            515         2      749
#> 5  2013    12     6      517            515         2      813
#> 6  2013    12    11      517            515         2      825
#> 7  2013    12    29      517            500        17      712
#> 8  2013     2    24      517            515         2      816
#> # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dbl>

Turns out that within the filter method there is the add_op_single function which returns the filtered table as per the lazy SQL query:

#debugonce(dbplyr:::filter.tbl_lazy)

## Inside the filter method
dots <- quos(dep_time == "517")
dots <- partial_eval(dots, vars = op_vars(flights_table))
#debugonce(add_op_single)
add_op_single("filter", flights_table, dots = dots)
#> # Source:   lazy query [?? x 19]
#> # Database: sqlite 3.22.0 [:memory:]
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      517            515         2      830
#> 2  2013    10    20      517            520        -3      754
#> 3  2013    11     2      517            515         2      806
#> 4  2013    11    10      517            515         2      749
#> 5  2013    12     6      517            515         2      813
#> 6  2013    12    11      517            515         2      825
#> 7  2013    12    29      517            500        17      712
#> 8  2013     2    24      517            515         2      816
#> # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dbl>

Drilling down even further inside add_op_single, we find op_single adding the information to generate the query somehow the original flights_table.

## Inside add_op_single
flights_table$ops <- op_single("filter", x = flights_table$ops, dots = dots, 
                       args = list())

Which looks like this:

flights_table$ops
#> From: flights
#> <Table: flights>
#> -> filter()
#>    - ~dep_time == "517"

^{Created on 2019-02-08 by the reprex package (v0.2.1)}

Question

So if you made it this far, my question is how does one go from flights_table$ops to <SQL> SELECT * FROM 'flights' WHERE (dep_time= '517') and how and when is that sent to the database? I must be missing some steps here but I can't seem to figure out when the SQL is evaluated and when it is translated from flights_table$ops. Any insight here?

Thanks in advance,

Sam

Moody_Mudskipper · February 9, 2019, 6:46am

The print method sends the command to the DB, with a limitation on rows returned. In this case it doesn't show you what's in your object, but a glimpse of what you would get if you collect it.

Using print to explore objects can be deceptive. Better use str, attributes, unclass...

privacy · February 9, 2019, 7:48pm

Inside RStudio, another option to view/explore object is a View(obj); it will open in a tab of the files pane, and it's like str but interactive, allowing you to open/close sub-objects at each layer;
I find the interactive behavior super useful with deep objects, as str only allows you to bound the depth over all layers, rather than "dig in" just one sub-element that you're interest in.
(I think that by default it doesn't show attributes but there is a checkbox or something like that to enable the "show attributes" feature.)

boshek · February 11, 2019, 7:24pm

This is all very helpful (thanks!) and yes I clearly should have been using str, attributes etc.

Still though I am completely baffled still how intermediate SQL statements are sent to the DB from print methods with the object created by tbl (ie. before collect). Here are the classes for an object created by tibble:

flights_table <- tbl(con, "flights")

attributes(flights_table)
#> $names
#> [1] "src" "ops"
#> 
#> $class
#> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
#> [4] "tbl_lazy"             "tbl"

If the print methods are sending the query to the db I must missing something as I can't find any print methods for the classes returns by tbl:

tbl_class <- class(flights_table)

paste0("print.", tbl_class) %in% ls(getNamespace("dbplyr"), all.names = TRUE)
#> [1] FALSE FALSE FALSE FALSE FALSE


paste0("print.", tbl_class) %in% ls(getNamespace("dplyr"), all.names = TRUE)
#> [1] FALSE FALSE FALSE FALSE FALSE

Again I feel like I am missing something obvious here but if there aren't print methods for the object returned by tbl then how is the query getting sent to the database?

boshek · February 13, 2019, 11:30pm

@Moody_Mudskipper any insight here? Thanks for your previous comments.

Moody_Mudskipper · February 14, 2019, 5:53pm

It's because it lives in tibble, though it doesn't need to be attached.

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attachement du package : 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, nycflights13::flights, "flights")
flights_table <- tbl(con, "flights")
tbl_class <- class(flights_table)
l <- sapply(loadedNamespaces(),function(x){
  methods <- paste0("print.", tbl_class)
  methods[methods %in% ls(getNamespace(x), all.names = TRUE)]
})
l[lengths(l)==1]
#> $tibble
#> [1] "print.tbl"
   
tibble:::print.tbl
#> function (x, ..., n = NULL, width = NULL, n_extra = NULL) 
#> {
#>     cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
#>     invisible(x)
#> }
#> <bytecode: 0x0000000019a8a910>
#> <environment: namespace:tibble>
tibble:::format.tbl
#> function (x, ..., n = NULL, width = NULL, n_extra = NULL) 
#> {
#>     mat <- trunc_mat(x, n = n, width = width, n_extra = n_extra)
#>     format(mat)
#> }
#> <bytecode: 0x00000000196565c8>
#> <environment: namespace:tibble>
tibble:::trunc_mat
#> function (x, n = NULL, width = NULL, n_extra = NULL) 
#> {
#>     rows <- nrow(x)
#>     if (is_null(n)) {
#>         if (is.na(rows) || rows > tibble_opt("print_max")) {
#>             n <- tibble_opt("print_min")
#>         }
#>         else {
#>             n <- rows
#>         }
#>     }
#>     n_extra <- n_extra %||% tibble_opt("max_extra_cols")
#>     df <- as.data.frame(head(x, n))
#>     shrunk <- shrink_mat(df, rows, n, star = has_rownames(x))
#>     trunc_info <- list(width = width, rows_total = rows, rows_min = nrow(df), 
#>         n_extra = n_extra, summary = tbl_sum(x))
#>     structure(c(shrunk, trunc_info), class = c(paste0("trunc_mat_", 
#>         class(x)), "trunc_mat"))
#> }
#> <bytecode: 0x0000000019181298>
#> <environment: namespace:tibble>
tibble:::print.tbl
#> function (x, ..., n = NULL, width = NULL, n_extra = NULL) 
#> {
#>     cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
#>     invisible(x)
#> }
#> <bytecode: 0x0000000019a8a910>
#> <environment: namespace:tibble>
dbplyr:::head.tbl_lazy
#> function (x, n = 6L, ...) 
#> {
#>     if (inherits(x$ops, "op_head")) {
#>         x$ops$args$n <- min(x$ops$args$n, n)
#>     }
#>     else {
#>         x$ops <- op_single("head", x = x$ops, dots = dots, args = list(n = n))
#>     }
#>     x
#> }
#> <bytecode: 0x0000000017aba0b0>
#> <environment: namespace:dbplyr>

Created on 2019-02-14 by the reprex package (v0.2.0).

What it does ultimately is add the SQL code to limit the number of lines returned, if you run a query with a group by you're preview might take a long time to be returned

privacy · February 17, 2019, 8:05pm

Hmm, really interesting. I guess that this happens only because an attached package uses functionality from other packages without attaching them, which means that even if we use the pryr::where function, we will not see the print.tbl function. The only hint I can see for its existence is when I type "print.tbl" in the help window (equivalent to ?? print.tbl), it shows me a page titled " Tools for describing matrices", that has a version of print for tbl:

## S3 method for class 'tbl'
print(x, ..., n = NULL, width = NULL, n_extra = NULL)

but also at the end fo this page:

See Also
tibble-package

When you click on tibble-package link you can see that function resides there:

Methods
tbl_df implements four important base methods:

print
By default only prints the first 10 rows (at most 20), and the columns that fit on screen; see print.tbl()

[
Does not simplify (drop) by default, returns a data frame

[[, $
Calls .subset2() directly, so is considerably faster. Returns NULL if column does not exist, $ warns.

A pretty long way to go to find where the function resides, but at least there are links that lead you there...

Will appreciate to know if there are alternatives, easier ways to find if there is such a function and where it belongs...

Moody_Mudskipper · February 18, 2019, 1:31am

I'm confused, I think my code gives you exactly that with :

 l <- sapply(loadedNamespaces(),function(x){
  methods <- paste0("print.", tbl_class)
  methods[methods %in% ls(getNamespace(x), all.names = TRUE)]
})
l[lengths(l)==1]

Can you elaborate on what you'd want that this doesn't provide ?

privacy · February 18, 2019, 6:03pm

Sorry, my bad; that statement of your code indeed provides me with the information. Sorry I missed that...

system · May 14, 2019, 8:14pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.