Hi there,
I am trying untangle what happens in dbplyr to accomplish it's great lazy behaviour. Specifically I am trying to figure out how dplyr
verbs "know" how to delay evaluation and only send the SQL query to the database as late as possible. I've got down to a part that seems magical and I was hoping someone here could explain how and when what I am getting is then translated in SQL.
So first we need to create a database and load dplyr
and dbplyr
:
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
So here is the example of the filter method in dbplyr
"catching" the evaluation in flight_table
and adding the deptime == "517"
filter. Also include here is the specification of the query (flights_table$ops
) before that filter:
flights_table <- tbl(con, "flights")
flights_table$ops
#> From: flights
#> <Table: flights>
flights_table %>%
dbplyr:::filter.tbl_lazy(dep_time == "517")
#> # Source: lazy query [?? x 19]
#> # Database: sqlite 3.22.0 [:memory:]
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 10 20 517 520 -3 754
#> 3 2013 11 2 517 515 2 806
#> 4 2013 11 10 517 515 2 749
#> 5 2013 12 6 517 515 2 813
#> 6 2013 12 11 517 515 2 825
#> 7 2013 12 29 517 500 17 712
#> 8 2013 2 24 517 515 2 816
#> # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dbl>
Turns out that within the filter method there is the add_op_single
function which returns the filtered table as per the lazy SQL query:
#debugonce(dbplyr:::filter.tbl_lazy)
## Inside the filter method
dots <- quos(dep_time == "517")
dots <- partial_eval(dots, vars = op_vars(flights_table))
#debugonce(add_op_single)
add_op_single("filter", flights_table, dots = dots)
#> # Source: lazy query [?? x 19]
#> # Database: sqlite 3.22.0 [:memory:]
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 10 20 517 520 -3 754
#> 3 2013 11 2 517 515 2 806
#> 4 2013 11 10 517 515 2 749
#> 5 2013 12 6 517 515 2 813
#> 6 2013 12 11 517 515 2 825
#> 7 2013 12 29 517 500 17 712
#> 8 2013 2 24 517 515 2 816
#> # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dbl>
Drilling down even further inside add_op_single
, we find op_single
adding the information to generate the query somehow the original flights_table
.
## Inside add_op_single
flights_table$ops <- op_single("filter", x = flights_table$ops, dots = dots,
args = list())
Which looks like this:
flights_table$ops
#> From: flights
#> <Table: flights>
#> -> filter()
#> - ~dep_time == "517"
Created on 2019-02-08 by the reprex package (v0.2.1)
Question
So if you made it this far, my question is how does one go from flights_table$ops
to <SQL> SELECT * FROM 'flights' WHERE (
dep_time= '517')
and how and when is that sent to the database? I must be missing some steps here but I can't seem to figure out when the SQL is evaluated and when it is translated from flights_table$ops
. Any insight here?
Thanks in advance,
Sam