Efficiency in dbplyr - COVID19 Research

Names: David
From: Government


I am having trouble with database administers over data pulls with dbplyr. They maintain I am querying tables repeatedly and unnecessarily, but I can't figure out how to do it differently with dbplyr.

Let's say I have a cohort from table A. In tables B and C, I would like only the information for those that appear in table A. My approach was to start with table A, then left_join and select from table B. Does this approach repeatedly draw from table A? Put differently, if I left join to a tbl, does this tbl get drawn then filtered, or is there a "where" statement so only the joined rows are pulled?


2 Likes

it's a bit difficult to answer your question fully since it depends highly on your infrastructure. However, in a general case dplyr operations are "lazy" meaning that things are computed only when you ask for it explicitly (for example when you run dplyr::collect or something similar).

To illustrate that here is a small reprex:

library(dbplyr)
library(magrittr)

a <- memdb_frame(iris %>% dplyr::mutate(id = 1:nrow(iris)))

b <- memdb_frame(mtcars %>% dplyr::mutate(id = 1:nrow(mtcars)))

a %>%
  dplyr::left_join(b, by = "id") %>%
  dplyr::show_query()
#> <SQL>
#> SELECT `LHS`.`Sepal.Length` AS `Sepal.Length`, `LHS`.`Sepal.Width` AS `Sepal.Width`, `LHS`.`Petal.Length` AS `Petal.Length`, `LHS`.`Petal.Width` AS `Petal.Width`, `LHS`.`Species` AS `Species`, `LHS`.`id` AS `id`, `RHS`.`mpg` AS `mpg`, `RHS`.`cyl` AS `cyl`, `RHS`.`disp` AS `disp`, `RHS`.`hp` AS `hp`, `RHS`.`drat` AS `drat`, `RHS`.`wt` AS `wt`, `RHS`.`qsec` AS `qsec`, `RHS`.`vs` AS `vs`, `RHS`.`am` AS `am`, `RHS`.`gear` AS `gear`, `RHS`.`carb` AS `carb`
#> FROM `dbplyr_001` AS `LHS`
#> LEFT JOIN `dbplyr_002` AS `RHS`
#> ON (`LHS`.`id` = `RHS`.`id`)

Created on 2020-05-12 by the reprex package (v0.3.0)

As you can see the query is created, but there is nothing computed. Once you run dplyr::collect or dplyr::compute, then query will actually run in a DB.

So when your colleagues are saying that you "pull" data repeatedly I'm not sure what exactly they mean by that.

1 Like

Thanks Misha. My question is more what happens with collect(), and how I can make it more efficient once executed. Let's say "b" is massive, and I only want rows from "b" with a match in "a", and only a select number of columns from "b":

a %>%
  dplyr::left_join(b %>% select(id)) %>%
  dplyr::show_query()

Then there is a select statement for the full "b" table. What is the easiest way to only select rows from "b" in which there is a match in "a"?

Well, that is where "depends on your infrastructure" thing comes into play. Almost every DB that I know of will take your query and optimize it in some way. For example if it is obvious that you only need id column from b table and only certain rows from there then only those rows will be collect'ed.

What exactly is the problem DBA's in your organization are having?

It's in SQL Server. You can take "a" to be a table of visits, "b" to be table of diagnosis, and "c" to be a table of patient information. They will all be linked by patient ID, but "a" and "b" have to be linked back to "c" to join together. Basically, I want a subset of "a", then the parts of "b" and "c" that pertain to them. DBAs are saying that I am repeatedly pulling all of "c" even with one collect statement. The psuedocode is:

my_visits <- a %>% 
       filter(cond =="something") %>%
       left_join(c %>% select(patientID), by = c("PatientID" = "key"))

diagnoses <- b %>%
       left_join(c %>% select(patientID), by = c("PatientID" = "key2"))

all <- my_visits %>% 
     left_join(diagnoses) %>%
     collect()

This seems to repeatedly SELECT * from "c", and I'm not sure how to avoid that. Apologies if it's obvious and thank you for your help.

1 Like

Can you try replicating the steps with reprex the same way I'm doing it? Because I don't see SELECT * anywhere in the query that dbplyr generates for me here (I don't have SQL Server on hand, so difficult to replicate what you are seeing):

library(dbplyr)
library(magrittr)

a <- memdb_frame(iris %>% dplyr::mutate(id = 1:nrow(iris)))

b <- memdb_frame(mtcars %>% dplyr::mutate(id = 1:nrow(mtcars)))

c <- memdb_frame(tibble::tibble(patient_id = 1:200, some_other_column = 1L))

table1 <- a %>%
  dplyr::left_join(c %>% dplyr::select(patient_id), by = c("id" = "patient_id")) 

table2 <- b %>%
  dplyr::left_join(c %>% dplyr::select(patient_id), by = c("id" = "patient_id")) 

table1 %>%
  dplyr::left_join(table2, by = "id") %>%
  dplyr::show_query()
#> <SQL>
#> SELECT `LHS`.`Sepal.Length` AS `Sepal.Length`, `LHS`.`Sepal.Width` AS `Sepal.Width`, `LHS`.`Petal.Length` AS `Petal.Length`, `LHS`.`Petal.Width` AS `Petal.Width`, `LHS`.`Species` AS `Species`, `LHS`.`id` AS `id`, `RHS`.`mpg` AS `mpg`, `RHS`.`cyl` AS `cyl`, `RHS`.`disp` AS `disp`, `RHS`.`hp` AS `hp`, `RHS`.`drat` AS `drat`, `RHS`.`wt` AS `wt`, `RHS`.`qsec` AS `qsec`, `RHS`.`vs` AS `vs`, `RHS`.`am` AS `am`, `RHS`.`gear` AS `gear`, `RHS`.`carb` AS `carb`
#> FROM (SELECT `LHS`.`Sepal.Length` AS `Sepal.Length`, `LHS`.`Sepal.Width` AS `Sepal.Width`, `LHS`.`Petal.Length` AS `Petal.Length`, `LHS`.`Petal.Width` AS `Petal.Width`, `LHS`.`Species` AS `Species`, `LHS`.`id` AS `id`
#> FROM `dbplyr_001` AS `LHS`
#> LEFT JOIN (SELECT `patient_id`
#> FROM `dbplyr_003`) AS `RHS`
#> ON (`LHS`.`id` = `RHS`.`patient_id`)
#> ) AS `LHS`
#> LEFT JOIN (SELECT `LHS`.`mpg` AS `mpg`, `LHS`.`cyl` AS `cyl`, `LHS`.`disp` AS `disp`, `LHS`.`hp` AS `hp`, `LHS`.`drat` AS `drat`, `LHS`.`wt` AS `wt`, `LHS`.`qsec` AS `qsec`, `LHS`.`vs` AS `vs`, `LHS`.`am` AS `am`, `LHS`.`gear` AS `gear`, `LHS`.`carb` AS `carb`, `LHS`.`id` AS `id`
#> FROM `dbplyr_002` AS `LHS`
#> LEFT JOIN (SELECT `patient_id`
#> FROM `dbplyr_003`) AS `RHS`
#> ON (`LHS`.`id` = `RHS`.`patient_id`)
#> ) AS `RHS`
#> ON (`LHS`.`id` = `RHS`.`id`)

Created on 2020-05-12 by the reprex package (v0.3.0)

Here is a reprex, but I also don't have a way to get to sql server locally. What ends up happening is that I get a lot of statements that look like LEFT JOIN (SELECT TOP 100 PERCENT ...

library(dbplyr)
library(tidyverse)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

visits <- data.frame(key = c(1:10),
                     visit_date = c(51:60))

diagnoses <- data.frame(key2 = c(1:10), diag = c("a", "a", "b", "b", "c", "d", "e", "r", "e", "t"))

patinfo <- data.frame(patid = c(1:10),  key = c(1:10), key2 = c(1:10))

copy_to(con, visits, "a",
        temporary = FALSE, 
        overwrite=T
)
copy_to(con, diagnoses, "b",
        temporary = FALSE, 
        overwrite=T
)
copy_to(con, patinfo, "c",
        temporary = FALSE, 
        overwrite=T
)

my_visits <- tbl(con, "a") %>% 
  filter(visit_date >= 55) %>% 
  left_join(tbl(con, "c") %>% dplyr::select(patid), by = c("key"= "patid")) %>% 
  rename(patid = key)

diagnoses <- tbl(con, "b") %>% 
  left_join(tbl(con, "c") %>% dplyr::select(patid), by = c("key2"= "patid")) %>% 
  rename(patid = key2)

all <- my_visits %>% 
  left_join(diagnoses) %>%
  show_query()
1 Like

That's what I see for your final query:

Joining, by = "patid"
<SQL>
SELECT `LHS`.`patid` AS `patid`, `LHS`.`visit_date` AS `visit_date`, `RHS`.`diag` AS `diag`
FROM (SELECT `key` AS `patid`, `visit_date`
FROM (SELECT `LHS`.`key` AS `key`, `LHS`.`visit_date` AS `visit_date`
FROM (SELECT *
FROM `a`
WHERE (`visit_date` >= 55.0)) AS `LHS`
LEFT JOIN (SELECT `patid`
FROM `c`) AS `RHS`
ON (`LHS`.`key` = `RHS`.`patid`)
)) AS `LHS`
LEFT JOIN (SELECT `key2` AS `patid`, `diag`
FROM (SELECT `LHS`.`key2` AS `key2`, `LHS`.`diag` AS `diag`
FROM `b` AS `LHS`
LEFT JOIN (SELECT `patid`
FROM `c`) AS `RHS`
ON (`LHS`.`key2` = `RHS`.`patid`)
)) AS `RHS`
ON (`LHS`.`patid` = `RHS`.`patid`)

This doesn't look too bad, tbh, so not sure what could be improved. One thing that you could always try is to use dplyr::compute to create a temporary table that you could re-use in different steps. This might help a bit.

I still think that long queries are not exactly bad by itself since every DB will optimize the query anyways. So whatever problems DBA's have might be with something else.

2 Likes

Thanks so much for your help Misha. In your view is it generally better to shorten the query and use temp tables, or put it all through in one go with collect()? Once again I really appreciate your help.

The answer depends a lot on DB you are using. The easiest way to find out is to try out and see which one works better. I've had a situation where two identical dplyr chains worked completely different on two different DB's.

Oh, and one more thing that might be worth checking out is the version of dbplyr you have. If you see a lot of SELECT * then it is likely that your version is too old. There was an issue where sub-queries were generated with SELECT * instead of what it's doing now where all columns are explicitly SELECTed.

1 Like

One additional tool to be aware of here is dplyr::explain(), which provides output similar to dplyr::show_query() but also includes a more detailed query plan. This query plan can provide further clarity into how the query will be executed in the database:

library(dbplyr)
library(tidyverse)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

visits <- data.frame(key = c(1:10),
                     visit_date = c(51:60))

diagnoses <- data.frame(key2 = c(1:10), diag = c("a", "a", "b", "b", "c", "d", "e", "r", "e", "t"))

patinfo <- data.frame(patid = c(1:10),  key = c(1:10), key2 = c(1:10))

copy_to(con, visits, "a",
        temporary = FALSE, 
        overwrite=T
)
copy_to(con, diagnoses, "b",
        temporary = FALSE, 
        overwrite=T
)
copy_to(con, patinfo, "c",
        temporary = FALSE, 
        overwrite=T
)

my_visits <- tbl(con, "a") %>% 
  filter(visit_date >= 55) %>% 
  left_join(tbl(con, "c") %>% dplyr::select(patid), by = c("key"= "patid")) %>% 
  rename(patid = key)

diagnoses <- tbl(con, "b") %>% 
  left_join(tbl(con, "c") %>% dplyr::select(patid), by = c("key2"= "patid")) %>% 
  rename(patid = key2)

all <- my_visits %>% 
  left_join(diagnoses) %>%
  explain()
#> Joining, by = "patid"
#> <SQL>
#> SELECT `LHS`.`patid` AS `patid`, `LHS`.`visit_date` AS `visit_date`, `RHS`.`diag` AS `diag`
#> FROM (SELECT `key` AS `patid`, `visit_date`
#> FROM (SELECT `LHS`.`key` AS `key`, `LHS`.`visit_date` AS `visit_date`
#> FROM (SELECT *
#> FROM `a`
#> WHERE (`visit_date` >= 55.0)) AS `LHS`
#> LEFT JOIN (SELECT `patid`
#> FROM `c`) AS `RHS`
#> ON (`LHS`.`key` = `RHS`.`patid`)
#> )) AS `LHS`
#> LEFT JOIN (SELECT `key2` AS `patid`, `diag`
#> FROM (SELECT `LHS`.`key2` AS `key2`, `LHS`.`diag` AS `diag`
#> FROM `b` AS `LHS`
#> LEFT JOIN (SELECT `patid`
#> FROM `c`) AS `RHS`
#> ON (`LHS`.`key2` = `RHS`.`patid`)
#> )) AS `RHS`
#> ON (`LHS`.`patid` = `RHS`.`patid`)
#> 
#> 
#> <PLAN>
#>   id parent notused                                                    detail
#> 1  3      0       0                                             MATERIALIZE 6
#> 2  7      3       0                                       SCAN TABLE b AS LHS
#> 3  9      3       0                                              SCAN TABLE c
#> 4 29      0       0                                              SCAN TABLE a
#> 5 33      0       0                                              SCAN TABLE c
#> 6 49      0       0 SEARCH SUBQUERY 6 USING AUTOMATIC COVERING INDEX (key2=?)
2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.