Determining bottleneck of collect() function for database

docj · July 6, 2022, 11:36pm

Hi all.

I'm playing around with a local database where I filter an index column for n indices. The table I'm querying is 7.5 million lines and 13 columns. The code looks something like this:

pacman::p_load(DBI, tidyverse)
### We need a list of indices to query the data base with. 
### L is "database size" and n is number of indices
L <- 1000000
n <- 326
indexList <- seq(1:L, n)

### We query the database/table by passing indexList to dplyr::filter()
### and then we load the results into memory in R with dplyr::collect()
con <- dbConnect(RPostgres::Postgres(), dbname = "example_db")
tbl(con, "example_table") %>% filter(index %in% indexList) %>% collect()

Collecting the data takes about five seconds which is fine except that when I set up a similar database on my personal laptop running the code takes only 0.05 seconds. I also tried rerunning the query with dbGetQuery() but that wasn't any faster (or slower).

I'm guessing this has to do with the CPU type, type of hard drive, "closeness" of hardware, and other stuff. Is there any way for me to determine what the bottleneck is on the slower machine? I'm relatively new to benchmarking/databases so I would appreciate any input.

Thanks!

docj · July 7, 2022, 10:27am

I figured it out. The table was missing a primary key so there was no "structure" imposed on the data. Once I added BTree structure to the table querying and collecting took 0.02 seconds.

system · July 14, 2022, 10:27am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.