For a weekly updated Shiny dashboard I am connecting to an MSSQL Server and running some queries on different tables that live on the server. For one of those queries I need to access 5 tables with 80,000,000 rows each. It takes up to 1 hour 30 minutes until that query is finished.
I would really like to measure the performance of my query in order to optimize it. However, I don't know whether the query itself is the problem, or the size of the data, or the server, or...
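Right now the only measurement I have is a crude wall-clock timing around the point where the data is actually pulled, roughly like this (the DSN, table name and filter below are placeholders, not my real ones):

library(DBI)
library(dplyr)
library(dbplyr)

con = dbConnect(odbc::odbc(), dsn = "MyMssqlDsn")   # placeholder DSN

system.time({
  result = tbl(con, "TableOne") %>%
    filter(Year == 2023) %>%   # placeholder filter
    collect()                  # this is where the query actually runs on the server
})

That tells me how long the whole thing takes, but not where the time goes.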
While trying to improve my query anyway, I heard about indexes. "Always use indexes," they say. So do I need to learn about indexes independently of R first and then transfer that knowledge, or does it make sense to learn about indexes using R?
There are not that many resources about indexing with R. However, I found one in which the author approached it as follows: he made three tbl() calls to reference 3 tables:
tableOne = tbl(con, "TableOne") %>% filter(...)
tableTwo = tbl(con, "TableTwo") %>% mutate(...)
tableThree = tbl(con, "TableThree")
Afterwards he performed the joins he wanted:
tableFinal = tableOne %>%
  semi_join(tableTwo) %>%
  left_join(tableThree)
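To see what actually gets sent to the server I have been printing the generated SQL, something like this (show_query() is from dbplyr; the names are just the ones from the example above):

# Nothing has been executed against the server yet; tableFinal is only a lazy query.
tableFinal %>% show_query()

# The data is only pulled once collect() is called:
# tableFinal %>% collect()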
So is that what practical indexing is all about? I just make a reference to a table while applying some filters, mutates and so on, so that my reference only contains the data I actually need?
And what is the difference between that approach and the example code from the dbplyr blog post?
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
  )
)
Is that copy_to() call equivalent to:
tbl(con, "flights") %>% select(year, month, day, carrier, tailnum, dest)
? Following the logic from the previous example, this should be the same. Shouldn't it?
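The only way I could think of to check is to ask the server itself which indexes exist on the table after copy_to() has run, maybe with something like this (I am not sure this is the right approach; "flights" is the table name from the blog example):

# Ask SQL Server which indexes exist on the table created by copy_to()
DBI::dbGetQuery(con, "EXEC sp_helpindex 'flights'")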
So this ended up not being a single, specific question, but more a write-up of some ambiguities. I appreciate any kind of help and advice.