Connecting R with Database | Large Query Result

Hi everyone

I need some help related to R programming. It would be really nice if someone could help me. I am a beginner in R language.

I have one FACT table in SQL Server database, which will have around 10 million records. I am planning to fetch entire data through dbGetQuery and store it in a Data Table. Then perform some cleaning, create derived columns and at the end process the data.

I am little confused how should I approach this problem.

If someone has experience or have done similar thing, please let me know.

Thanks a lot :slight_smile:

I would check out this resource first. From there, here is how I would break down your task:

  • do any filtering of records you would be removing in R before you bring the data into R. Let the database do that work for you
  • More generally, any work that you can have the database do for your prior to R is going to help
  • within dplyr, once you connect to a remote database, you can use the verbs to to operate on the data within the database as if it were a native R dataframe.

While I am guilty of doing the same thing (on much larger datasets), I would certainly attempt to keep as much of the data cleaning/filtering as you can within the SQL Server database as you can in order to avoid the overhead of data going over the wire. Basically, I would hesitate to just SELECT * the data into your R session.


Hi! Welcome!

I think it will be easier for people here to help if you can narrow down your question a little bit. What are you looking for help with first?

  • Constructing the code to connect to and query the database?
  • Advice on how feasible your plan is performance-wise, or other approaches you might consider?
  • Pointers to resources for learning how to wrangle data with data.table? (when I’m in data.table land I always keep a cheat sheet handy!)
  • Specific help with data.table code?
  • Something else?

An important note: if you have questions about specific code, it’s going to be easiest to get help if you can provide an as-reproducible-as-possible example of what you’ve got so far, even if it’s not working right.

1 Like

Hi Brock

Thanks a lot for your reply :slight_smile:

I have one FACT table and various dimension tables. I have created one query to do inner join with dimensions tables to finally generating the data with required columns.
I am using this query to fetch the data in R. There is no filter which we can apply as we have to load entire year data.


con <- DBI::dbConnect(odbc::odbc(),
                      Driver    = "SQL Server", 
                      Server    = "xyz",
                      Database  = "xyz",
                      UID       = "xyz",
                      PWD       = "xyz",
                      Port      = xyz)


This is something which I have created till now. But the problem is this query is time consuming because of many joins and volume of data.

So was confused how should I approach it


Thanks a lot for your reply :slight_smile:
I have tried to explain my problem again in one of the above reply.

Could you please see if this makes sense to you now :slight_smile:

Basically wanted to know how to handle queries which return huge volume of data.

Thank you

Yes, ideally, instead of downloading data to R and then exploring it, it would be better to explore the data where it sits, in other words, use R to push the aggregations and filters to be produced in inside the Database and the pull down to R only the results. Here is an overview of that technique: . The central idea is to use dplyr verbs, that in the background get translated into vendor-appropriate SQL statements. The entire site is dedicate to help you work with R and databases. Another resource that may help you is a webinar we gave back in March about this subject: In it, I used a MS SQL database for the demo.


Thanks a lot for your reply :slight_smile:
I will go through the links which you have shared.

You should go through dbplyr route. It will help you write queries and if in future you decide to switch to a different database you will not have to learn a different syntax.

Than after when you fetch the data. Use setDT() function to turn the same data frame into data table. This is a memory efficient command and doesn't create a copy of the data frame.

And I believe data.table is more than enough to handle 10 million rows. But I would still advice to get only that much data into R which is necessary. I mean you can always group by somethings.

Hope it helps