I need some help related to R programming. It would be really nice if someone could help me, as I am a beginner in the R language.
I have one FACT table in a SQL Server database, which will have around 10 million records. I am planning to fetch the entire data set through dbGetQuery() and store it in a data.table, then perform some cleaning, create derived columns, and at the end process the data.
I am a little confused about how I should approach this problem.
If someone has experience with this or has done something similar, please let me know.
I would check out this resource first. From there, here is how I would break down your task:
Do any filtering of records that you would otherwise be removing in R before you bring the data into R; let the database do that work for you.
More generally, any work that you can have the database do for you prior to R is going to help.
Within dplyr, once you connect to a remote database, you can use the verbs to operate on the data within the database as if it were a native R data frame (see the sketch below).
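For example, a minimal sketch of that workflow might look like this (here con is an existing DBI connection, and the table name FactSales and the column names are placeholders for your own schema):
library(DBI)
library(dplyr)
library(dbplyr)
# lazy reference to the remote table; no rows are pulled into R yet
fact_tbl <- tbl(con, "FactSales")
result <- fact_tbl %>%
  filter(OrderYear == 2017) %>%                  # this filter runs inside SQL Server
  select(OrderDate, CustomerKey, SalesAmount) %>%
  collect()                                      # only now do the rows come into R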
While I am guilty of doing the same thing (on much larger datasets), I would certainly attempt to keep as much of the data cleaning/filtering as you can within the SQL Server database in order to avoid the overhead of data going over the wire. Basically, I would hesitate to just SELECT * the data into your R session.
I think it will be easier for people here to help if you can narrow down your question a little bit. What are you looking for help with first?
Constructing the code to connect to and query the database?
Advice on how feasible your plan is performance-wise, or other approaches you might consider?
Pointers to resources for learning how to wrangle data with data.table? (when I’m in data.table land I always keep a cheat sheet handy!)
Specific help with data.table code?
Something else?
An important note: if you have questions about specific code, it’s going to be easiest to get help if you can provide an as-reproducible-as-possible example of what you’ve got so far, even if it’s not working right.
I have one FACT table and various dimension tables. I have created one query that does inner joins with the dimension tables to generate the data with the required columns.
I am using this query to fetch the data into R. There is no filter we can apply, as we have to load the entire year's data.
library(dbplyr)
library(DBI)
library(dplyr)
library(magrittr)
library(odbc)    # provides the odbc() driver used in dbConnect() below

con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "SQL Server",
                      Server   = "xyz",
                      Database = "xyz",
                      UID      = "xyz",
                      PWD      = "xyz",
                      Port     = xyz)

# 'Query' stands in for the join query described above
fact_data <- dbGetQuery(con, 'Query')
This is what I have created so far. The problem is that this query is time consuming because of the many joins and the volume of data.
Yes, ideally, instead of downloading the data to R and then exploring it, it would be better to explore the data where it sits; in other words, use R to push the aggregations and filters so that they are produced inside the database, and then pull down to R only the results. Here is an overview of that technique: http://db.rstudio.com/overview . The central idea is to use dplyr verbs that in the background get translated into vendor-appropriate SQL statements. The entire http://db.rstudio.com/ site is dedicated to helping you work with R and databases. Another resource that may help you is a webinar we gave back in March about this subject: https://www.rstudio.com/resources/videos/best-practices-for-working-with-databases-webinar/ In it, I used a MS SQL database for the demo.
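As a rough sketch of what that looks like in practice (the schema, table, and column names here are placeholders, and con is the connection from your code above), you can build a lazy query, inspect the SQL that dbplyr generates, and collect only the aggregated result:
library(DBI)
library(dplyr)
library(dbplyr)
# lazy reference to the fact table; adjust schema/table names to your database
fact_tbl <- tbl(con, in_schema("dbo", "FactSales"))
monthly <- fact_tbl %>%
  group_by(OrderMonth) %>%
  summarise(total_sales = sum(SalesAmount, na.rm = TRUE))
show_query(monthly)            # prints the T-SQL that dbplyr generated
monthly_df <- collect(monthly) # the aggregation runs in the database; only the summary comes back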
You should go the dbplyr route. It will help you write queries, and if in the future you decide to switch to a different database, you will not have to learn a different syntax.
Then, after you fetch the data, use the setDT() function to turn the data frame into a data.table. This is memory efficient because it converts by reference and doesn't create a copy of the data frame (see the sketch below).
And I believe data.table is more than enough to handle 10 million rows. But I would still advise getting only as much data into R as is necessary; you can always group by something first.
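A short sketch of that setDT() step plus a data.table group-by, assuming fact_data is the data frame returned by dbGetQuery() and the column names are placeholders:
library(data.table)
setDT(fact_data)   # converts the data frame to a data.table by reference, without copying
# example of a derived column and a grouped summary in data.table syntax
fact_data[, sales_per_unit := SalesAmount / Quantity]
by_customer <- fact_data[, .(total_sales = sum(SalesAmount)), by = CustomerKey]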