Combining R and RDBMS tables for data analysis

Udayasoorian · September 6, 2019, 8:55am

Hi All ,
I have created a multiple datasets in excel with each data set depend on other for its values. It would be easy for me if I create tables with primary key and foreign keys. insert the values in master tableand based on the values in master table I cud update the values in the derivative tables. after that I need to create graphs and do my analysis from those values . is it possible in R or should I create a dataset in excel each time ,calculate the derivative dataset , then load it in R and do my analysis. Could you kindly advise me?

jlacko · September 6, 2019, 9:58am

Joining data by key is easy in R - look for package dplyr, specifically left_join() and inner_join() functions.

If you are used to joining tables in SQL / RDBMS environment you will find the techniques familiar.

Also note that it is possible (although it takes some effort to set up) to access RDBMS data from R.

Udayasoorian · September 6, 2019, 10:48am

Hi Jlacko,
Thanks for your answer. So in such case if columns in two of my tables depends on my master table then I should do all calculations in Database itself, ater that I shouls pull the data from DB and then do my analysis in R .Am I right? I prefer RDBMS bcos my column value depends on other column value . if there any other is possible in R then please kindly enlighten me

my requirement is

I have a data set of daily values for various sectors namely a,b, c (it may go till 50)
then in another data set I will have monthly values which is cumulation of my daily values so I should sum of my daily values ..
then another data set will have cumulative values for quarter
I have to analyze each daily values individually for various sectors and also cumulative performance monthly for the same sectors. If I put it as Tables I can easily retrieve values and analyse it .

if there is any other way then please kindly enlighten me .
Thanks

andresrcs · September 6, 2019, 4:49pm

Calculating accumulations in advance an querying multiple tables (which are variations of the raw data) sound like an unnecessarily complicated workflow, once your raw data is imported into R, performing aggregations and data wrangling is pretty simple, see this example.

library(dplyr)
library(lubridate)
library(tibbletime)

# Sample data mimicking your DB
set.seed(1234)

sample_df <- data.frame(
    date = seq.Date(as.Date("2019-01-01"), as.Date("2019-04-30"), 1),
    a = rnorm(120),
    b = rnorm(120),
    c = rnorm(120)
)

# Monthly aggregation
sample_df %>% 
    as_tbl_time(date) %>% 
    collapse_by("month", side = "start", clean = TRUE) %>%
    group_by(date) %>% 
    summarise_all(sum) %>% 
    mutate(month = month(date, label = TRUE, abbr = FALSE)) %>% 
    select(month, everything(), -date)
#> # A tibble: 4 x 4
#>   month         a     b      c
#>   <ord>     <dbl> <dbl>  <dbl>
#> 1 January   -7.79  2.88  3.39 
#> 2 February -16.5   4.20  4.57 
#> 3 March      3.00  4.31  0.800
#> 4 April      3.35  2.17 -1.09

# Quarterly aggregation
sample_df %>%
    as_tbl_time(date) %>% 
    collapse_by("quarterly", side = "start", clean = TRUE) %>%
    group_by(date) %>% 
    summarise_all(sum) %>% 
    mutate(quarter = quarter(date)) %>% 
    select(quarter, everything(), -date)
#> # A tibble: 2 x 4
#>   quarter      a     b     c
#>     <int>  <dbl> <dbl> <dbl>
#> 1       1 -21.3  11.4   8.76
#> 2       2   3.35  2.17 -1.09

^{Created on 2019-09-06 by the reprex package (v0.3.0.9000)}

Udayasoorian · September 13, 2019, 6:22am

Hi andre,
Thanks for your solution. I am able to utilize it . Thanks once again.

Regards,
Uday

system · September 20, 2019, 6:22am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.