Converting to SQL into DPLYR

omario · December 3, 2021, 2:14am

I have the following two tables in R (the "Dates" appear as "Factors" in my data):

table_1 = data.frame(id = c("123", "123", "125", "125"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("123", "123", "125", "125"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:

Condition_1

if table_1$id = table_2$id AND

if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

Condition_2

if table_1$id2 = table_2$id2

Question: I tried to use the code below to achieve the above task:

library(sqldf)

sqldf("select distinct * 
  from table_1 a inner join table_2 b
  on (a.date_1 between b.date_2 and b.date_3) and 
     a.id = b.id or (a.id2 = b.id2)")

Can someone please tell me if it is possible to convert this code into "dplyr"? I have heard that "dplyr" is more efficient at performing such opperations on larger datasets - is this true?

Thanks!

M_AcostaCH · December 3, 2021, 2:45am

Im not work whit SQL in R, but I find this information, may be could help you, Omario.

technocrat · December 3, 2021, 3:22am

Not quite. If the query works in SQL, it will run faster than in R, and you're better off saving the SQL query into a new table and bringing that into the R session.

omario · December 3, 2021, 3:25am

thank you! I will take a look at this!

omario · December 3, 2021, 3:25am

Thank you for your reply! Do you know if there is a standard way to convert this code into DPLYR?
Thank you!

technocrat · December 3, 2021, 3:37am

You can bring both tables into an R session from a MySQL/MariaDB database, with code like this

library(DBI)
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, username="root", password="", dbname ="dlf", host="localhost")

tab1 <- dbGetQuery(con, "SELECT fico FROM y6")
tab2 <- dbGetQuery(con, "SELECT cltv FROM y6")

and then do filtering and joins with dplyr syntax.

michaelbgarcia · December 3, 2021, 3:51am

Since you have a non-equi join going on, anything in dplyr wouldn’t be more efficient than what you currently have, IMO. The data.table package would work well here. Check out this blog on a situation similar to yours:

omario · December 3, 2021, 4:51am

@ michaelbgarcia :

Thank you so much for your reply! I will check this out!

If you have time, can you please take a look at the SQL code from my question and let me know if the logic I am using is correct?

Thank you!

WindyNite1 · December 8, 2021, 11:17pm

You can use this script:

union(inner_join(table_1, table_2, sql_on = "table_1.date_1 between table_2.date_2 and tatable_2.date_3", by ="id" ,keep = TRUE), inner_join(table_1, table_2, by = "id2", keep = TRUE))

omario · December 10, 2021, 3:12am

@ WindyNite1: Thank you so much for your reply! Do you think that the following line of code will work (I placed your code through the SQL function)?

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

#not sure if this is correct?
sample_query = sqlQuery(con, "union(inner_join(table_1, table_2, sql_on = "table_1.date_1 between table_2.date_2 and tatable_2.date_3", by ="id" ,keep = TRUE), inner_join(table_1, table_2, by = "id2", keep = TRUE))")

Thanks!

WindyNite1 · December 10, 2021, 4:23am

@omario : the script is used for dplyr package based on your post 7 days ago. With your current code using RODBC package, it does not work because sqlQuery function() is used to submit a SQL query to database. In this case, the SQL query can be :

SELECT distinct * FROM table_1 JOIN table_2
ON table_1.id = table_2.id
WHERE table1.date_1 between table_2.date_2 and table_2.date_3
UNION
SELECT distinct * FROM table_1 JOIN table_2
ON table_1.id2 = table_2. id2

or the query can be your code being written in another way.

system · December 31, 2021, 4:23am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.