Converting to SQL into DPLYR

I have the following two tables in R (the "Dates" appear as "Factors" in my data):

table_1 = data.frame(id = c("123", "123", "125", "125"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("123", "123", "125", "125"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

I am trying to perform an "inner join" if either Condition 1 OR Condition 2 is true:

Condition_1

if table_1$id = table_2$id AND

if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

Condition_2

if table_1$id2 = table_2$id2

Question: I tried to use the code below to achieve the above task:

library(sqldf)

sqldf("select distinct * 
  from table_1 a inner join table_2 b
  on (a.date_1 between b.date_2 and b.date_3) and 
     a.id = b.id or (a.id2 = b.id2)")

Can someone please tell me if it is possible to convert this code into "dplyr"? I have heard that "dplyr" is more efficient at performing such opperations on larger datasets - is this true?

Thanks!

1 Like

Im not work whit SQL in R, but I find this information, may be could help you, Omario.

1 Like

Not quite. If the query works in SQL, it will run faster than in R, and you're better off saving the SQL query into a new table and bringing that into the R session.

1 Like

thank you! I will take a look at this!

Thank you for your reply! Do you know if there is a standard way to convert this code into DPLYR?
Thank you!

You can bring both tables into an R session from a MySQL/MariaDB database, with code like this

library(DBI)
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, username="root", password="", dbname ="dlf", host="localhost")

tab1 <- dbGetQuery(con, "SELECT fico FROM y6")
tab2 <- dbGetQuery(con, "SELECT cltv FROM y6")

and then do filtering and joins with dplyr syntax.

Since you have a non-equi join going on, anything in dplyr wouldn’t be more efficient than what you currently have, IMO. The data.table package would work well here. Check out this blog on a situation similar to yours:

2 Likes

@ michaelbgarcia :

Thank you so much for your reply! I will check this out!

If you have time, can you please take a look at the SQL code from my question and let me know if the logic I am using is correct?

Thank you!

You can use this script:

union(inner_join(table_1, table_2, sql_on = "table_1.date_1 between table_2.date_2 and tatable_2.date_3", by ="id" ,keep = TRUE), inner_join(table_1, table_2, by = "id2", keep = TRUE))

1 Like

@ WindyNite1: Thank you so much for your reply! Do you think that the following line of code will work (I placed your code through the SQL function)?

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

#not sure if this is correct?
sample_query = sqlQuery(con, "union(inner_join(table_1, table_2, sql_on = "table_1.date_1 between table_2.date_2 and tatable_2.date_3", by ="id" ,keep = TRUE), inner_join(table_1, table_2, by = "id2", keep = TRUE))")

Thanks!

@omario : the script is used for dplyr package based on your post 7 days ago. With your current code using RODBC package, it does not work because sqlQuery function() is used to submit a SQL query to database. In this case, the SQL query can be :

SELECT distinct * FROM table_1 JOIN table_2
ON table_1.id = table_2.id
WHERE table1.date_1 between table_2.date_2 and table_2.date_3
UNION
SELECT distinct * FROM table_1 JOIN table_2
ON table_1.id2 = table_2. id2

or the query can be your code being written in another way.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.