In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:
library(fuzzyjoin)
library(dplyr)
library(RODBC)
library(sqldf)
con = odbcConnect("some name", uid = "some id", pwd = "abc")
sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
filter(date_1 >= date_2, date_1 <= date_3) )
view(sample_query)
However, I do not think this is possible, because the function which us being used for the "fuzzy join" (stringdist_inner_join) is not supported by Netezza.
I tried to find the source code for this "fuzzy join" function, and found it over here: fuzzyjoin source: R/stringdist_join.R
My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized by Netezza? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza?
Right now I can only execute "sample_query" on locally - re-writing this function (stringdist_inner_join) would let perform the "sample_query" much faster.
Does anyone know if this is possible?
Thanks!
Note:
My data looks like this:
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))
table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)
table_2 = data.frame(id1 = c("0123", "1233", "125 .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))
table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)