Hi All. I have an interesting and frankly frustrating problem I'm trying to work through and I'm not sure what I'm doing wrong. Long story short, process is being moved from SaS to R so we have to rewrite some legacy code.
We take two dataframes, combine them by Zip code(duplicates will exist) and then do a stringdist comparison between various columns. Here's the general issue:
If we don't use an iterative approach, we run out of ram. Quickly. This is on a 128gig of RAM EC2 instance mind you..
If we use an iterative approach, it takes FOREVER. We're running Microsoft R Open so my assumption is that this is by default running in a parallel fashion. We have already isolated this to the loop. Almost the entire process after the loop takes reasonable amount of time. If we take the stringdist functionality out of the loop, it will run, however on larger datasets we get back to running out of RAM.
I'm fairly new to R so I'm sure that I'm just going something "wrong" in trying to convert it from the legacy SaS code.
Here's the code that we have isolated as the bottleneck. Essentially the idea is to grab a row, merge it from df1 to df2 by zips, then apply the stringdist logic. If it's a match write to .csv, remove DF and repeat, etc. No match, move to next row.
for(row in 1:length(df1$Zip)) {
df1 <- inner_join(df1, df2, by = c('Zip'))
df1[] <- lapply(df1, as.character)
df1$MatchBizName <- pmax(1-stringdist(df1$BusinessName, df1$BusinessName, method="jw", p=0.1),
1-stringdist(df1$DBAName, df1$DBAName, method="jw", p=0.1),
1-stringdist(df1$BusinessName, df1$DBABusinessName, method="jw", p=0.1),
1-stringdist(df1$DBAName, df1$BusinessName, method="jw", p=0.1))
df1$MatchPhone <- ifelse(1-stringdist(df1$Phone, df1$Phone, method="jw", p=0.1)>=1,1,0)
df1$MatchFirstName <- 1-stringdist(df1$FirstName, df1$FirstName, method="jw", p=0.1)
df1$MatchLastName <- 1-stringdist(df1$LastName, df1$LastName, method="jw", p=0.1)
df1$MatchStreetNum <- 1-stringdist(df1$StreetNum, df1$StreetNum, method="jw", p=0.1)
df1$MatchStreetName <- 1-stringdist(df1$StreetName, df1$StreetName, method="jw", p=0.1)
df1$MatchCity <- 1-stringdist(df1$City, df1$City, method="jw", p=0.1)
df1matches <- subset(df1,(MatchBizName >= 0.9 & MatchCity == 1 & MatchStreetName >= 0.7 & MatchStreetNum >= 0.7) |
(MatchPhone == 1 & MatchFirstName == 1 & MatchLastName == 1 & MatchStreetNum == 1 & MatchStreetName == 1 & MatchCity == 1) |
(MatchBizName >= 0.9 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchCity == 1) |
(MatchStreetNum == 1 & MatchPhone == 1 & MatchStreetName == 1 & MatchCity == 1) |
(MatchLastName >= 0.9 & MatchPhone == 1 & MatchBizName >= 0.9 & MatchCity == 1) |
(MatchPhone == 1 & MatchCity == 1 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchBizName >= 0.6) )
rm(df1)
rm(df1matches)
}