Hello,
I have a table with ~1M rows (each row is an insurance contract, i.e. one client can have multiple contracts) and the columns client_id, names and adresses. The problem I am trying to solve is that the same client can get a different client_id for each new contract.
To resolve this I have done the following:
- Create a New_ID column as a 4th column in the table
- Iterate twice over names and calculate the name similarity for each combination
- Iterate twice over adresses and calculate the address similarity for each combination
- Inside each iteration: if name_similarity > 0.9 and adresses_similarity > 0.8, then New_ID takes the value of j
Used packages + fake data:
library(tidyverse)
library(stringdist) # strings' similarities
library(parallel) # parallel programming
library(foreach) # parallel programming
library(doParallel) # parallel programming
library(doSNOW) # parallel programming
# Fake data
client_id <- 1:6
names <- c("Name", "Naaame", "Name", "Namee", "Nammee", "Nammee")
adresses <- c("Adress", "Adressss", "Adress", "Adresss", "Aadressss", "Aadressss")
A <- data.frame(cbind(client_id, names, adresses)) %>%
mutate(New_ID = NA)
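Just as a quick check of the toy data (not part of the actual script): data.frame(cbind(...)) builds a character matrix first, so with stringsAsFactors = FALSE (the default since R 4.0) all three original columns end up as character, which is what stringsim() works on, and New_ID starts out as a logical NA column.
sapply(A, class) # client_id, names, adresses: "character"; New_ID: "logical"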
Nested for loops
The nested for loops below work well:
for(i in seq_along(A$client_id)){
for(j in seq_along(A$client_id)){
# calculate names similarities
name_similarity <- stringdist::stringsim(A$names[i],
A$names[j],
method = "osa",
useBytes = T)
# calculate adresses similarities
adresses_similarity <- stringdist::stringsim(A$adresses[i],
A$adresses[j],
method = "qgram",
useBytes = T)
# Decision & New_ID attribution
if(name_similarity > 0.9) {
if(adresses_similarity > 0.85){
A[i , 4] = j # New ID
}
} # decision end
} # Close j loop
} # Close i loop
Although the script above produces the expected result, it would take days to run on the real data (~1M rows, i.e. on the order of 10^12 pairwise comparisons). So I thought of parallel programming.
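For reference, on the toy data the same decision rule can also be written without explicit loops, for example with outer() and stringsim(). This is only a sketch (the object names name_sim, addr_sim and match_mat are just for illustration), and a full 1M x 1M similarity matrix would not fit in memory anyway, so it does not solve the scale problem:
name_sim <- outer(A$names, A$names, stringdist::stringsim, method = "osa", useBytes = TRUE)
addr_sim <- outer(A$adresses, A$adresses, stringdist::stringsim, method = "qgram", useBytes = TRUE)
match_mat <- name_sim > 0.9 & addr_sim > 0.85
# for each row i, keep the last matching j, as the nested loops do
# (the diagonal always matches, so every row has at least one TRUE)
A$New_ID <- apply(match_mat, 1, function(x) max(which(x)))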
Parallel programming:
I have tried to nest two foreach loops using the %:% operator and run them in parallel using the %dopar% operator from the foreach package, with the cluster registered through doSNOW.
cl <- makeCluster(detectCores()) # Initiate clusters (I have 8 cores on my local machine)
registerDoSNOW(cl) # register the parallel backend for foreach
clusterExport(cl, list("A")) # export the data to the workers
clusterEvalQ(cl, c(library(tidyverse),
library(stringdist))) # load the used packages on the workers
foreach(i = seq_along(A$client_id) ) %:%
foreach(j = seq_along(A$client_id)) %dopar%{
# calculate names similarities
name_similarity <- stringdist::stringsim(A$names[i],
A$names[j],
method = "osa",
useBytes = T)
# calculate adresses similarities
adresses_similarity <- stringdist::stringsim(A$adresses[i],
A$adresses[j],
method = "qgram", # same method as in the sequential loop
useBytes = T)
# Decision & New_ID attribution
if(name_similarity > 0.9) {
if(adresses_similarity > 0.85){
A[i , 4] = j # New ID
}
} # decision end
}
stopCluster(cl)
However, after running the nested parallel foreach loops, the New_ID column is still empty. Since foreach returns its results as a list, I also tried to capture the result and unlist() it, roughly like the sketch below, but that doesn't fill New_ID either.
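What I tried looks roughly like this (the same loop body as above, only slightly compacted, with the result captured in a variable res that is just for illustration):
res <- foreach(i = seq_along(A$client_id)) %:%
  foreach(j = seq_along(A$client_id)) %dopar% {
    name_similarity <- stringdist::stringsim(A$names[i], A$names[j],
                                             method = "osa", useBytes = T)
    adresses_similarity <- stringdist::stringsim(A$adresses[i], A$adresses[j],
                                                 method = "qgram", useBytes = T)
    if(name_similarity > 0.9 & adresses_similarity > 0.85) A[i, 4] = j
  }
# res is a nested list with one element per (i, j) pair; pairs that do not
# match return NULL, so unlist(res) has a different length than nrow(A) and
# A$New_ID <- unlist(res) does not reproduce the New_ID column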
How can I write the nested parallel foreach loops so that they give the same result as the nested for loops? Thanks