Hello,
I am dealing with an insurance dataset of 160k+ rows where each observation stands for an insurance policy (contract).
The data are consolidated from a network of insurance agencies, and the client ID is unique per agency only. Therefore, if a client subscribes to two insurance policies in two different agencies, we get two different (02) IDs for the same client.
In order to fix this and obtain a 360° view of policyholders and their claims behavior, I am:
- First part of the algorithm:
  - Calculating `similarite1`, a name-based similarity, using the `stringsim()` function of the `stringdist` package
  - Calculating `similarite2`, an address-based similarity, using the `stringsim()` function of the `stringdist` package
- Second part of the algorithm:
  - Assessing if `similarite1 > 0.92`
  - Assessing if any of the 4 equality combinations between `Phone1` & `Phone2` for rows `i` & `j` returns `TRUE`
  - Assessing if `similarite2 > 0.15` (0.15 because addresses are written in very heterogeneous ways)
- If any pair `i` vs `j` meets all of the above conditions:
  - `j` is assigned as the value of `New_ID` (`A[i, 7]`)
  - the longer of the two addresses (row `i` vs row `j`) is kept as `New_ADRESSE` (`A[i, 8]`)
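To make the decision rule concrete, here is a minimal sketch of the check applied to one candidate pair. The function name and its inputs are illustrative, not columns from the real data; `any(..., na.rm = TRUE)` covers the 4 phone combinations even when `Phone2` is missing:

```r
library(stringdist)

# Decision rule for one candidate pair (i, j) -- a sketch of the logic above.
# name1/name2, adr1/adr2 and the phone vectors are illustrative inputs.
is_same_client <- function(name1, name2, adr1, adr2, phones1, phones2) {
  similarite1 <- stringsim(name1, name2, method = "osa")
  similarite2 <- stringsim(adr1, adr2, method = "qgram")
  # any() over the 4 phone combinations; na.rm guards against missing numbers
  phone_match <- any(outer(phones1, phones2, "=="), na.rm = TRUE)
  isTRUE(similarite1 > 0.92) && phone_match && isTRUE(similarite2 > 0.15)
}

is_same_client("FFFFFF NNNNN", "FFFFFFF NNNNN",
               "CITE80/800 LOGTS AIN AZEL", "CITE80/800 LOGTS",
               c("0670333256", "0780222256"), c("0780222256", NA))
```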
Problem:
The main problem I am facing is performance: at this rate it would take days to process the 160k rows! I suspect I have no choice but to parallelize...
Below is a data sample with the same type & structure as the real data, followed by the code I wrote.
Data sample (not real data):

```r
library(tibble)

A <- tibble::tribble(
  ~Name, ~Adresse, ~Phone1, ~Phone2, ~Name_first_charc, ~Name_length, ~New_ID, ~New_ADRESSE,
  "FFFFFF NNNNN",     "CITE80/800 LOGTS AIN AZEL", "0670333256", "0780222256", "F", 12, "", "",
  "FFFFFFF NNNNN",    "CITE80/800 LOGTS",          "0780222256", NA,           "F", 13, "", "",
  "AAAAAAAAA MMMMMM", "AIT SMAIL",                 "0264587452", NA,           "A", 16, "", "",
  "IIIII KKKKK",      "CONSTANTINE",               "0925485215", "0425485999", "I", 11, "", "",
  "IIIII KKKKKK",     "CONSTANTINE",               "0925485215", NA,           "I", 11, "", "",
  "FFFFFF NNNNN",     "CITE80/800 LOGTS AIN AZEL", "0670333256", "0780222256", "F", 12, "", "",
  "FFFFFFF NNNNN",    "CITE80/800 LOGTS",          "0780222256", NA,           "F", 13, "", "",
  "AAAAAAAAA MMMMMM", "AIT SMAIL",                 "0264587452", NA,           "A", 16, "", "",
  "IIIII KKKKK",      "CONSTANTINE",               "0925485215", "0425485999", "I", 11, "", "",
  "IIIII KKKKKK",     "CONSTANTINE",               "0925485215", NA,           "I", 11, "", ""
)
```
Algorithm:

```r
library(tidyverse)
library(stringr)
library(stringdist)

L <- nrow(A)
for(i in 1:L){
  for(j in 1:L){
    # Avoid redundancy: only compare each pair once
    if(j >= i){
      # Segmentation 1: restrict calculations to names starting with the same letter
      if(A$Name_first_charc[i] == A$Name_first_charc[j]){
        # Segmentation 2: restrict calculations to names within +/- 3 characters in length
        if(abs(A$Name_length[i] - A$Name_length[j]) <= 3){
          ## Similarity calculation part ********************************************
          # Name similarity
          similarite1 <- stringsim(A$Name[i], A$Name[j],
                                   method = "osa", useBytes = TRUE)
          # Address similarity
          similarite2 <- stringsim(A$Adresse[i], A$Adresse[j],
                                   method = "qgram", useBytes = TRUE)
          ## Decision part **********************************************************
          if(similarite1 > 0.92 && !is.na(similarite2)){
            # any() with na.rm = TRUE guards against missing phone numbers,
            # which would otherwise make the if() condition NA and fail
            phone_match <- any(c(A$Phone1[i] == A$Phone1[j],
                                 A$Phone1[i] == A$Phone2[j],
                                 A$Phone2[i] == A$Phone2[j],
                                 A$Phone2[i] == A$Phone1[j]),
                               na.rm = TRUE)
            if(phone_match && similarite2 > 0.15){
              # Assign the new ID
              A$New_ID[i] <- as.character(j)
              # Keep the longest address as the new address of the multiple insured
              A$New_ADRESSE[i] <- ifelse(str_length(A$Adresse[j]) >= str_length(A$Adresse[i]),
                                         A$Adresse[j],
                                         A$Adresse[i])
            }
          }
        } # segmentation 2 end
      } # segmentation 1 end
    } # end of the redundancy check (j >= i)
  } # end of the j loop
  # Print progress
  print(paste(i, "of", L, "- Advancement", round((i / L) * 100, 2), "%"))
} # end of the i loop
```
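As a possible improvement over the pair-by-pair loops, I also sketched a vectorized version: `stringsim()` accepts vectors, so all candidate pairs inside one block (same first letter, name length within ±3) can be scored in a single call. The column names below mirror the sample data, but `score_block()` itself is only an illustration:

```r
library(stringdist)

# Score every i < j pair inside one block with a single vectorized
# stringsim() call, instead of one call per pair. A sketch of the idea;
# `block` only needs the columns used below.
score_block <- function(block) {
  if (nrow(block) < 2) return(NULL)
  p <- t(combn(nrow(block), 2))              # all i < j index pairs
  i <- p[, 1]; j <- p[, 2]
  keep <- abs(block$Name_length[i] - block$Name_length[j]) <= 3
  i <- i[keep]; j <- j[keep]
  if (length(i) == 0) return(NULL)
  data.frame(
    i = i, j = j,
    similarite1 = stringsim(block$Name[i],    block$Name[j],    method = "osa"),
    similarite2 = stringsim(block$Adresse[i], block$Adresse[j], method = "qgram")
  )
}

# Illustrative block (the "F" names from the sample):
blk <- data.frame(
  Name        = c("FFFFFF NNNNN", "FFFFFFF NNNNN"),
  Adresse     = c("CITE80/800 LOGTS AIN AZEL", "CITE80/800 LOGTS"),
  Name_length = c(12, 13)
)
score_block(blk)
```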
Questions:
- What type of parallelism best applies to this problem: computation-bound or data-bound?
- Can the calculations be split both by data and by computation (as I am already treating names that start with the same letter separately)?
- From several readings I have seen suggestions to replace the nested `for` loops with loops from the `apply` family. Is this mandatory in order to implement parallelism?
- Which package do you recommend for the relative simplicity of implementing parallelism (`parallel`, `foreach`, ...)?
- What can be the best way to divide the algorithm? I thought about four (04) parts:
  - a 1st function that calculates name-based similarities
  - a 2nd function that calculates address-based similarities
  - a 3rd function that allocates the `New_ID` (`A[i, 7]`)
  - a 4th function that allocates (harmonizes) the `New_ADRESSE` (`A[i, 8]`)
Please don't hesitate to criticize the algorithm itself if you think its performance can be improved before resorting to parallelism.
Thanks a lot