What type of parallelism best applies to unique client identification using name & address similarities?

Hello,
I am dealing with an insurance dataset of over 160k rows, where each observation stands for an insurance policy (contract).

The data are consolidated from a network of insurance agencies where the client ID is unique per agency only. Therefore, if a client subscribes to two insurance policies in two different agencies, we get two different IDs for the same client.

In order to fix this and obtain a 360° view of policyholders and their claims behavior, I am doing the following:

  1. First part of the algorithm:
  • Calculating similarite1, the name-based similarity, using the stringsim() function of the stringdist package
  • Calculating similarite2, the address-based similarity, also using stringsim() (a minimal sketch of these two calls is shown right after this list)
  2. Second part of the algorithm:
  • Assessing whether similarite1 > 0.92
  • Assessing whether any of the 4 equality combinations between Phone1 & Phone2 for rows i & j returns TRUE
  • Assessing whether similarite2 > 0.15 (only 0.15 because addresses are written in very heterogeneous ways)
  • If a pair i vs j satisfies all of the above conditions:
    1. j is assigned as the value of New_ID (A[i, 7])
    2. the longer of the two addresses of rows i & j is kept as New_ADRESSE (A[i, 8])
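
A minimal sketch of the two stringsim() calls, using two of the sample names and addresses below (values illustrative only, default settings otherwise):

library(stringdist)

# similarite1: name-based similarity (optimal string alignment distance)
similarite1 <- stringsim("FFFFFF NNNNN", "FFFFFFF NNNNN", method = "osa")

# similarite2: address-based similarity (q-gram counts)
similarite2 <- stringsim("CITE80/800 LOGTS AIN AZEL", "CITE80/800 LOGTS", method = "qgram")

similarite1  # about 0.92
similarite2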

Problem:
The main problem I am facing is performance: it would take days to process the 160k rows! :expressionless:
So I think I have no choice but to parallelize...

Below is a data sample that has the same types & structure as the real data, followed by the code I wrote.

Data sample (not real data):

library(tibble)

A <- tibble::tribble(
  ~Name, ~Adresse,  ~Phone1, ~Phone2, ~Name_first_charc, ~Name_length, ~New_ID, ~New_ADRESSE, 
  "FFFFFF NNNNN", "CITE80/800 LOGTS AIN AZEL",  "0670333256", "0780222256", "F", 12 , "", "", 
  "FFFFFFF NNNNN", "CITE80/800 LOGTS",  "0780222256", NA, "F", 13 , "", "", 
  "AAAAAAAAA MMMMMM", "AIT SMAIL",  "0264587452", NA, "A", 16 , "", "", 
  "IIIII KKKKK", "CONSTANTINE",  "0925485215", "0425485999", "I", 11 , "", "", 
  "IIIII KKKKKK", "CONSTANTINE",  "0925485215", NA, "I", 11 , "", "", 
  "FFFFFF NNNNN", "CITE80/800 LOGTS AIN AZEL",  "0670333256", "0780222256", "F", 12 , "", "", 
  "FFFFFFF NNNNN", "CITE80/800 LOGTS",  "0780222256", NA, "F", 13 , "", "", 
  "AAAAAAAAA MMMMMM", "AIT SMAIL",  "0264587452", NA, "A", 16 , "", "", 
  "IIIII KKKKK", "CONSTANTINE",  "0925485215", "0425485999", "I", 11 , "", "", 
  "IIIII KKKKKK", "CONSTANTINE",  "0925485215", NA, "I", 11 , "", ""
) 
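
For completeness, the helper columns Name_first_charc and Name_length are precomputed; something like the following, assuming (this step is not shown above) that they are derived directly from Name:

library(dplyr)
library(stringr)

# Derive the blocking key (first letter) and the name length used for segmentation
A <- A %>%
  mutate(Name_first_charc = str_sub(Name, 1, 1),
         Name_length      = str_length(Name))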

Algorithm

library(tidyverse)
library(stringr) 
library(stringdist)

L <- nrow(A)

for(i in 1:L){
  for(j in 1:L){
    # Avoiding redundancy
    if(j>=i){
      # Segmentation 1: restrict calculations only for names starting with the same letter
      if(A$Name_first_charc[i] == A$Name_first_charc[j]){
        
        # Segmentation 2: restrict calculations to pairs whose name lengths differ by at most 3 characters
        if(A$Name_length[i] >= A$Name_length[j] - 3 & 
           A$Name_length[i] <= A$Name_length[j] + 3 ){
          
          ## Similarities calculation part ************************************************************
          # Name similarity calculation 
          similarite1 <- stringsim(A$Name[i], 
                                   A$Name[j], 
                                   method = "osa", 
                                   useBytes = TRUE)
          
          # Address similarity calculation 
          similarite2 <- stringsim(A$Adresse[i], 
                                   A$Adresse[j], 
                                   method = "qgram", 
                                   useBytes = TRUE)
          
          ## Decision part **************************************************************************** 
          if(similarite1 > 0.92) {
            if(!is.na(similarite2) ){
              
              # isTRUE() so that NA phone numbers do not break the if() test
              if(isTRUE(A$Phone1[i] == A$Phone1[j] | 
                        A$Phone1[i] == A$Phone2[j] | 
                        A$Phone2[i] == A$Phone2[j] | 
                        A$Phone2[i] == A$Phone1[j])){
                
                if(similarite2 > 0.15){
                  
                  # Assign the new ID to all qualifying rows 
                  A[i, 7] <- as.character(j)  # New_ID
                  # Keep the longer of the two addresses as the new address of the multiple insured
                  A[i, 8] <- ifelse(str_length(A$Adresse[j]) >= str_length(A$Adresse[i]), 
                                    A$Adresse[j], 
                                    A$Adresse[i])  # New_ADRESSE
                }
              } 
            }    
          }
          ## **************************************************************************************
          
        } # segmentation 2 end 
      } # segmentation 1 end 
    } # end of the redundancy-avoiding if statement (j >= i)
  } # End of the j loop  
  
  # Print advancement  ******************************************************************** 
  print(paste("Row", i, "of", L, "- advancement", round((i / L) * 100, 2), "%"))
  ## **************************************************************************************
  
} # End of the i loop 
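
For reference, below is a rough vectorised sketch I am considering as an alternative to the nested loops: it builds candidate pairs with a self-join on the first letter and then calls stringsim() once on whole vectors. It reuses the column names and thresholds above, needs dplyr >= 1.1 for the relationship argument, and is only a sketch (not yet tested on the real data):

library(dplyr)
library(stringdist)

# Candidate pairs: self-join on the blocking key (same first letter),
# then keep j >= i and a name-length difference of at most 3 characters
A2 <- A %>% mutate(row_id = row_number())

pairs <- inner_join(A2, A2, by = "Name_first_charc",
                    suffix = c("_i", "_j"),
                    relationship = "many-to-many") %>%
  filter(row_id_j >= row_id_i,
         abs(Name_length_i - Name_length_j) <= 3)

# Vectorised similarity and phone checks over all candidate pairs at once
matches <- pairs %>%
  mutate(similarite1 = stringsim(Name_i, Name_j, method = "osa", useBytes = TRUE),
         similarite2 = stringsim(Adresse_i, Adresse_j, method = "qgram", useBytes = TRUE),
         phone_match = coalesce(Phone1_i == Phone1_j | Phone1_i == Phone2_j |
                                Phone2_i == Phone2_j | Phone2_i == Phone1_j, FALSE)) %>%
  filter(similarite1 > 0.92, phone_match, similarite2 > 0.15)

# matches then holds, for each row_id_i, the matching row_id_j to use as New_ID and
# both addresses, from which the longer one can be kept as New_ADRESSE

My understanding is that this alone would remove most of the per-pair overhead, because stringsim() is vectorised, even before adding any parallelism.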

Questions:

  1. What type of parallelism best applies to this problem: compute-bound or data-bound?
  2. Can we split the calculations both by data and by computation (since I am already treating names that start with the same letter separately)?
  3. From several readings I have seen that the nested for loops should be replaced by functions from the apply family; is that mandatory in order to implement parallelism?
  4. Which package do you recommend for the relative simplicity of implementing parallelism (parallel, foreach, ...)? (A skeleton of what I have in mind is sketched after this list.)
  5. What would be the best way to divide the algorithm? I thought about four parts:
  • a 1st function that calculates the name-based similarities
  • a 2nd function that calculates the address-based similarities
  • a 3rd function that allocates the New_ID (A[i, 7])
  • a 4th function that allocates (harmonizes) the New_ADRESSE (A[i, 8])
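
To make question 4 concrete, this is the kind of foreach/doParallel skeleton I have in mind, splitting the data by first letter; process_block() is only a placeholder name for the pairwise comparison logic above:

library(foreach)
library(doParallel)

# Placeholder: compare all rows within one first-letter block and return the
# block with New_ID / New_ADRESSE filled in (logic as in the loop above)
process_block <- function(block) {
  # ... pairwise similarity + phone checks ...
  block
}

cl <- parallel::makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

blocks <- split(A, A$Name_first_charc)

results <- foreach(b = blocks,
                   .packages = c("stringdist", "stringr", "dplyr"),
                   .combine = dplyr::bind_rows) %dopar% {
  process_block(b)
}

parallel::stopCluster(cl)

If the blocks turn out to be very unbalanced (many names starting with the same letter), splitting on the first two characters could balance the workers better.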

Please don't hesitate to criticize the algorithm if you think its performance can be enhanced before thinking about parallelism.
Thanks a lot :pray:
