How to re-use the fuzzy-matching (for address) code built with apply/sapply, in a data.frame's groupby setup i.e. match within groups?

balaca · March 25, 2022, 6:45pm

I'm trying to group a list of addresses for a bunch of individuals—an individual can have more than 1 address mapped to him—while addresses are captured in the system with all manual inconsistencies e.g. typo (or) additional info/title in some versions of same address.

library(tidyverse)
df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2),
  addresses = c(
    'king st toronto',
    'queen st',
    'king toronto',
    'broadway st',
    'broadway ave',
    'attn: broadway ave'
  )
)

It doesn't matter which one of an address' variation I'm choosing finally, but all that is required is, group/recognize them as ONE same address, say, in a new column.

I used Levenshtein edit distance, along with baseR's apply and sapply as shown below to do fuzzy matching, and then map to 1 unique address(in fuzzy sense) per individual (here I picked the variation with fewer characters but any one representation is okay).

matches <-
  sapply(df[['addresses']], function(pattern)
    agrepl(pattern, df[['addresses']], max.distance = 0.3))

apply(matches , 1, function(arg)
  df[['addresses']][arg][which.min(nchar(df[['addresses']][arg]))])

This code works as stand-alone for 1 group, but I'm not able to generalize it to entire data.frame with multiple groups, say in a dplyr/groupby setup. I tried using plyr:ddply(data.frame, .(groupby_var), <FUNCTION>) but ran into error 'Error in apply() dim(X) must have a positive length'.

Expected Output:

individuals	addresses
1	king toronto
1	queen st
1	king toronto
1	broadway st
2	broadway ave
2	broadway ave

pieterjanvc · March 26, 2022, 12:26pm

Hi there,

I'm not sure if I get what you want, but here is an attempt

library(tidyverse)
df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2,3,3,3,3),
  addresses = c(
    'king st toronto',
    'queen st',
    'king toronto',
    'broadway st',
    'broadway ave',
    'attn: broadway ave',
    'main st.',
    'main street',
    'center st',
    'center street'
  )
)


bestAddress = function(x){
  
  #Your old logic
  matches <-
    sapply(x, function(pattern)
      agrepl(pattern, x, max.distance = 0.3))
  
  matches = apply(matches , 1, function(arg){
    x[arg][which.min(nchar(x[arg]))]
  }) 
  
  #Get the most used name, break ties by length
  matches = tibble(name = matches) %>% 
    group_by(name) %>% 
    mutate(n = n(), nchar = nchar(name)) %>% 
    ungroup() %>%  arrange(desc(n), nchar) %>% 
    slice(1) %>% pull(name)
  
  rep(matches, length(x))
}


df %>% group_by(individuals) %>% 
  mutate(bestAddress = bestAddress(addresses)) %>% 
  ungroup()
#> # A tibble: 10 x 3
#>    individuals addresses          bestAddress 
#>          <dbl> <chr>              <chr>       
#>  1           1 king st toronto    king toronto
#>  2           1 queen st           king toronto
#>  3           1 king toronto       king toronto
#>  4           1 broadway st        king toronto
#>  5           2 broadway ave       broadway ave
#>  6           2 attn: broadway ave broadway ave
#>  7           3 main st.           main st.    
#>  8           3 main street        main st.    
#>  9           3 center st          main st.    
#> 10           3 center street      main st.

^{Created on 2022-03-26 by the reprex package (v2.0.1)}

I reused most of your logic, but then once the best street names are chosen per individual, I take the one that occurs most often and break a tie by picking the shortest address (see new individual 3).

Hope this helps,
PJ

system · April 16, 2022, 12:27pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.