I'm trying to group a list of addresses for a set of individuals. An individual can have more than one address mapped to them, and the addresses were entered into the system manually, so the same address can appear in several inconsistent versions, e.g. with a typo or with additional info or a title.
library(tidyverse)

df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2),
  addresses = c(
    'king st toronto',
    'queen st',
    'king toronto',
    'broadway st',
    'broadway ave',
    'attn: broadway ave'
  )
)
It doesn't matter which variation of an address I end up choosing; all that is required is that the variations are grouped/recognized as ONE and the same address, say, in a new column.
I used Levenshtein edit distance, along with base R's sapply and apply as shown below, to do the fuzzy matching and then map each address to one unique address (in the fuzzy sense) per individual. Here I picked the variation with the fewest characters, but any one representation is okay.
matches <-
  sapply(df[['addresses']], function(pattern)
    agrepl(pattern, df[['addresses']], max.distance = 0.3))

apply(matches, 1, function(arg)
  df[['addresses']][arg][which.min(nchar(df[['addresses']][arg]))])
This code works stand-alone for one group, but I'm not able to generalize it to the entire data.frame with multiple groups, say in a dplyr group_by setup. I tried plyr::ddply(data.frame, .(groupby_var), <FUNCTION>) but ran into the error 'Error in apply() dim(X) must have a positive length'.
Expected Output:
individuals | addresses |
---|---|
1 | king toronto |
1 | queen st |
1 | king toronto |
1 | broadway st |
2 | broadway ave |
2 | broadway ave |
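
For reference, here is a minimal sketch of how the per-group logic above might be wrapped so it can run inside a dplyr group_by/mutate. The helper name pick_representative and the new column name address_group are hypothetical, chosen only for illustration. One likely cause of the apply() error (an assumption, since the failing call isn't shown) is that sapply returns a plain vector rather than a matrix when a group contains a single address, so the sketch builds the match matrix explicitly. It also assumes addresses are non-empty and not NA.

```r
library(dplyr)

df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2),
  addresses = c(
    'king st toronto', 'queen st', 'king toronto',
    'broadway st', 'broadway ave', 'attn: broadway ave'
  )
)

# Hypothetical helper: for each address in x, return the shortest address
# among those whose pattern approximately matches it (same rule as the
# sapply/apply code above, applied to one group's addresses).
pick_representative <- function(x, max_distance = 0.3) {
  n <- length(x)
  if (n == 0) return(character(0))

  # Column j holds agrepl(x[j], x); matrix() keeps the result
  # two-dimensional even when the group has a single address.
  matches <- matrix(
    unlist(lapply(x, function(pattern)
      agrepl(pattern, x, max.distance = max_distance))),
    nrow = n
  )

  # Row i flags the patterns that approximately match inside x[i];
  # keep the candidate with the fewest characters.
  vapply(seq_len(n), function(i) {
    candidates <- x[matches[i, ]]
    candidates[which.min(nchar(candidates))]
  }, character(1))
}

result <- df %>%
  group_by(individuals) %>%
  mutate(address_group = pick_representative(addresses)) %>%
  ungroup()

print(result)
```

Because the helper only ever sees one individual's addresses at a time, 'broadway st' for individual 1 is never compared against 'broadway ave' for individual 2. Whether the output matches the table above depends on how agrepl scores each pair at max.distance = 0.3, so that threshold may need tuning on real data.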