Hi,
This is a continuation of the question posed in identifying Identifying Fuzzy Duplicates from a column
I have been playing around with the example and i appear to have hit a wall
If i take the previous example with the aim of cleansing the duplicates to a single Item, I am coming a bit unstuck.
To give you an example, if i look at the shipper "CANON PVT. LTD" i get four matches back. The match with itself and the match with "CANON PVT. LTD." (the difference is the full stop)
Equally the latter appears again matching to the former, so i now have 4 records. I would require only two records. For example I know that "CANON PVT. LTD" could act as a key for all four matches. Can anyone see a way on how to remove the additional double match
Equally in the example, i would like to cleanse Antila,Thomas
and ANTILA,THOMAS
to just Antila,Thomas
in maybe a third grouping column
Apologies if this seems a bit confusing
library(tidystringdist)
library(tidyverse)
# Sample data
df <- data.frame(stringsAsFactors = FALSE,
Name = as.factor(c(" CANON PVT. LTD ", " Antila,Thomas ", " Greg ",
" St.Luke's Hospital ", " Z_SANDSTONE COOLING LTD ",
" St.Luke's Hospital ", " CANON PVT. LTD. ",
" SANDSTONE COOLING LTD ", " Greg ", " ANTILA,THOMAS "))
)
fuzzy_match_string <- function(x, y, jw_dist){
grid_df <- expand.grid(unique(x),
unique(y)) %>%
rename(V1 = Var1, V2 = Var2) %>%
mutate_all(as.character) %>%
tidy_stringdist(.) %>%
filter(jw < jw_dist) %>%
select(V1, V2, jw) %>%
unique()
}
map_df(df$Name, fuzzy_match_string, df$Name, 0.07) %>%
unique() %>%
arrange(V1)
#> # A tibble: 12 x 3
#> V1 V2 jw
#> <chr> <chr> <dbl>
#> 1 " Antila,Thomas " " Antila,Thomas " 0
#> 2 " ANTILA,THOMAS " " ANTILA,THOMAS " 0
#> 3 " CANON PVT. LTD " " CANON PVT. LTD " 0
#> 4 " CANON PVT. LTD " " CANON PVT. LTD. " 0.0196
#> 5 " CANON PVT. LTD. " " CANON PVT. LTD " 0.0196
#> 6 " CANON PVT. LTD. " " CANON PVT. LTD. " 0
#> 7 " Greg " " Greg " 0
#> 8 " SANDSTONE COOLING LTD " " Z_SANDSTONE COOLING LTD " 0.0267
#> 9 " SANDSTONE COOLING LTD " " SANDSTONE COOLING LTD " 0
#> 10 " St.Luke's Hospital " " St.Luke's Hospital " 0
#> 11 " Z_SANDSTONE COOLING LTD " " Z_SANDSTONE COOLING LTD " 0
#> 12 " Z_SANDSTONE COOLING LTD " " SANDSTONE COOLING LTD " 0.0267