I have a table that contains vendor names along with other details such as address, telephone number, etc. I need to identify vendors whose names are similar to each other. I was able to find exact duplicates, but fuzzy duplicates are proving difficult. Here is just a sample data set:
I went through a similar post, "identifying exact or near duplicate names in a dataset", and tried using tidy_comb_all and tidy_stringdist. However, the result only contains the 'Name' column, whereas I want the 'city' column as well (i.e., the entire data frame; my original data has many other columns and I want the information from all of them). How can I achieve this? @john.smith Would you be able to look into this problem? Thanks in advance!
@andresrcs Thanks a lot, Andresrcs, for taking the time to solve this problem! The code seems to work perfectly with the sample data, but I don't understand the logic of the arguments inside the filter and gather functions. What do 'soundex==0' and 'starts_with("V")' mean?
When I run this code on my original dataset, the output pairs up names that are only slightly similar and differ substantially, e.g., 'Aadiyta' and 'Aaram Techserve'. Also, which method does the code use to calculate the string distance?
I wrote the following code for my original dataset:
This worked pretty well for identifying the differences, but it compares 'Antila, Thomas' with 'ANTILA, THOMAS' and then again compares 'ANTILA, THOMAS' with 'Antila, Thomas', so every pair shows up twice in the output. It also lists only the 'Name' column, whereas I want all the other columns of my original dataset in the output. How can I achieve that?
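Both issues (each pair reported twice, and only 'Name' in the output) can be sketched in base R as follows. This is a minimal illustration, not the code from this thread: the data frame `vendors`, its `city` values, the threshold of 2, and the `_1`/`_2` suffixes are all assumptions made for the example.

```r
# Hypothetical sample data; column names and city values are placeholders.
vendors <- data.frame(
  Name = c("Antila, Thomas", "ANTILA, THOMAS", "Aadiyta", "Aaram Techserve"),
  city = c("City A", "City A", "City B", "City C"),
  stringsAsFactors = FALSE
)

# combn() returns every unordered pair of row indices exactly once,
# so rows (1, 2) are compared but (2, 1) is never generated.
idx <- t(combn(nrow(vendors), 2))
pairs <- data.frame(row_1 = idx[, 1], row_2 = idx[, 2])

# Levenshtein distance (base R adist) on lower-cased names.
pairs$dist <- mapply(
  function(i, j) adist(tolower(vendors$Name[i]), tolower(vendors$Name[j]))[1, 1],
  pairs$row_1, pairs$row_2
)

# Keep only close pairs; the threshold of 2 is an arbitrary assumption.
close_pairs <- pairs[pairs$dist <= 2, ]

# Look up the full original rows for both members of each pair,
# so every column of the data frame is carried into the result.
result <- cbind(
  close_pairs,
  setNames(vendors[close_pairs$row_1, ], paste0(names(vendors), "_1")),
  setNames(vendors[close_pairs$row_2, ], paste0(names(vendors), "_2"))
)
rownames(result) <- NULL
print(result)
```

Because the pairs are built from row indices rather than from the name strings, any number of extra columns in the original data comes along without further changes.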
I simply chose an arbitrary metric and threshold as an example. Selecting the most appropriate method, metric, and threshold is up to you and, in my opinion, requires domain-specific knowledge to fine-tune.
starts_with("V") selects the columns that hold the two members of each matched pair; both column names start with "V" (i.e., V1 and V2).
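To get a feel for how much the choice of metric matters, it can help to compute several distances for the same pairs side by side. The sketch below assumes the stringdist package is installed; the two example pairs are taken from names mentioned in this thread, and lower-casing before comparison is an assumption, not part of the original code.

```r
library(stringdist)

# Two example pairs built from names mentioned in this thread.
a <- c("Antila, Thomas", "Aadiyta")
b <- c("ANTILA, THOMAS", "Aaram Techserve")

comparison <- data.frame(
  a = a,
  b = b,
  # Levenshtein: number of single-character edits.
  lv = stringdist(tolower(a), tolower(b), method = "lv"),
  # Jaro-Winkler: a distance scaled between 0 (identical) and 1.
  jw = stringdist(tolower(a), tolower(b), method = "jw"),
  # Soundex: 0 if both strings get the same phonetic code, 1 otherwise.
  soundex = stringdist(a, b, method = "soundex"),
  stringsAsFactors = FALSE
)
print(comparison)
```

An edit-based metric gives a graded score that can be thresholded, while soundex is all-or-nothing, so the same threshold value means very different things depending on the method.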
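As a rough illustration of what the two expressions do, here is a small sketch assuming dplyr, tidyr, and stringdist are installed. It is not the original answer's code: the `pairs` tibble is built by hand to mimic the V1/V2 layout, and the `position` column name is made up for the example.

```r
library(dplyr)
library(tidyr)
library(stringdist)

# Hand-built stand-in for a table of name combinations (columns V1 and V2).
pairs <- tibble(
  V1 = c("Antila, Thomas", "Aadiyta"),
  V2 = c("ANTILA, THOMAS", "Aaram Techserve")
)

# The soundex distance is 0 when both names share a soundex code, else 1.
pairs <- pairs %>%
  mutate(soundex = stringdist(V1, V2, method = "soundex"))

# soundex == 0 keeps only the pairs that sound alike under that coding.
matches <- pairs %>%
  filter(soundex == 0)

# starts_with("V") picks V1 and V2, and gather() stacks them into one
# 'Name' column, giving one row per vendor name instead of one row per pair.
long <- matches %>%
  gather(key = "position", value = "Name", starts_with("V"))
print(long)
```

Having a single 'Name' column after the reshape is what makes it possible to join the matches back to the original data frame and recover the other columns.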