Fuzzy matching two datasets

jeremyz · December 11, 2023, 2:50pm

Hello. I have received amazing help from this community before. I really appreciate it!

I have two datasets. The first, "Advertised", is a list of toothpaste brands that have been advertised. The second, "Sold", is a list of toothpaste products that have been sold. My goal is to match each item in "Advertised" to its best match in "Sold". Note that in the real data, there are more records in the "Sold" dataset than in "Advertised".

Here is my example:

Advertised <- data.frame(BrandVariant = c("Crest Cavity Protection",
                                          "Colgate Cavity Protection",
                                          "Pepsodent Clean Mint USA"))

Sold <- data.frame(ID = c(1, 2, 3),
                   Ultimate.Company = c("Colgate-Palmolive", "Procter & Gamble-Crest", "Unilever"),
                   Product = c("Colgate 360", "Crest Cavity", "Pepsodent Mint"),
                   Product.Description = c("Colgate's first 360 degree whitening toothpaste",
                                           "Fast acting and whiteness with Crest's cavity buster and protector",
                                           "A clean mint taste for healthy gums"))

There are multiple columns in "Sold" that I would like to match against. Best case scenario would be if I could have the ID value from "Sold" lined up with the best match from "Advertised" along with a similarity score.

Matthias · December 12, 2023, 8:22am

There is a package called fuzzyjoin that can do the trick!

library(fuzzyjoin)
library(dplyr)

Sold_Products <- Sold %>%
  stringdist_left_join(Advertised, by = c(Product = "BrandVariant"), # column names in Sold & Advertised to match 
                       max_dist = 12,                                # distance threshold
                       distance_col = "distance")                    # add a column for the distance

You may need to adjust the distance threshold (max_dist) so that it captures as many true matches and as few false matches as possible.
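
If you want exactly one best match per advertised brand together with a similarity score, here is a minimal sketch using only base R's adist() (Levenshtein distance). It is not part of fuzzyjoin. The normalisation by the longer string's length and the names d, max_len, sim, best and result are my own assumptions for illustration, and I trimmed Sold to the two columns used here.

```r
# Example data from the question (Sold trimmed to the columns used here)
Advertised <- data.frame(BrandVariant = c("Crest Cavity Protection",
                                          "Colgate Cavity Protection",
                                          "Pepsodent Clean Mint USA"))
Sold <- data.frame(ID = c(1, 2, 3),
                   Product = c("Colgate 360", "Crest Cavity", "Pepsodent Mint"))

# Levenshtein distance matrix: one row per Advertised entry, one column per Sold entry
d <- adist(Advertised$BrandVariant, Sold$Product, ignore.case = TRUE)

# Assumption: scale each distance by the longer of the two strings,
# so that 1 means identical and 0 means nothing in common
max_len <- outer(nchar(Advertised$BrandVariant), nchar(Sold$Product), pmax)
sim <- 1 - d / max_len

# For each advertised brand, pick the Sold row with the highest similarity
best <- apply(sim, 1, which.max)

result <- data.frame(BrandVariant = Advertised$BrandVariant,
                     ID           = Sold$ID[best],
                     Product      = Sold$Product[best],
                     similarity   = round(sim[cbind(seq_along(best), best)], 2))
print(result)
```

Note that on this toy data "Colgate Cavity Protection" ends up closer to "Crest Cavity" than to "Colgate 360", because the shared word "Cavity" outweighs the brand name. To use several columns from "Sold", one option is to compute a similarity matrix per column in the same way and combine them, for example with pmax(), before picking the best match.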
