I'm using the tidyverse to aggregate some information about websites which has several different (and unfortunately conflicting) identifiers. I can't share my data, but for example there are three columns: one with a "unique" ID number, one with the website name, and one with the URL--except sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:
data <- data.frame(
id = c(1, 2, 2, 3),
name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)
As you can see, none of these individual variables is sufficient to group all of the relevant entries together--and using all three variables in a group_by statement would result in three separate rows that should only be one.
All this to say, is it possible to do a group_by that checks for matches on id OR name OR url? Right now I'm using complicated left_joins but it's loftier than I'd like.
Maybe you could use regex to try and boil the name or the url down to something that captures the variations? I'm terrible at regex, but even if you just did something like extracting the naked domain name from the url column, you could something suitable.
Is there a reason you can't group by only URL? It seems like the URL would differentiate different sites regardless of what the ID / name is. You can than generate a new name (and/or ID) based on the URL.
You could use adist or extended versions in the excellent stringdist to calculate the edit distance between strings, and cluster accordingly. On sizable data where you can't check if the results make sense, it may be a pain, though.
If you've got a correct set to match, it's easier.
Yea! It's inconsistent whether or not a URL starts with www. or https:// or both or nothing at all; this is the column with the smallest amount of variation, though, so it may be the best to use with regex.
Yep- seems like the best option, in my opinion, is to do some data cleaning on one of these columns to get a clean grouping variable and then use that. The URL should be the easiest because you can remove any values such as https://, www., .com and you will be left with the host (e.g. bedbathandbeyond).
ooh! i ended up working out 95% of the problem by removing the http(s)://www. from the URLs and hard-coding the rest; will have to see if this brings me 98% of the way there on monday! will also be helpful for future projects, i'm sure.