`group_by` with OR

queermath · March 1, 2018, 11:08pm

Hi all!

I'm using the tidyverse to aggregate some information about websites which has several different (and unfortunately conflicting) identifiers. I can't share my data, but for example there are three columns: one with a "unique" ID number, one with the website name, and one with the URL--except sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:

data <- data.frame(
    id = c(1, 2, 2, 3),
    name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
    url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

As you can see, none of these individual variables is sufficient to group all of the relevant entries together--and using all three variables in a group_by statement would result in three separate rows that should only be one.

All this to say, is it possible to do a group_by that checks for matches on id OR name OR url? Right now I'm using complicated left_joins but it's loftier than I'd like.

TIA!

rensa · March 2, 2018, 12:54am

Maybe you could use regex to try and boil the name or the url down to something that captures the variations? I'm terrible at regex, but even if you just did something like extracting the naked domain name from the url column, you could something suitable.

kentm · March 2, 2018, 1:35am

Is there a reason you can't group by only URL? It seems like the URL would differentiate different sites regardless of what the ID / name is. You can than generate a new name (and/or ID) based on the URL.

alistaire · March 2, 2018, 1:46am

You could use adist or extended versions in the excellent stringdist to calculate the edit distance between strings, and cluster accordingly. On sizable data where you can't check if the results make sense, it may be a pain, though.

If you've got a correct set to match, it's easier.

queermath · March 2, 2018, 2:15pm

Yea! It's inconsistent whether or not a URL starts with www. or https:// or both or nothing at all; this is the column with the smallest amount of variation, though, so it may be the best to use with regex.

queermath · March 2, 2018, 2:23pm

Yeah, we're working with 100k+ rows of data into about 1300 sites

That said, this package may be really useful for some other stuff I'm working on, so thanks!!

queermath · March 2, 2018, 2:23pm

This may be what I have to do

kentm · March 2, 2018, 2:32pm

Yep- seems like the best option, in my opinion, is to do some data cleaning on one of these columns to get a clean grouping variable and then use that. The URL should be the easiest because you can remove any values such as https://, www., .com and you will be left with the host (e.g. bedbathandbeyond).

rensa · March 3, 2018, 1:29am

urltools::url_parse looks like a good way to do this, @queermath and @kentm Would probably need to pair it with purrr:map to just pull the domain out of a column of URLs!

kentm · March 3, 2018, 2:35am

@rensa Nice! I don't work much with web data but this is a nice package to know about. Thanks for the recommendation.

queermath · March 3, 2018, 5:14am

ooh! i ended up working out 95% of the problem by removing the http(s)://www. from the URLs and hard-coding the rest; will have to see if this brings me 98% of the way there on monday! will also be helpful for future projects, i'm sure.