How do I merge two columns exactly and fuzzy merge one column?

glittergirl · May 21, 2023, 2:30pm

I currently have two datasets that I would like to merge:

year <- c("2002", "2002", "1999", "1999", "1997", "2002", "2005", "2008")
state <- c("TN", "TN", "AL", "AL", "CA", "TN", "NY", "NC")
name <- c("Brass; Smith", "Joe", "Christopher", "Bob; Holland", "Wayne", NA, "Joseph A. Freeland", "")

df1 <- data.frame(year, state, name)

year <- c("2002", "2002", "2005")
state <- c("TN", "TN", "NY")
versus <- c("Carl Brass", "Joe", "Freeland")
color <- c("Blue", "Red", "Yellow")

df2 <- data.frame(year, state, versus)

I want to combine year and state exactly. I then want to match name and versus in df1 and df2 respectively. I also want the fuzzy match to occur before the ;.

I want an output that looks like this:

year <- c("2002", "2002", "1999", "1999", "1997", "2002", "2005", "2008")
state <- c("TN", "TN", "AL", "AL", "CA", "TN", "NY", "NC")
name <- c("Brass; Smith", "Joe", "Christopher", "Bob; Holland", "Wayne", NA, "Joseph A. Freeland", "")
versus <- c("Carl Brass", "Joe", NA, NA, NA, NA, "Freeland", NA)
color <- c("Blue", "Red", NA, NA, NA, NA, "Yellow", NA)

df3 <- data.frame(year, state, name, versus, color)

I tried the following:

f <- function(n,v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

df4 <- left_join(df1, df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(f(name, versus), name, NA_character_))

But I keep getting this error message:

Error in `mutate()`:
! Problem while computing `versus = if_else(f(name, versus), name, NA_character_)`.
ℹ The error occurred in row 8.
Caused by error in `sum()`:
! invalid 'type' (list) of argument

I'm not sure what is going on and any help would be appreciated!

AlexisW · May 23, 2023, 1:22am

You can look at row 8 of the joined dataframe, and see the error happens because name = NA, and sapply() fails when wrds = character(0).

You can solve that simply in f() by first checking whether n is NA. You will also have the same problem if n = "" (which is the case later), so you can check for both of these possibilities:

f <- function(n,v) {
  if(is.na(n) || n == "") return(FALSE)
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

yifanliu · May 23, 2023, 9:14am

I can understand that you want to perform a fuzzy match test on the df1 and df2 full-joined data frame. But the generated full-joined data frame has many unexpected rows, they are because df1 and df2 has many common "2002 TN" combinations, and they cannot be properly removed according to your peocess. Based on that, I recommend pre-test the fuzzy matching columns in df1 and then do the join with df2.

My recommended process is as follows. Slightly different from your code, I transformed the names in df1 into regex pattern directly to detect its fuzzy existence in df2.

f <- function(n, v) {
  wrds <- stringr::str_replace_all(n, "[^\\w\\b]+", "|")
  mtrx <- sapply(wrds, grep, x = v, perl = T, value = T)
  lst <- sapply(mtrx, \(x) {x[is.na(x)] <- ""; paste(x, collapse = "")})
  return(c(lst))
}

df1 %>% mutate(
  name = na_if(name,""),
  versus = f(name, df2$versus)
) %>% left_join(df2)

the results are pretty close to the df3:

Joining with `by = join_by(year, state, versus)`
  year state               name     versus  color
1 2002    TN       Brass; Smith Carl Brass   Blue
2 2002    TN                Joe        Joe    Red
3 1999    AL        Christopher              <NA>
4 1999    AL       Bob; Holland              <NA>
5 1997    CA              Wayne              <NA>
6 2002    TN               <NA>              <NA>
7 2005    NY Joseph A. Freeland   Freeland Yellow
8 2008    NC               <NA>              <NA>

system · June 13, 2023, 9:14am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.