dplyr joins: multiple = "none"

jdonland · February 3, 2024, 9:09pm

The multiple argument of dplyr joins has four options for handling multiple matches: "all", "any", "first", "last". Why not "none", i.e., when a unique match exists, use it, and when there are multiple, fill whatever columns we were getting with NAs? Is there a straightforward way to get this behaviour? For complex join criteria, it can be hard to determine in advance whether there will be multiple matches for a given row or not.

mattbrown · February 7, 2024, 2:03pm

(Sorry, I misread your requirements and posted an irrelevant answer previously... then panicked and deleted it instead of just editing it)

I can't comment on why there is no "none" option, but you can get this behaviour with a summarise following the join. You can join using the default "all" and then collapse groups with more than 1 row. Targetting the specific columns you want replaced with NAs may be a bit trickier, depending on your data, but something like the following should get you started:

library(dplyr)

df1 <- data.frame(a = 1:5)

df2 <- data.frame(a = c(1, 2, 2, 3, 4, 5, 5),
                b = letters[1:7])

# I split it into two sections for comparison purposes
df3_all <- left_join(df1, df2, 
                     by = "a")

df3_none <- df3_all %>% 
  summarise(across(.cols = everything(),
                   .fns = ~ifelse(n() == 1, .x, NA)),
                   .by = a)

I don't know how complex your joins are, but for simply joining on a few variables, this should work.

system · February 28, 2024, 2:03pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.