Programming with `join_by`

I'm trying to write a function that performs a join of two user-supplied data sets. I don't know for sure what the by-variables will be or even how many the user will want.

I do know that in both data sets the variables to be joined on will have the same names. However, I can't just have my function do a natural join because there could be other variables that would end up being joined on as well.

I've seen online how to use {{ to include a captured name in join_by, but I haven't seen anywhere how to do this if the user had multiple by-variables.

There are some examples online of passing variable names through the dots, capturing those, and then using the old by= syntax to specify the join variables. But in my case I'm using the dots to pass arguments to another function.

Here's a short example showing my issue:

# Note: dots are used to forward extra arguments on to anti_join
my_join = function(dt1, dt2, join_vars, ...){
  dplyr::anti_join(dt1, dt2, by = dplyr::join_by({{join_vars}}), ...)
}

dt1 = tibble::as_tibble(mtcars, rownames = "model")
dt2 = dplyr::filter(dt1, stringr::str_detect(model, "^Merc"))

# specifying a single column works fine
my_join(dt1, dt2, join_vars = model)
#> # A tibble: 25 × 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6   160   110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
#>  8 Cadillac F…  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#>  9 Lincoln Co…  10.4     8   460   215  3     5.42  17.8     0     0     3     4
#> 10 Chrysler I…  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> # ℹ 15 more rows

# specifying multiple columns doesn't work at all
my_join(dt1, dt2, join_vars = c(model, carb))
#> Error in `dplyr::join_by()`:
#> ! Expressions must use one of: `==`, `>=`, `>`, `<=`, `<`, `closest()`,
#>   `between()`, `overlaps()`, or `within()`.
#> ℹ Expression 1 is `c(model, carb)`.

Created on 2024-04-11 with reprex v2.1.0

do you need to use Non-Standard Evaluation ? it must be easier for you to solve your requirement if model and carb were strings rather than symbols

That's a good thought. Obviously, my actual function is more complex than the mtcars example and makes use of NSE elsewhere than the join. And this is part of a larger set of functions that all use various levels of NSE and interface with dplyr and ggplot2, etc. So, I'd prefer a consistent "look and feel". But I'd certainly consider the effort of re-imagining things to use strings vs. struggle to make the NSE work if that's what's needed.

If you are willing to use strings for variable names, the following seems to work.

my_join = function(dt1, dt2, join_vars, ...){
  dplyr::anti_join(dt1, dt2, by = join_vars, ...)
}
my_join(dt1, dt2, "model")
my_join(dt1, dt2, c("model", "carb"))

Another alternative is:

my_join = function(dt1, dt2, ...){
  dplyr::anti_join(dt1, dt2, by = join_by(...))
}
my_join(dt1, dt2, model)
my_join(dt1, dt2,  model, carb)

Thanks for supplying code for the string implementation of this function, and for an option using dots. Definitely both valid options to consider.

I'd prefer to stick to the NSE version if possible, in order to maintain a consistent "look and feel" with other functions and because I'm using NSE elsewhere in my actual use-case. And the dots, unfortunately, aren't an option as they're being use to pass arguments to a function call elsewhere within my function.

I've written a version of the function that does what I want and uses NSE, I'd welcome feedback.

my_join2 <- function(dt1, dt2, join_vars, ...) {
  join_vars <- rlang::enexpr(join_vars)
  jn_expr <- if (is.symbol(join_vars)) {
    rlang::expr(dplyr::join_by(!!join_vars))
  } else if (is.call(join_vars) && deparse(join_vars[[1]]) == "c") {
    rlang::expr(dplyr::join_by(!!!{
      as.list(join_vars)[-1]
    }))
  } else {
    stop("Invalid join_vars: Expected either a single variable (symbol) or a call to 'c()' with multiple variables.")
  }

  dplyr::anti_join(dt1, dt2, by = rlang::eval_tidy(jn_expr), ...)
}

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.