When to use c_across() instead of across()?

siddharthprabhu · July 31, 2020, 8:59am

There is a key difference in how these two functions take their input: sum() accepts any number of arguments through ..., while sd() takes a single vector x (as does mean()).

args(sum)
#> function (..., na.rm = FALSE) 
#> NULL
args(sd)
#> function (x, na.rm = FALSE) 
#> NULL

Created on 2020-07-31 by the reprex package (v0.3.0)

I think the reason one works but not the other has to do with what across() and c_across() return. Since across() is designed for column-wise transformations, it returns the transformed variables in a list (a tibble), which is then spliced into the result (ref: lines 112 to 134 in across.R). c_across() doesn't need this step, because it combines the selected columns into a single vector.
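To check this, here is a small sketch that inspects what each helper returns for each row. The data frame and the column names `across_is_df` and `c_across_is_dbl` are made up for illustration, and I'm assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Tiny illustrative data frame (values chosen arbitrarily)
df <- tibble(id = 1:2, w = c(1, 2), x = c(3, 4))

# For each row, record the kind of object each helper returns
res <- df %>%
  rowwise() %>%
  summarise(
    across_is_df    = is.data.frame(across(w:x)),  # across() gives a one-row tibble
    c_across_is_dbl = is.double(c_across(w:x))     # c_across() gives a plain numeric vector
  )

res
```

If both columns come back TRUE, that fits the explanation above: across() gives sd() a data frame (a list), while c_across() gives it an ordinary numeric vector.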

This can also be seen in the error message generated when using across() with sd().

library(tidyverse)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))

df %>%
  mutate(
    sd  = sd(across(w:z))
  )
#> Error: Problem with `mutate()` input `sd`.
#> x 'list' object cannot be coerced to type 'double'
#> i Input `sd` is `sd(across(w:z))`.

Created on 2020-07-31 by the reprex package (v0.3.0)

This makes sense if across() returns a list, since sd() expects a numeric vector. I would stick to c_across() when selecting columns for a function like sd(), to avoid running into this type of error.
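For reference, a minimal sketch of the c_across() version, assuming the goal is one standard deviation per row across w to z (so the data frame is made row-wise first). I've named the new column `row_sd`, which is my own choice, to keep it distinct from the sd() function.

```r
library(dplyr)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))

out <- df %>%
  rowwise() %>%                          # evaluate the expression once per row
  mutate(row_sd = sd(c_across(w:z))) %>% # c_across() passes sd() a numeric vector
  ungroup()                              # drop the row-wise grouping afterwards

out
```

Without rowwise(), c_across(w:z) would combine all values of the four columns into one vector, so every row would get the same overall standard deviation.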

Disclaimer: I'm stretching my knowledge of the tidyverse here so I can't say for sure whether this reasoning is right. Just trying to work it out as best as I can.