So let's say I have a function f that returns the min, max, and median of a numerical vector x as a vector of 3 values.
f <- function(x) {
  c(min(x), max(x), median(x))
}
Is there a way to use f in a dplyr::summarise call to get these 3 statistics as independent variables as one would get using the following "full" call?
library(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)
df %>%
  group_by(y) %>%
  summarise(
    min = min(x),
    max = max(x),
    median = median(x)
  )
Does something like this get to your desired output? I altered the function to paste together the three metrics (separated by a comma) and then used the separate() function to put each in its own column.
library(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)
# Collapse the three statistics into one comma-separated string
g <- function(x) {
  paste(min(x), max(x), median(x), sep = ',')
}
df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  # Split the string back into one column per statistic
  separate(metrics, into = c('min', 'max', 'median'), sep = ',')
#> # A tibble: 2 × 4
#>       y min               max              median
#>   <int> <chr>             <chr>            <chr>
#> 1     1 -2.56005244041801 3.33073330557046 0.0382097873805771
#> 2     2 -2.77832551031467 3.09369662832478 0.05072430823386
That could be one option. However, this two-step approach has the disadvantage of coercing numeric values to character, so side effects (such as a loss of precision) are to be expected.
Good point. I wasn't sure if you needed the values just for reporting purposes, but since you need them to stay numeric, another step is required to convert them back to numeric values.
UPDATE: See @KoderKow's convert comment above. Thanks for the tip!
df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  separate(metrics, into = c('min', 'max', 'median'), sep = ',') %>%
  mutate(across(.cols = c('min', 'max', 'median'), as.numeric))
#> # A tibble: 2 × 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507
The basic problem remains: the data is coerced from numeric to character. The back-end trickery converts it back to numeric, which is nice, but it does not avoid the initial coercion or the risks that come with it.
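To make the coercion risk concrete, here is a minimal sketch of a round trip through paste() for a single value. It assumes R's default conversion of doubles to character, which keeps at most 15 significant digits; the variable names are only illustrative.

```r
# Illustrative only: send one double through character and back
x <- 1 / 3
s <- paste(x)            # character representation, at most 15 significant digits
back <- as.numeric(s)    # parse the string back into a double

identical(back, x)       # checks whether the round trip was exact
abs(back - x)            # size of the discrepancy, if any
```

Under that assumption the round trip is not expected to be exact for a value like 1/3, even though the difference is tiny. Whether it matters depends on what is done with the numbers afterwards.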
This is another ad hoc way to do it, and it is also closely tailored to the simple reprex I provided.
I now realize that I should have framed my question in its original intent, which is much broader than this simple reprex. My overall goal is to identify a general mechanism for using any kind of "multi-output" function f without creating an ad hoc wrapper function like the one you suggested. One could do that with base R and aggregate, but the output is not ideal either, as it requires a bit of post-processing (see example below). I was hoping that the tidyverse would offer a cleaner framework for this.
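# Named version of f, so each statistic carries its label
f <- function(x) {
  c(min = min(x), max = max(x), median = median(x))
}
res <- aggregate(
  df[, 'x', drop = FALSE],
  by = df[, 'y', drop = FALSE],
  f,
  simplify = FALSE
)
res
dim(res)
# The second column is a list holding one named vector per group
res[, 2]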
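For comparison, here is a sketch of two tidyverse routes that keep the values numeric and reuse the named-vector f unchanged. It assumes dplyr >= 1.0.0 (where an unnamed data-frame result in summarise() is unpacked into columns) and tidyr >= 1.0.0 (for unnest_wider()); res_a and res_b are just illustrative names.

```r
library(dplyr)
library(tidyr)

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

# Same named-vector f as in the aggregate() example
f <- function(x) {
  c(min = min(x), max = max(x), median = median(x))
}

# Option A: store each group's result in a list-column, then widen it.
# The names of the vector returned by f become the column names.
res_a <- df %>%
  group_by(y) %>%
  summarise(stats = list(f(x)), .groups = "drop") %>%
  unnest_wider(stats)

# Option B: turn the named vector into a one-row data frame and pass it
# as an unnamed argument, so summarise() unpacks it into columns
# (assumes dplyr >= 1.0.0).
res_b <- df %>%
  group_by(y) %>%
  summarise(as.data.frame(as.list(f(x))), .groups = "drop")

res_a
res_b
```

Neither route goes through character, and both depend only on f returning a named vector of fixed length, so they should carry over to other "multi-output" functions of that shape.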