How to simplify dplyr::summarise calls

Hi,

So let's say I have a function f that returns the min, max, and median of a numerical vector x as a vector of 3 values.

f <- function(x) {
  c( min(x), max(x), median(x) )
}

Is there a way to use f in a dplyr::summarise call to get these 3 statistics as independent variables as one would get using the following "full" call?

require(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)
df %>% 
  group_by(y) %>% 
  summarise(
    min = min(x),
    max = max(x),
    median = median(x)
  )

Does something like this get to your desired output? I altered the function to paste together the three metrics (separated by a comma) and then used the separate() function to put each in its own column.

require(tidyverse)
set.seed(12345)

df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

g = function(x) {
  paste(min(x), max(x), median(x), sep = ',')
}

df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  separate(metrics, into = c('min', 'max', 'median'), sep = ',')
#> # A tibble: 2 × 4
#>       y min               max              median            
#>   <int> <chr>             <chr>            <chr>             
#> 1     1 -2.56005244041801 3.33073330557046 0.0382097873805771
#> 2     2 -2.77832551031467 3.09369662832478 0.05072430823386

Created on 2023-01-03 with reprex v2.0.2.9000

Thanks @scottyd22

That could be one option. However, this two-step approach has the disadvantage of coercing numeric values to character, so side effects (such as loss of precision) are to be expected.

Would it be feasible to put the summarise() call inside the function f, instead of calling f within summarise()?

require(tidyverse)
#> Loading required package: tidyverse

set.seed(12345)

df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

f <- function(.data, var) {
  d <- 
    .data %>% 
    summarise(
      min = min({{ var }}),
      max = max({{ var }}),
      median = median({{ var }}),
      .groups = "drop"
    )
  
  return(d)
}

df %>% 
  dplyr::group_by(y) %>% 
  f(x)
#> # A tibble: 2 x 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507

Created on 2023-01-03 by the reprex package (v2.0.1)

Edit: Modified the function to take the desired column as an argument.

You can add convert = TRUE to separate() in @scottyd22's solution to convert character to numeric automatically.
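
A minimal sketch of that suggestion, reusing g() and the example data from the earlier reply. The name res_convert is only illustrative. The values still pass through character on the way, so this addresses the column type rather than the coercion itself.

```r
library(dplyr)
library(tidyr)

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

# g() pastes the three statistics into one comma-separated string
g <- function(x) {
  paste(min(x), max(x), median(x), sep = ',')
}

res_convert <- df %>%
  group_by(y) %>%
  summarise(metrics = g(x), .groups = 'drop') %>%
  # convert = TRUE asks separate() to type-convert the new columns
  separate(metrics, into = c('min', 'max', 'median'), sep = ',', convert = TRUE)

# check the resulting column types
sapply(res_convert, class)
```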

Good point. I wasn't sure if you needed the values just for reporting purposes, but since you need them to stay numeric, then another step would be required to convert them back to numeric values.

UPDATE: See @KoderKow's convert = TRUE comment above. Thanks for the tip!

df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  separate(metrics, into = c('min', 'max', 'median'), sep = ',') %>%
  mutate(across(.cols = c('min', 'max', 'median'), as.numeric))
#> # A tibble: 2 × 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507

The basic problem remains: data is being coerced from numeric to character... The back-end trickery just returns the data back to numeric class... which is nice but does not avoid the initial coercion and the potential associated risks.
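
To illustrate the kind of risk meant here, a small sketch (the variable names are only illustrative): paste() formats a double with about 15 significant digits, so a round trip through character does not necessarily give back the same value.

```r
x <- 1 / 3

# paste() goes through as.character(), which keeps about 15 significant digits
as_text <- paste(x)
round_trip <- as.numeric(as_text)

# expected FALSE: the value parsed back differs slightly from the original
identical(round_trip, x)

# size of the discrepancy introduced by the round trip
abs(round_trip - x)
```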

This is another ad-hoc way to do this, which is also very tailored to the simple reprex I provided.

I now realize that I should have framed my question in terms of its original intent, which is much broader than this simple reprex. My overall goal is to identify a general mechanism for using any kind of "multi-output" function f without creating an ad-hoc wrapper function like the one you suggested. One could do that using base R and aggregate(), but the output is not ideal either, as it requires a bit of post-processing (see example below). I was hoping that the tidyverse would offer a cleaner framework for this.

f <- function(x) {
  c( min = min(x), max = max(x), median = median(x) )
}


res <- aggregate(
  df[, 'x', drop = FALSE],
  by = df[, 'y', drop = FALSE],
  f,
  simplify = FALSE
)
res

dim(res)
res[,2]
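
For reference, the post-processing mentioned above could look something like the following sketch. It assumes, as in the call above, that res$x is a list of named numeric vectors; stats_mat and tidy_res are illustrative names.

```r
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

f <- function(x) {
  c( min = min(x), max = max(x), median = median(x) )
}

res <- aggregate(
  df[, 'x', drop = FALSE],
  by = df[, 'y', drop = FALSE],
  f,
  simplify = FALSE
)

# stack the per-group named vectors into a matrix, one row per group
stats_mat <- do.call(rbind, res$x)

# bind the grouping column back on to get one column per statistic
tidy_res <- cbind(res['y'], stats_mat)
tidy_res
```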

How about

df %>%
  group_by(y) %>%
  summarise(x = list(f(x)))

I endorse michaelbgarcia's approach; it can be extended with unnest_wider()

f <- function(x) {
  c( min = min(x), 
     max = max(x), 
     median = median(x) )
}

require(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

df |>
  group_by(y) |>
  summarise(x = list(f(x))) |> 
  unnest_wider(col=x)
#> # A tibble: 2 x 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507
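
A possible variation on the same idea, offered as a sketch rather than as part of the answers above: since dplyr 1.0.0, summarise() expands an unnamed expression that returns a data frame into several columns. Assuming f() returns a named vector, tibble::as_tibble_row() can turn it into a one-row tibble, so no list column or unnest step is needed. The name res_row is only illustrative.

```r
library(dplyr)
library(tibble)

f <- function(x) {
  c( min = min(x), max = max(x), median = median(x) )
}

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

res_row <- df |>
  group_by(y) |>
  # unnamed data-frame result: each element of f(x) becomes its own column
  summarise(as_tibble_row(f(x)), .groups = "drop")

res_row
```

The values stay numeric throughout, and f() itself is left unchanged.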

Beautiful!

Thank you @michaelbgarcia and @nirgrahamuk. That's exactly what I was looking for!
