How to simplify dplyr::summarise calls

pomchip · January 3, 2023, 5:22pm

Hi,

So let's say I have a function f that returns the min, max, and median of a numerical vector x as a vector of 3 values.

f <- function(x) {
  c( min(x), max(x), median(x) )
}

Is there a way to use f in a dplyr::summarise call to get these 3 statistics as independent variables as one would get using the following "full" call?

require(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)
df %>% 
  group_by(y) %>% 
  summarise(
    min = min(x),
    max = max(x),
    median = median(x)
  )

scottyd22 · January 3, 2023, 5:44pm

Does something like this get to your desired output? I altered the function to paste together the three metrics (separated by a comma) and then used the separate() function to put each in its own column.

require(tidyverse)
set.seed(12345)

df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

g = function(x) {
  paste(min(x), max(x), median(x), sep = ',')
}

df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  separate(metrics, into = c('min', 'max', 'median'), sep = ',')
#> # A tibble: 2 × 4
#>       y min               max              median            
#>   <int> <chr>             <chr>            <chr>             
#> 1     1 -2.56005244041801 3.33073330557046 0.0382097873805771
#> 2     2 -2.77832551031467 3.09369662832478 0.05072430823386

Created on 2023-01-03 with reprex v2.0.2.9000

pomchip · January 3, 2023, 6:20pm

Thanks @scottyd22

That could be one option... however, this 2-step approach has the disadvantage of coercing numeric values to character. Side-effects are to be expected.

KoderKow · January 3, 2023, 6:23pm

Would it work to put the summarise() call inside the function f, rather than calling f within summarise()?

require(tidyverse)
#> Loading required package: tidyverse

set.seed(12345)

df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

f <- function(.data, var) {
  d <- 
    .data %>% 
    summarise(
      min = min({{ var }}),
      max = max({{ var }}),
      median = median({{ var }}),
      .groups = "drop"
    )
  
  return(d)
}

df %>% 
  dplyr::group_by(y) %>% 
  f(x)
#> # A tibble: 2 x 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507

Created on 2023-01-03 by the reprex package (v2.0.1)

Edit: Modified the function to take a wanted column.

KoderKow · January 3, 2023, 6:24pm

You can add convert = TRUE to separate() in @scottyd22's solution to convert the character columns back to numeric automatically.
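
As a rough sketch of what that might look like, reusing the df and g() from the earlier post (the name res is only there for illustration), convert = TRUE asks separate() to run type conversion on the new columns:

```r
library(dplyr)
library(tidyr)

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

# Same helper as in the earlier post: pastes the three statistics into one string
g <- function(x) {
  paste(min(x), max(x), median(x), sep = ",")
}

res <- df %>%
  group_by(y) %>%
  summarise(metrics = g(x), .groups = "drop") %>%
  # convert = TRUE converts the split character columns to a suitable type
  separate(metrics, into = c("min", "max", "median"), sep = ",", convert = TRUE)

sapply(res, class)
```

The values still pass through a character representation on the way, so this only tidies up the column types at the end.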

scottyd22 · January 3, 2023, 6:27pm

Good point. I wasn't sure if you needed the values just for reporting purposes, but since you need them to stay numeric, then another step would be required to convert them back to numeric values.

UPDATE: See @KoderKow's convert comment above. Thanks for the tip!

df %>%
  group_by(y) %>%
  summarise(
    metrics = g(x),
    .groups = 'drop'
  ) %>%
  separate(metrics, into = c('min', 'max', 'median'), sep = ',') %>%
  mutate(across(.cols = c('min', 'max', 'median'), as.numeric))
#> # A tibble: 2 × 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507

pomchip · January 3, 2023, 7:09pm

The basic problem remains: the data is being coerced from numeric to character... The back-end trickery just converts it back to numeric... which is nice, but it does not avoid the initial coercion and the risks that come with it.
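
To make the risk concrete, here is a minimal base R sketch (the variable names are just for illustration). paste() writes a double with at most 15 significant digits, so the round trip through character does not necessarily give back the same value:

```r
x <- 1 / 3

# paste() goes through as.character(), which keeps at most 15 significant digits
s <- paste(x)

# Converting back yields a number that is close to, but not the same as, x
back <- as.numeric(s)

identical(back, x)
abs(back - x)
```

The difference is tiny here, but it is enough to break exact comparisons or joins on the converted values.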

pomchip · January 3, 2023, 7:24pm

This is another ad-hoc way to do this, which is also very tailored to the simple reprex I provided.

I now realize that I should have framed my question around its original intent, which is much broader than this simple reprex. My overall goal is to identify a general mechanism for using any kind of "multi-output" function f without creating an ad-hoc wrapper function like the one you suggested. One could do that using base R and aggregate, but the output is not ideal either, as it requires a bit of post-processing (see example below). I was hoping that the tidyverse would offer a cleaner framework for this.

f <- function(x) {
  c( min = min(x), max = max(x), median = median(x) )
}


res <- aggregate(
  df[, 'x', drop = FALSE],
  by = df[, 'y', drop = FALSE],
  f,
  simplify = FALSE
)
res

dim(res)
res[,2]
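
For reference, one way the post-processing could go is sketched below. It assumes the list column produced by simplify = FALSE, and stats and out are names introduced only for this illustration:

```r
f <- function(x) {
  c(min = min(x), max = max(x), median = median(x))
}

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

res <- aggregate(
  df[, "x", drop = FALSE],
  by = df[, "y", drop = FALSE],
  f,
  simplify = FALSE
)

# res$x is a list with one named numeric vector per group;
# stack those vectors into a matrix with one row per group
stats <- do.call(rbind, res$x)

# Bind the grouping column back on to get one column per statistic
out <- cbind(res["y"], stats)
out
```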

michaelbgarcia · January 4, 2023, 3:20am

How about

df %>%
  group_by(y) %>%
  summarise(x = list(f(x)))

nirgrahamuk · January 4, 2023, 9:46am

I endorse michaelbgarcia's approach; it can be extended with unnest_wider()

f <- function(x) {
  c( min = min(x), 
     max = max(x), 
     median = median(x) )
}

require(tidyverse)
set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

df |>
  group_by(y) |>
  summarise(x = list(f(x))) |> 
  unnest_wider(col=x)

#> # A tibble: 2 x 4
#>       y   min   max median
#>   <int> <dbl> <dbl>  <dbl>
#> 1     1 -2.56  3.33 0.0382
#> 2     2 -2.78  3.09 0.0507
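
A possible variation, not taken from the replies above: since dplyr 1.0.0, summarise() unpacks an unnamed data frame result into separate columns, so wrapping f(x) in tibble::as_tibble_row() should give the same shape without a list column. A sketch, assuming f returns a named vector (res is just an illustrative name):

```r
library(dplyr)
library(tibble)

f <- function(x) {
  c(min = min(x), max = max(x), median = median(x))
}

set.seed(12345)
df <- data.frame(
  x = rnorm(1000),
  y = sample(1:2, 1000, replace = TRUE)
)

res <- df %>%
  group_by(y) %>%
  # as_tibble_row() turns the named vector into a one-row tibble;
  # because the argument is unnamed, summarise() spreads it into columns
  summarise(as_tibble_row(f(x)), .groups = "drop")

res
```

Either form keeps the values numeric throughout, so which one to use is mostly a matter of taste.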

pomchip · January 4, 2023, 11:32am

Beautiful!

Thank you @michaelbgarcia and @nirgrahamuk. That's exactly what I was looking for!
