Can I apply an arbitrary filter within summarize?

daranzolin · May 18, 2023, 9:41pm

Consider the examples below using SQL and the hacksaw package. Is there a more 'canonical' and 'tidy' way to get this result?

library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
library(hacksaw)
library(tidyverse)

# With SQL I can do this:
sqldf("select cyl, 
              avg(mpg) AS mean_mpg_total,
              avg(mpg) filter (where hp > 110) AS mean_mpg_high_hp
      from mtcars 
      group by cyl") 
#>   cyl mean_mpg_total mean_mpg_high_hp
#> 1   4       26.66364             30.4
#> 2   6       19.74286             18.9
#> 3   8       15.10000             15.1

# With hacksaw I can do this:
# (But if feels esoteric)
mtcars |> 
  group_by(cyl) |> 
  eval_split(
    select(everything()),
    filter(hp > 110)
  ) |> 
  map(~summarize(.x, mean_mpg = mean(mpg))) |> 
  reduce(left_join, by = "cyl", suffix = c("_total", "_high_hp"))
#> # A tibble: 3 × 3
#>     cyl mean_mpg_total mean_mpg_high_hp
#>   <dbl>          <dbl>            <dbl>
#> 1     4           26.7             30.4
#> 2     6           19.7             18.9
#> 3     8           15.1             15.1

# Is there a canonical 'tidy' way?
mtcars |> 
  group_by(cyl) |> 
  summarize(
    mean_mpg_total = mean(mpg),
    # what goes here??
  )
#> # A tibble: 3 × 2
#>     cyl mean_mpg_total
#>   <dbl>          <dbl>
#> 1     4           26.7
#> 2     6           19.7
#> 3     8           15.1

^{Created on 2023-05-18 with reprex v2.0.2}

technocrat · May 19, 2023, 2:10am

Earlier, I offered a base version, but blew past the high hp case. Here it is again.

d <- data.frame(
  cyl = c(4,6,8),
  mean_total = c(
    mean(mtcars[which(mtcars$cyl == 4),"mpg"]),
    mean(mtcars[which(mtcars$cyl == 6),"mpg"]),
    mean(mtcars[which(mtcars$cyl == 8),"mpg"])),
  mean_high_hp = c(
    mean(highs[which(highs$cyl == 4),"mpg"]),
    mean(highs[which(highs$cyl == 6),"mpg"]),
    mean(highs[which(highs$cyl == 8),"mpg"]))
  )
d
#>   cyl mean_total mean_high_hp
#> 1   4   26.66364         30.4
#> 2   6   19.74286         18.9
#> 3   8   15.10000         15.1

DavoWW · May 19, 2023, 8:43am

Hi @daranzolin
Your example can be solved by simple sub-setting (or am I missing something?). But if the function you want to apply to the groups is more complex then you can use group_modify(). In this reprex I have stuck with summarise().

suppressPackageStartupMessages(library(tidyverse))
mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean_total = mean(mpg),
            mean_high_hp = mean(mpg[hp > 110]))
#> # A tibble: 3 × 3
#>     cyl mean_total mean_high_hp
#>   <dbl>      <dbl>        <dbl>
#> 1     4       26.7         30.4
#> 2     6       19.7         18.9
#> 3     8       15.1         15.1

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(~ { .x %>% 
    summarise(mean_total = mean(mpg),
              mean_high_hp = mean(mpg[hp > 110])) })
#> # A tibble: 3 × 3
#> # Groups:   cyl [3]
#>     cyl mean_total mean_high_hp
#>   <dbl>      <dbl>        <dbl>
#> 1     4       26.7         30.4
#> 2     6       19.7         18.9
#> 3     8       15.1         15.1

^{Created on 2023-05-19 with reprex v2.0.2}

system · May 26, 2023, 8:43am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.