Keeping empty groups using summarize()

Hello, I have a tibble with an empty factor level, and I am trying to use summarize() while preserving all factor levels. Normally I would just use group_by() with .drop = FALSE, but is there any way to avoid group_by() and achieve this with .by in summarize()?

library(tidyverse)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

health |> 
  group_by(smoker, .drop = FALSE) |> 
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
)
#> Warning: There were 2 warnings in `summarize()`.
#> The first warning was:
#> ℹ In argument: `min_age = min(age)`.
#> ℹ In group 1: `smoker = yes`.
#> Caused by warning in `min()`:
#> ! no non-missing arguments to min; returning Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes        0      NaN     Inf    -Inf   NA  
#> 2 no         5       60      34      88   21.6

health |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age), .by = "smoker")
#> # A tibble: 1 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 no         5       60      34      88   21.6

Created on 2023-06-18 with reprex v2.0.2


Hi @matthew-ru
Would it work for you to suppress the warnings and tidy up the output, like this?

library(tidyverse)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)
health
#> # A tibble: 5 × 3
#>   name    smoker   age
#>   <chr>   <fct>  <dbl>
#> 1 Ikaia   no        34
#> 2 Oletta  no        88
#> 3 Leriah  no        75
#> 4 Dashay  no        47
#> 5 Tresaun no        56

suppressWarnings(
  health %>%
    group_by(smoker, .drop = FALSE) %>%
    summarize(
      n = n(),
      mean_age = mean(age),
      min_age = min(age),
      max_age = max(age),
      sd_age = sd(age)
    ) %>%
    # Blank out the age summaries for groups that have no rows
    mutate(across(contains("age"), ~ ifelse(n == 0, NA, .x)))
)
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes        0       NA      NA      NA   NA  
#> 2 no         5       60      34      88   21.6

Created on 2023-06-20 with reprex v2.0.2

Hi @DavoWW thank you for your reply, but I really just wanted to avoid using group_by() at all.

If you want to stay within dplyr/tidyverse, group_by() is the best-practice approach to grouped summarisation.
You could look into alternative data manipulation packages such as data.table, though; you might prefer its approach to grouping.

Just for fun, I put together an arbitrarily convoluted way to achieve the same result while avoiding group_by(). I also went further and altered mean/min/max/sd so that they return 0 when called with no data, as happens for smoker = yes.

library(tidyverse)
health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

health |>
  split(~smoker, drop = FALSE) |>
  imap_dfr(\(x, y) {
    summarize(
      x,
      smoker = unique(y),
      n = n(),
      across(
        .cols = age,
        .fns = map(
          list(mean = mean, min = min, max = max, sd = sd),
          function(func) {
            # wrap functions to return 0
            # if they are called without data
            function(x) {
              if (length(x) == 0) {
                return(0L)
              }
              func(x)
            }
          }
        ),
        .names = "{.fn}_{.col}"
      )
    )
  })
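
For comparison, here is a shorter sketch that also avoids group_by(). It is not something proposed earlier in the thread: it assumes you are happy to summarize with .by first (which, as far as I know, has no .drop equivalent and only returns groups present in the data) and then restore the missing factor levels afterwards with tidyr::complete(). The name `result` is just for illustration, and .by needs dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56)
)

result <- health |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age),
    .by = smoker
  ) |>
  # complete() expands a factor column to all of its levels, so the
  # unused "yes" level comes back as a row; `fill` sets its count to 0
  # and the remaining summary columns are left as NA
  complete(smoker, fill = list(n = 0L))

result
```

Because the summary functions are never called on the empty group, this should not trigger the min()/max() warnings, and the `yes` row ends up with n = 0 and NA for the age summaries.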
