Hello, I have a tibble with an empty factor level, and I am trying to use summarize() while preserving all factor levels. Normally I would just use group_by() with .drop = FALSE, but is there any way to achieve this with summarize() and .by, avoiding group_by()?
library(tidyverse)
health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34, 88, 75, 47, 56),
)
health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
#> Warning: There were 2 warnings in `summarize()`.
#> The first warning was:
#> ℹ In argument: `min_age = min(age)`.
#> ℹ In group 1: `smoker = yes`.
#> Caused by warning in `min()`:
#> ! no non-missing arguments to min; returning Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> # A tibble: 2 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 yes        0      NaN     Inf    -Inf   NA
#> 2 no         5       60      34      88   21.6
health |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age),
    .by = "smoker"
  )
#> # A tibble: 1 × 6
#>   smoker     n mean_age min_age max_age sd_age
#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
#> 1 no         5       60      34      88   21.6
If you want to use dplyr/tidyverse, group_by() is the best-practice approach to ... grouped-type summarisation.
You could also look into an alternative data manipulation package such as data.table; perhaps you would prefer its approach to grouping.
Just for fun, I made an arbitrarily convoluted way to achieve the same result while avoiding group_by(). I went further and altered the mean/min/max/sd functions so that they return 0 when called with no data, as is the case for smoker = yes.
library(tidyverse)
health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34, 88, 75, 47, 56),
)
health |>
  # split() keeps the empty "yes" level because drop = FALSE
  split(~smoker, drop = FALSE) |>
  # x is the tibble for one level, y is that level's name (a character string)
  imap_dfr(\(x, y)
    summarize(
      x,
      smoker = unique(y),
      n = n(),
      across(
        .cols = age,
        .fns = map(
          list(mean = mean, min = min, max = max, sd = sd),
          function(func) {
            # wrap functions to return 0
            # if they are called without data
            function(x) {
              if (length(x) == 0) {
                return(0L)
              }
              func(x)
            }
          }
        ),
        .names = "{.fn}_{.col}"
      )
    ))
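For comparison, here is a less convoluted sketch that also avoids group_by(). It is not from the code above; it is one possible approach, assuming you are happy to summarize with .by first and then add the missing level back with tidyr::complete(), which expands a factor column to its full set of levels. The name `result` is just illustrative. Note that the empty level gets NA for the age statistics here (rather than NaN/Inf/-Inf, or the 0 used above), and n is filled with 0 explicitly.

```r
library(dplyr)
library(tidyr)

health <- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34, 88, 75, 47, 56)
)

result <- health |>
  # .by drops the empty "yes" level, so only "no" is summarized here
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age),
    .by = smoker
  ) |>
  # complete() adds a row for every unused factor level;
  # n is filled with 0, the other statistics are left as NA
  complete(smoker, fill = list(n = 0L))

result
```

Because min() and max() are never called on an empty vector, this version also avoids the "no non-missing arguments" warnings shown in the group_by() output.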