Tricky behavior with grouped summarize sum for groups with only NAs

Hi,

I know that sum(NA, na.rm = TRUE) results in 0. However, I wouldn't necessarily want this 0 to appear in a summarise() output. Please see the minimal example below.

tbl <- tibble(id = c(1,2), values = c(NA, 5))

tbl %>% group_by(id) %>% summarise(tot = sum(values, na.rm = TRUE))
#> # A tibble: 2 × 2
#>      id   tot
#>   <dbl> <dbl>
#> 1     1     0
#> 2     2     5

Subsequent calculations using the tot column would be wrong (e.g., mean of tot). I know this is a symptom of the sum() function, but was wondering if there is a "safer" way of doing such summarize? I suppose one approach is an initial test, e.g.,

tbl %>% group_by(id) %>% summarise(tot = if_else(all(is.na(values)), NA, sum(values, na.rm = TRUE)))

Are there better ways to avoid this potential pitfall? Thank you in advance for any feedback and guidance.

The mean function has an na.rm option as well.

> tbl <- tibble(id = c(1,2), values = c(NA, 5))
> 
> tbl %>%
>   group_by(id) %>% summarise(
>     tot_sum = sum(values, na.rm = TRUE),
>     tot_mean = mean(values, na.rm = TRUE)
>   )

# A tibble: 2 x 3
     id tot_sum tot_mean
  <dbl>   <dbl>    <dbl>
1     1       0      NaN
2     2       5        5

for accurate downstream calculations, simple use the NA , as there purpose of existence is to work in the way you would require.

Displaying nicely formatted results for users should happen downstream of your calculations; you can decide on appropriate way to print the NA info depending on that.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.