Using dplyer to calculate the mean while keeping columns not included in group_by

joels · October 10, 2020, 4:19am

If you add the columns with a constant value to the group_by, then those will be kept. SampleDate and day won't normally be kept in a summary data frame, because we're averaging over them (I'll show a way around this in the second example). In the example below, I needed to update your sample data frame, as it's missing some commas and the year vector has the wrong length. Also, for constant columns, you can use a vector of length one. If you ever do need to repeat a value, you can save some typing with, for example, rep("chemx", 11).

library(tidyverse)
library(lubridate)

monthlymean <- data.frame(
  stringsAsFactors = FALSE,
  AnalyteName = "chemx",
  SampleDate = c("2012-01-09","2012-08-14",
                 "2013-05-21","2014-09-09","2010-12-28","2013-03-28",
                 "2010-10-12","2010-03-14","2010-09-16","2012-01-21","2010-10-30"),
  WBID = 12345,
  ProjectName = "Water",
  day = c(9L, 14L, 21L, 9L, 28L, 28L, 12L, 14L, 16L,21L, 30L),
  month = c(1L, 8L, 5L, 9L, 12L, 3L, 10L, 3L, 9L,1L,10L),
  year=c(2012,2012,2013,2014,2010,2013,2010,2010,2012,2010,2010),
  Result = c(0.23, 0.56, 9.56, 3.67, 2.67, 4.67, 2.56, 3.57, 2.45, 6.92, 2.56)
)

# Get columns with constant values
vars=names(monthlymean)[sapply(monthlymean, function(x) length(unique(x))==1)]

mms <- monthlymean %>% 
  group_by(across(c(vars, month, year, WBID))) %>% 
  summarise(Result = mean(Result, na.rm=TRUE))

mms
#> # A tibble: 10 x 6
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl>
#>  1 chemx       12345 Water           1  2010   6.92
#>  2 chemx       12345 Water           1  2012   0.23
#>  3 chemx       12345 Water           3  2010   3.57
#>  4 chemx       12345 Water           3  2013   4.67
#>  5 chemx       12345 Water           5  2013   9.56
#>  6 chemx       12345 Water           8  2012   0.56
#>  7 chemx       12345 Water           9  2012   2.45
#>  8 chemx       12345 Water           9  2014   3.67
#>  9 chemx       12345 Water          10  2010   2.56
#> 10 chemx       12345 Water          12  2010   2.67

If you do want to carry SampleDate along in the summary data, one option is nesting. We do this by converting SampleDate into a list column within summarise. For example:

mms2 <- monthlymean %>% 
  group_by(across(c(vars, month, year, WBID))) %>% 
  summarise(Result = mean(Result, na.rm=TRUE),
            SampleDate = list(SampleDate))

mms2
#> # A tibble: 10 x 7
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result SampleDate
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl> <list>    
#>  1 chemx       12345 Water           1  2010   6.92 <chr [1]> 
#>  2 chemx       12345 Water           1  2012   0.23 <chr [1]> 
#>  3 chemx       12345 Water           3  2010   3.57 <chr [1]> 
#>  4 chemx       12345 Water           3  2013   4.67 <chr [1]> 
#>  5 chemx       12345 Water           5  2013   9.56 <chr [1]> 
#>  6 chemx       12345 Water           8  2012   0.56 <chr [1]> 
#>  7 chemx       12345 Water           9  2012   2.45 <chr [1]> 
#>  8 chemx       12345 Water           9  2014   3.67 <chr [1]> 
#>  9 chemx       12345 Water          10  2010   2.56 <chr [2]> 
#> 10 chemx       12345 Water          12  2010   2.67 <chr [1]>

You can unnest mm2 to show each SampleDate explicitly. But note that now we're back to the same number of rows as the original data fram, but Result is still the average value, and it's repeated in rows 9 and 10, which are part of the same grouping that we averaged over.

mms2 %>% unnest(SampleDate)
#> # A tibble: 11 x 7
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result SampleDate
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl> <chr>     
#>  1 chemx       12345 Water           1  2010   6.92 2012-01-21
#>  2 chemx       12345 Water           1  2012   0.23 2012-01-09
#>  3 chemx       12345 Water           3  2010   3.57 2010-03-14
#>  4 chemx       12345 Water           3  2013   4.67 2013-03-28
#>  5 chemx       12345 Water           5  2013   9.56 2013-05-21
#>  6 chemx       12345 Water           8  2012   0.56 2012-08-14
#>  7 chemx       12345 Water           9  2012   2.45 2010-09-16
#>  8 chemx       12345 Water           9  2014   3.67 2014-09-09
#>  9 chemx       12345 Water          10  2010   2.56 2010-10-12
#> 10 chemx       12345 Water          10  2010   2.56 2010-10-30
#> 11 chemx       12345 Water          12  2010   2.67 2010-12-28