Using dplyer to calculate the mean while keeping columns not included in group_by

Micaela · October 9, 2020, 9:31pm

I am trying to calculate the mean of my results column based on month, year, and WBID. There are several other columns in my dataset that I need to maintain so that I may row bind my aggregated dataset to another dataset with the same columns. Here is an example of my dataset:

library(dplyr)
library(lubridate)

# Sample data on a copy/paste friendly format
monthlymean <- data.frame(
  stringsAsFactors = FALSE,
      AnalyteName = c("chemx","chemx","chemx",
                       "chemx","chemx","chemx","chemx","chemx","chemx","chemx","chemx"),
        SampleDate = c("2012-01-09","2012-08-14",
                       "2013-05-21","2014-09-09","2010-12-28","2013-03-28",
                       "2010-10-12","2010-03-14","2010-09-16","2012-01-21","2010-10-30"),
        WBID = c(12345, 
            12345,12345,12345,12345,12345,12345,12345,12345,12345,12345),
       ProjectName = c("Water", "Water", "Water", "Water", "Water", "Water", "Water", 
        "Water", "Water", "Water", "Water")
               day = c(9L, 14L, 21L, 9L, 28L, 28L, 12L, 14L, 16L,21L, 30L),
             month = c(1L, 8L, 5L, 9L, 12L, 3L, 10L, 3L, 9L,1L,10L)
             year=c(2012,2012,2013,2014,2010,2013,2010,2010,2012,2010),
        Result = c(0.23, 0.56, 9.56, 3.67, 2.67, 4.67, 2.56, 3.57, 2.45, 6.92, 2.56)
)

I have tried the following code, but it only returns a results column along with the other variables included in group_by()

monthlymean <- monthlymean %>% 
  select(AnalyteName, SampleDate, WBID, ProjectName, Result, day, month, year) %>%
  group_by(month, year, WBID) %>% 
  summarise_at(vars(Result), mean, na.rm=TRUE)

For this example I should end up with a dataset with 9 rows and 8 columns.

siddharthprabhu · October 10, 2020, 3:51am

To end up with 9 rows while keeping all the columns, you'll need to discard two observations. Which ones should be removed?

joels · October 10, 2020, 4:19am

If you add the columns with a constant value to the group_by, then those will be kept. SampleDate and day won't normally be kept in a summary data frame, because we're averaging over them (I'll show a way around this in the second example). In the example below, I needed to update your sample data frame, as it's missing some commas and the year vector has the wrong length. Also, for constant columns, you can use a vector of length one. If you ever do need to repeat a value, you can save some typing with, for example, rep("chemx", 11).

library(tidyverse)
library(lubridate)

monthlymean <- data.frame(
  stringsAsFactors = FALSE,
  AnalyteName = "chemx",
  SampleDate = c("2012-01-09","2012-08-14",
                 "2013-05-21","2014-09-09","2010-12-28","2013-03-28",
                 "2010-10-12","2010-03-14","2010-09-16","2012-01-21","2010-10-30"),
  WBID = 12345,
  ProjectName = "Water",
  day = c(9L, 14L, 21L, 9L, 28L, 28L, 12L, 14L, 16L,21L, 30L),
  month = c(1L, 8L, 5L, 9L, 12L, 3L, 10L, 3L, 9L,1L,10L),
  year=c(2012,2012,2013,2014,2010,2013,2010,2010,2012,2010,2010),
  Result = c(0.23, 0.56, 9.56, 3.67, 2.67, 4.67, 2.56, 3.57, 2.45, 6.92, 2.56)
)

# Get columns with constant values
vars=names(monthlymean)[sapply(monthlymean, function(x) length(unique(x))==1)]

mms <- monthlymean %>% 
  group_by(across(c(vars, month, year, WBID))) %>% 
  summarise(Result = mean(Result, na.rm=TRUE))

mms
#> # A tibble: 10 x 6
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl>
#>  1 chemx       12345 Water           1  2010   6.92
#>  2 chemx       12345 Water           1  2012   0.23
#>  3 chemx       12345 Water           3  2010   3.57
#>  4 chemx       12345 Water           3  2013   4.67
#>  5 chemx       12345 Water           5  2013   9.56
#>  6 chemx       12345 Water           8  2012   0.56
#>  7 chemx       12345 Water           9  2012   2.45
#>  8 chemx       12345 Water           9  2014   3.67
#>  9 chemx       12345 Water          10  2010   2.56
#> 10 chemx       12345 Water          12  2010   2.67

If you do want to carry SampleDate along in the summary data, one option is nesting. We do this by converting SampleDate into a list column within summarise. For example:

mms2 <- monthlymean %>% 
  group_by(across(c(vars, month, year, WBID))) %>% 
  summarise(Result = mean(Result, na.rm=TRUE),
            SampleDate = list(SampleDate))

mms2
#> # A tibble: 10 x 7
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result SampleDate
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl> <list>    
#>  1 chemx       12345 Water           1  2010   6.92 <chr [1]> 
#>  2 chemx       12345 Water           1  2012   0.23 <chr [1]> 
#>  3 chemx       12345 Water           3  2010   3.57 <chr [1]> 
#>  4 chemx       12345 Water           3  2013   4.67 <chr [1]> 
#>  5 chemx       12345 Water           5  2013   9.56 <chr [1]> 
#>  6 chemx       12345 Water           8  2012   0.56 <chr [1]> 
#>  7 chemx       12345 Water           9  2012   2.45 <chr [1]> 
#>  8 chemx       12345 Water           9  2014   3.67 <chr [1]> 
#>  9 chemx       12345 Water          10  2010   2.56 <chr [2]> 
#> 10 chemx       12345 Water          12  2010   2.67 <chr [1]>

You can unnest mm2 to show each SampleDate explicitly. But note that now we're back to the same number of rows as the original data fram, but Result is still the average value, and it's repeated in rows 9 and 10, which are part of the same grouping that we averaged over.

mms2 %>% unnest(SampleDate)
#> # A tibble: 11 x 7
#> # Groups:   AnalyteName, WBID, ProjectName, month [7]
#>    AnalyteName  WBID ProjectName month  year Result SampleDate
#>    <chr>       <dbl> <chr>       <int> <dbl>  <dbl> <chr>     
#>  1 chemx       12345 Water           1  2010   6.92 2012-01-21
#>  2 chemx       12345 Water           1  2012   0.23 2012-01-09
#>  3 chemx       12345 Water           3  2010   3.57 2010-03-14
#>  4 chemx       12345 Water           3  2013   4.67 2013-03-28
#>  5 chemx       12345 Water           5  2013   9.56 2013-05-21
#>  6 chemx       12345 Water           8  2012   0.56 2012-08-14
#>  7 chemx       12345 Water           9  2012   2.45 2010-09-16
#>  8 chemx       12345 Water           9  2014   3.67 2014-09-09
#>  9 chemx       12345 Water          10  2010   2.56 2010-10-12
#> 10 chemx       12345 Water          10  2010   2.56 2010-10-30
#> 11 chemx       12345 Water          12  2010   2.67 2010-12-28

Micaela · October 12, 2020, 6:30pm

None should be "discarded" but the ones with the same month and year should be averaged. Since the WBID will be the same for the entries that are averaged, the only column with different data should be the SampleDate, which can be given as one of the dates or as an arbitrary date at the first or last day of the month and year that the two entries are averaged within.

Micaela · October 12, 2020, 6:40pm

Before attaching this data frame to the other one I am using I nullify month, year, and day as I added them for assessing the monthly mean so it is no problem if the day column is lost. As for sample date, the date itself does not really matter. It could be one of the two dates, or even the first or last date of the month over which the results are being averaged. I don't think the nested version will work for me as the point of this code is to count exceedances in water quality objectives after assessing them over their averaging periods. Having both rows with repeated data would result in two counts if the result exceeded my objective value.

joels · October 12, 2020, 7:10pm

You can count exceedances in the same summary operation:

cutoff = 5.0

mms <- monthlymean %>% 
  group_by(across(c(vars, month, year, WBID))) %>% 
  summarise(Num.exceed = sum(Result > cutoff, na.rm=TRUE),
            Pct.exceed = Num.exceed/sum(!is.na(Result)),
            Result = mean(Result, na.rm=TRUE))

Micaela · October 12, 2020, 8:10pm

I created a data frame that is much simpler than what I am actually working with. The problem about counting exceedances here is that in my actual data frame I have multiple analytes with multiple objective values. These objectives are also site specific, so the waterbody may have multiple objectives for the same waterbody depending on time of year that all need to be summed. My code is already equipped to do this, I just need to figure out how to average the results while holding a placeholder date so I am able to row bind it to my larger dataset that counts the exceedances for all data. Maybe there is a way to assign the first of the month as a date for rows which have been averaged? I am working with code that is over 25000 lines in length, so I am trying to construct this new section to work with the rest of what is already written, and I think this would be the best approach.

joels · October 12, 2020, 8:15pm

It's hard to be certain without seeing your actual data, but from what you're describing, it sounds like a join would be more appropriate (and safer) than rowbinding. If all that matters is the year and month for a given summary row, then you could join the two data frames by year and month as the join keys (plus any other necessary key columns, such as analyte), rather than creating a fake date value.

Micaela · October 12, 2020, 8:18pm

Ok! I'm working on another project at the moment, so it may take me a few days to read up on join and try it out. I will get back to you as soon as I have the time to test your suggestions! Thank you so much for the help!

joels · October 12, 2020, 8:51pm

See the tidyverse help on joins for the basics. Also, be aware that joining data by key columns is a general database/data science concept. This is just the tidyverse implementation. R has other ways of doing joins between data frames (e.g., the base merge function; the data.table package has a variety of joins (more than are available in the tidyverse) as part of its functionality), and other languages have their own join implementations.

Micaela · October 13, 2020, 11:26pm

I just tried running your first chunk of code and get an error concerning the function "across." Do you know if I need another library?

joels · October 13, 2020, 11:30pm

across (and more generally an updated system for "tidy evaluation") was added with dplyr 1.0. If you've already installed the latest version of dplyr, please post the exact code you ran and I'll see if I can figure out what went wrong.

Micaela · October 13, 2020, 11:55pm

Ok, I updated dplyer and it ran fine. It appears that this code only keeps columns for which all rows are constant though. I built a table that only has one analyte with its repeated parameters to simplify, by my actual data frame contains several AnalyteNames with unique WBID and ProjectNames. Thus why I include the WBID in my grouping. I apologize if this was an oversimplification.

joels · October 14, 2020, 12:01am

No problem. If you need further help, just let us know.

Micaela · October 14, 2020, 12:29am

Do you have any ideas on how to maintain the columns with data that is not being used in calculating the mean in a circumstance where all row values are not constant? It is fine if SampleDate and day are lost. This may be a better representation of my data. The resulting data frame should contain 20 rows and 6 columns.

library(tidyverse)
library(lubridate)

monthlymean <- data.frame(
  stringsAsFactors = FALSE,
  AnalyteName = rep("chemx",11), rep("chemb",11)
  SampleDate = c("2012-01-09","2012-08-14",
                 "2013-05-21","2014-09-09","2010-12-28","2013-03-28",
                 "2010-10-12","2010-03-14","2010-09-16","2012-01-21","2010-10-30", "2012-01-09","2012-08-14",
                 "2013-05-21","2014-09-09","2010-12-28","2013-03-28",
                 "2010-10-12","2010-03-14","2010-09-16","2012-01-21","2010-10-30"),
  WBID = rep(12345,11), rep(54321,11)
  ProjectName = rep("Water", 11), rep("Water1",11)
  day = c(9L, 14L, 21L, 9L, 28L, 28L, 12L, 14L, 16L,21L, 30L, 9L, 14L, 21L, 9L, 28L, 28L, 12L, 14L, 16L,21L, 30L),
  month = c(1L, 8L, 5L, 9L, 12L, 3L, 10L, 3L, 9L,1L,10L, 1L, 8L, 5L, 9L, 12L, 3L, 10L, 3L, 9L,1L,10L),
  year=c(2012,2012,2013,2014,2010,2013,2010,2010,2012,2010,2010, 2012,2012,2013,2014,2010,2013,2010,2010,2012,2010,2010),
  Result = c(0.23, 0.56, 9.56, 3.67, 2.67, 4.67, 2.56, 3.57, 2.45, 6.92, 2.56, 0.23, 0.56, 9.56, 3.67, 2.67, 4.67, 2.56, 3.57, 2.45, 6.92, 2.56)
)

[/quote]

joels · October 14, 2020, 12:54am

Any column in group_by will be carried through the summarise calculation.

In your example data, two or more runs of rep need to be wrapped in c() to generate a single vector and avoid an error: c(rep(12345,11), rep(54321,11)). Also, you can shorten this with rep(c(12345, 54321), each=11).

Is something like this what you had in mind:

cutoff = 5

monthlymean %>% 
  group_by(AnalyteName, WBID, ProjectName, month, year) %>% 
  summarise(n.days = sum(!is.na(Result)),
            n.exceed = sum(Result >= cutoff),
            pct.exceed = n.exceed/n.days * 100,
            mean.result=mean(Result, na.rm=TRUE))

   AnalyteName  WBID ProjectName month  year n.days n.exceed pct.exceed mean.result
   <chr>       <dbl> <chr>       <int> <dbl>  <int>    <int>      <dbl>       <dbl>
 1 chemb       54321 Water1          1  2010      1        1        100        6.92
 2 chemb       54321 Water1          1  2012      1        0          0        0.23
 3 chemb       54321 Water1          3  2010      1        0          0        3.57
 4 chemb       54321 Water1          3  2013      1        0          0        4.67
 5 chemb       54321 Water1          5  2013      1        1        100        9.56
 6 chemb       54321 Water1          8  2012      1        0          0        0.56
 7 chemb       54321 Water1          9  2012      1        0          0        2.45
 8 chemb       54321 Water1          9  2014      1        0          0        3.67
 9 chemb       54321 Water1         10  2010      2        0          0        2.56
10 chemb       54321 Water1         12  2010      1        0          0        2.67
11 chemx       12345 Water           1  2010      1        1        100        6.92
12 chemx       12345 Water           1  2012      1        0          0        0.23
13 chemx       12345 Water           3  2010      1        0          0        3.57
14 chemx       12345 Water           3  2013      1        0          0        4.67
15 chemx       12345 Water           5  2013      1        1        100        9.56
16 chemx       12345 Water           8  2012      1        0          0        0.56
17 chemx       12345 Water           9  2012      1        0          0        2.45
18 chemx       12345 Water           9  2014      1        0          0        3.67
19 chemx       12345 Water          10  2010      2        0          0        2.56
20 chemx       12345 Water          12  2010      1        0          0        2.67

system · October 21, 2020, 12:54am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.