`summarise()` has grouped output by 'X'. You can override using the `.groups` argument.

This has been bothering me for a long time, and I've never been able to gain an understanding of what is going on here.

We've all seen it. You call summarise() after a group_by() and you get this message...

`summarise()` has grouped output by 'X'. You can override using the `.groups` argument.

The more I think about this, the more troubled I get. There must be a reason for this message, but I struggle to understand what it's trying to tell me. And yes, I realize it's harmless, but it bugs me and I want to understand what the point of this message truly is.

Here's an example, taken from YouTube:
Let's first create a tibble...

library(tidyverse)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)
data 

OK, nothing fancy. Here it is...

#> # A tibble: 12 × 3
#>    gr1   gr2   values
#>    <chr> <chr>  <int>
#>  1 A     a        101
#>  2 A     b        102
#>  3 A     a        103
#>  4 B     b        104
#>  5 B     a        105
#>  6 B     b        106
#>  7 C     a        107
#>  8 C     b        108
#>  9 C     a        109
#> 10 D     b        110
#> 11 D     a        111
#> 12 D     b        112

Now let's group by gr1 and gr2 and summarize the sum of each group...

data_group <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values)) |>
  ungroup()

And there it is... THAT ###### MESSAGE, it always causes a millisecond of panic when it streams past your face.

#> `summarise()` has grouped output by 'gr1'. You can override using the `.groups` argument.

As always, it's a false alarm. Everything worked as expected anyway, whew!

data_group
#> # A tibble: 8 × 3
#>   gr1   gr2   gr_sum
#>   <chr> <chr>  <int>
#> 1 A     a        204
#> 2 A     b        102
#> 3 B     a        105
#> 4 B     b        210
#> 5 C     a        216
#> 6 C     b        108
#> 7 D     a        111
#> 8 D     b        222

That's good enough for lots of folks... just ignore the message. Some of us may be inspired to suppress the message with options(dplyr.summarise.inform = FALSE) -- like the YouTube video I linked to above suggests. Some might dig into the docs and supply .groups = 'drop_last' to summarise() to make it shut up.
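
For reference, here is a minimal sketch of those two ways of silencing the message. It assumes dplyr 1.0.0 or later and R 4.1 or later (for the |> pipe), and it loads only dplyr rather than the whole tidyverse; the object names explicit, silenced and old are just illustrative.

```r
library(dplyr)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

# Option 1: state the grouping of the result explicitly.
# With .groups supplied, summarise() has nothing to report.
explicit <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop_last")
group_vars(explicit)
#> [1] "gr1"

# Option 2: switch the message off for the session.
# The default grouping of the result is unchanged; only the message goes away.
old <- options(dplyr.summarise.inform = FALSE)
silenced <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values))
options(old)  # restore the previous setting
group_vars(silenced)
#> [1] "gr1"
```

Either way the result is still grouped by gr1; the two options differ only in whether that choice is visible in the code.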

But LOOK at the message:

`summarise()` has grouped output by 'gr1'. You can override using the `.groups` argument.

That seems wrong! I JUST grouped using group_by(gr1, gr2). I grouped by BOTH gr1 and gr2. Is it a matter of semantics here? Is it that summarise() grouped the output by gr1?? And anyway, I ALSO called ungroup() at the end, so there should be NO GROUPS(*). The message is very confusing. How does it help anyone?

Many of us prefer to write code that doesn't blast out warnings but that also doesn't mask them. My inclination is to follow the advice of the message and put in a .groups argument. It doesn't remind me in the message what to give the .groups arg, so that's a quick trip to the docs where one is presented with some options...

  • "drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
  • "drop": All levels of grouping are dropped.
  • "keep": Same grouping structure as .data.
  • "rowwise": Each row is its own group.
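
To see what each of these options leaves behind, a small sketch may help. It assumes dplyr 1.0.0 or later; groups_after() is a hypothetical helper written for this illustration, not part of dplyr.

```r
library(dplyr)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

grouped <- data |> group_by(gr1, gr2)

# Hypothetical helper: which grouping variables remain after summarise()?
groups_after <- function(option) {
  grouped |>
    summarise(gr_sum = sum(values), .groups = option) |>
    group_vars()
}

groups_after("drop_last")
#> [1] "gr1"
groups_after("drop")
#> character(0)
groups_after("keep")
#> [1] "gr1" "gr2"

# "rowwise" returns a rowwise data frame instead of a grouped one.
rw <- grouped |> summarise(gr_sum = sum(values), .groups = "rowwise")
inherits(rw, "rowwise_df")
#> [1] TRUE

# "drop" removes grouping metadata only; all 8 summary rows are still there.
dropped <- grouped |> summarise(gr_sum = sum(values), .groups = "drop")
nrow(dropped)
#> [1] 8
```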

I am not a huge fan of the word "drop" when it comes to anything related to data. It has a connotation of deletion, like, permanent deletion. I invariably choose "drop_last" simply because the words in the docs suggest that's what was used earlier on in the days before this message started popping up.

But going on to the next paragraph in the docs...

When .groups is not specified, it is chosen based on the number of rows of the results:

  • If all the results have 1 row, you get "drop_last".
  • If the number of rows varies, you get "keep" (note that returning a variable number of rows was deprecated in favor of reframe(), which also unconditionally drops all levels of grouping).

In addition, a message informs you of that choice, unless the result is ungrouped, the option "dplyr.summarise.inform" is set to FALSE, or when summarise() is called from a function in a package.

The emphasis on "unless the result is ungrouped" is mine. I did actually ungroup() immediately after summarise(). Are the docs saying that I should not have gotten the message?
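
One way to probe what "the result is ungrouped" could mean is to group by a single variable, so that the default "drop_last" leaves nothing to group by. This is only a sketch, assuming dplyr 1.0.0 or later; one_var and two_vars are illustrative names.

```r
library(dplyr)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

# One grouping variable: "drop_last" drops the only level,
# so what summarise() returns is already ungrouped.
one_var <- data |>
  group_by(gr1) |>
  summarise(gr_sum = sum(values))
is_grouped_df(one_var)
#> [1] FALSE

# Two grouping variables: what summarise() returns is still grouped by gr1,
# whatever happens to it in later steps of the pipeline.
two_vars <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop_last")
group_vars(two_vars)
#> [1] "gr1"
```

Read this way, "the result" in the docs would be what summarise() itself returns, not what the whole pipeline produces after a later ungroup().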

Sorry if this sounds a bit complain-y. I really love tidyverse and dplyr. I use it a lot. Would just like to understand the underlying rationale behind this message and get some tips for better usage. Messages like that make me think that something is wrong. I would just like to write better code.

(*) To be honest, I don't remember why I almost always ungroup() at the end of a dplyr stanza where I've done a group_by(). Maybe once or twice I've actually needed the groups after a stanza (can't remember why now, but I do remember telling myself "hmm... that's interesting, here's a situation where I don't want to ungroup()"). Nor do I understand the consequences of leaving the groups in. Does it matter? Can I just call group_by(...) again as needed regardless of existing grouping?
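
A quick experiment along these lines, as a sketch assuming dplyr 1.0.0 or later (by_gr1 and shares are illustrative names):

```r
library(dplyr)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

# A summary that is still grouped by gr1.
by_gr1 <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop_last")

# By default, a new group_by() replaces whatever grouping was there.
by_gr1 |> group_by(gr2) |> group_vars()
#> [1] "gr2"

# With .add = TRUE it adds to the existing grouping instead.
by_gr1 |> group_by(gr2, .add = TRUE) |> group_vars()
#> [1] "gr1" "gr2"

# Leftover groups do matter for later verbs: this sum() is computed
# within each gr1 group, so the shares add up to 1 per group, not overall.
shares <- by_gr1 |> mutate(share = gr_sum / sum(gr_sum))
```

So calling group_by() again is safe regardless of existing grouping, but verbs such as mutate(), filter() and summarise() applied before that will quietly work per group.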

The message about summarize() grouping the output is produced by summarize() itself, so it describes the result at the moment summarize() returns, before the subsequent ungroup() runs. Your tibble named data_group is ungrouped in the next step, but summarize() has no way of knowing that. If you habitually end such blocks of code with ungroup(), you could instead set the .groups argument to "drop". Here is a demonstration of the effect of the default behavior when there is a subsequent summarize(), followed by an example of using ungroup().

library(tidyverse)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

data_group <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values)) #eliminate the ungroup() function
#> `summarise()` has grouped output by 'gr1'. You can override using the `.groups`
#> argument.
data_group
#> # A tibble: 8 × 3
#> # Groups:   gr1 [4]
#>   gr1   gr2   gr_sum
#>   <chr> <chr>  <int>
#> 1 A     a        204
#> 2 A     b        102
#> 3 B     a        105
#> 4 B     b        210
#> 5 C     a        216
#> 6 C     b        108
#> 7 D     a        111
#> 8 D     b        222

data_group |> summarize(Total = sum(gr_sum)) #data_group is grouped by gr1
#> # A tibble: 4 × 2
#>   gr1   Total
#>   <chr> <int>
#> 1 A       306
#> 2 B       315
#> 3 C       324
#> 4 D       333

data_group2 <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values)) |>  
  ungroup()
#> `summarise()` has grouped output by 'gr1'. You can override using the `.groups`
#> argument.

data_group2 |> summarize(Total = sum(gr_sum)) #data_group2 has no groups
#> # A tibble: 1 × 1
#>   Total
#>   <int>
#> 1  1278

Created on 2023-10-03 with reprex v2.0.2

I don't know why the default is "drop_last". My reflex would be to make "drop" the default but people way smarter than me decided otherwise.

I think this is for backward compatibility, but it's a horrible default in my opinion and can lead to unexpected behaviour.

If there is more than one grouping variable, then I would always specify the .groups option or add ungroup() at the end (which works when group_by() is used without summarise()).
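
As a sketch of those two approaches side by side (assuming dplyr 1.0.0 or later; with_drop and with_ungroup are illustrative names):

```r
library(dplyr)

data <- tibble(gr1    = rep(LETTERS[1:4], each = 3),
               gr2    = rep(letters[1:2], times = 6),
               values = 101:112)

# Approach 1: ask summarise() for an ungrouped result directly.
with_drop <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop")

# Approach 2: be explicit about the default, then ungroup() afterwards.
with_ungroup <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop_last") |>
  ungroup()

# Neither result carries any grouping.
group_vars(with_drop)
#> character(0)
group_vars(with_ungroup)
#> character(0)

# A later summarise() therefore collapses everything to a single row.
with_drop |> summarise(Total = sum(gr_sum))
#> # A tibble: 1 × 1
#>   Total
#>   <int>
#> 1  1278
```

Neither version prints the message, because .groups is supplied in both.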
