This has been bothering me for a long time, and I've never been able to gain an understanding of what is going on here.
We've all seen it. You do a summarise() after a group_by() and you get this message...
`summarise()` has grouped output by 'X'. You can override using the `.groups` argument.
The more I think about this, the more troubled I get. There must be a reason for this message. I struggle to understand what it's trying to tell me. And yes, I realize it's harmless, but this bugs me and I want to understand what the point of these messages truly is.
Here's an example, taken from YouTube:
Let's first create a tibble...
library(tidyverse)
data <- tibble(gr1 = rep(LETTERS[1:4], each = 3),
               gr2 = rep(letters[1:2], times = 6),
               values = 101:112)
data
OK, nothing fancy. Here it is...
#> # A tibble: 12 × 3
#>    gr1   gr2   values
#>    <chr> <chr>  <int>
#>  1 A     a        101
#>  2 A     b        102
#>  3 A     a        103
#>  4 B     b        104
#>  5 B     a        105
#>  6 B     b        106
#>  7 C     a        107
#>  8 C     b        108
#>  9 C     a        109
#> 10 D     b        110
#> 11 D     a        111
#> 12 D     b        112
Now let's group by gr1 and gr2 and summarize the sum of each group...
data_group <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values)) |>
  ungroup()
And there it is... THAT ###### MESSAGE. It always causes a millisecond of panic when it streams past your face.
#> `summarise()` has grouped output by 'gr1'. You can override using the `.groups` argument.
As always, it's a false alarm. Everything worked as expected anyway, whew!
data_group
#> # A tibble: 8 × 3
#> # Groups:   gr1 [4]
#>   gr1   gr2   gr_sum
#>   <chr> <chr>  <int>
#> 1 A     a        204
#> 2 A     b        102
#> 3 B     a        105
#> 4 B     b        210
#> 5 C     a        216
#> 6 C     b        108
#> 7 D     a        111
#> 8 D     b        222
That's good enough for lots of folks... just ignore the message. Some of us may be inspired to suppress the message with an options(dplyr.summarise.inform = FALSE), like the YouTube video I linked to above suggests. Some might dig into the docs and supply a .groups = 'drop_last' to summarise() to make it shut up.
But LOOK at the message:
`summarise()` has grouped output by 'gr1'. You can override using the `.groups` argument.
That seems wrong! I JUST grouped using group_by(gr1, gr2). I grouped by BOTH gr1 and gr2. Is it a matter of semantics here? Is it that summarise grouped the output by gr1?? And anyway, I ALSO called ungroup() at the end, so there should be NO GROUPS(*). The message is very confusing. How does it help anyone?
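One way to probe the "semantics" question is to inspect the grouping at each stage of the pipeline. This is a minimal sketch, assuming dplyr 1.0.0 or later and R 4.1 or later (for the native pipe); the intermediate names `grouped`, `summarised`, and `ungrouped` are only for illustration. `.groups = "drop_last"` is spelled out so the sketch runs quietly; it is the documented default when every group yields one row.

```r
library(dplyr)

data <- tibble(gr1 = rep(LETTERS[1:4], each = 3),
               gr2 = rep(letters[1:2], times = 6),
               values = 101:112)

# Split the pipeline so the grouping can be inspected after each step
grouped    <- data |> group_by(gr1, gr2)
summarised <- grouped |> summarise(gr_sum = sum(values), .groups = "drop_last")
ungrouped  <- summarised |> ungroup()

group_vars(grouped)     # "gr1" "gr2"  (what group_by() set up)
group_vars(summarised)  # "gr1"        (summarise() peeled off gr2)
group_vars(ungrouped)   # character(0) (ungroup() removed the rest)
```

If this reading is right, the message describes the middle step: the sums are computed per (gr1, gr2) pair, but the tibble that summarise() hands on is grouped by gr1 only.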
Many of us prefer to write code that doesn't blast out warnings but that also doesn't mask them. My inclination is to follow the advice of the message and put in a .groups argument. The message doesn't remind me what to give the .groups arg, so that's a quick trip to the docs, where one is presented with some options...
- "drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
- "drop": All levels of grouping are dropped.
- "keep": Same grouping structure as .data.
- "rowwise": Each row is its own group.
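To see what each option does to the example above, here is a small sketch, again assuming dplyr 1.0.0 or later. The helper `summarise_with()` is hypothetical, written only for this comparison.

```r
library(dplyr)

data <- tibble(gr1 = rep(LETTERS[1:4], each = 3),
               gr2 = rep(letters[1:2], times = 6),
               values = 101:112)

# Hypothetical helper: same summary, different .groups setting
summarise_with <- function(groups) {
  data |>
    group_by(gr1, gr2) |>
    summarise(gr_sum = sum(values), .groups = groups)
}

group_vars(summarise_with("drop_last"))            # "gr1"
group_vars(summarise_with("drop"))                 # character(0)
group_vars(summarise_with("keep"))                 # "gr1" "gr2"
inherits(summarise_with("rowwise"), "rowwise_df")  # TRUE

# "drop" refers to grouping metadata only: all 8 summary rows are still there
nrow(summarise_with("drop"))                       # 8
```

So "drop" here does not delete any data. It only controls whether the result still remembers its groups.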
I am not a huge fan of the word "drop" when it comes to anything related to data. It has a connotation of deletion, like, permanent deletion. I invariably choose "drop_last" simply because the words in the docs suggest that's what was used earlier on in the days before this message started popping up.
But going on to the next paragraph in the docs...
When .groups is not specified, it is chosen based on the number of rows of the results:
- If all the results have 1 row, you get "drop_last".
- If the number of rows varies, you get "keep" (note that returning a variable number of rows was deprecated in favor of reframe(), which also unconditionally drops all levels of grouping).
In addition, a message informs you of that choice, unless the result is ungrouped, the option "dplyr.summarise.inform" is set to FALSE, or when summarise() is called from a function in a package.
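A quick experiment may help pin down what "unless the result is ungrouped" refers to. This is a sketch under assumptions: dplyr 1.0.0 or later, and a hypothetical helper `emits_message()` that just reports whether evaluating an expression signals a message. The option is forced to TRUE so the sketch behaves the same whether it is run interactively or from a script.

```r
library(dplyr)

# Assumption: force the message on, regardless of where this code is run from
options(dplyr.summarise.inform = TRUE)

data <- tibble(gr1 = rep(LETTERS[1:4], each = 3),
               gr2 = rep(letters[1:2], times = 6),
               values = 101:112)

# Hypothetical helper: TRUE if evaluating `expr` signals a message
emits_message <- function(expr) {
  tryCatch({ expr; FALSE }, message = function(m) TRUE)
}

# Two grouping variables: the summarise() result is still grouped by gr1
two_vars <- emits_message(
  data |> group_by(gr1, gr2) |> summarise(gr_sum = sum(values))
)

# One grouping variable: dropping it leaves an ungrouped result
one_var <- emits_message(
  data |> group_by(gr1) |> summarise(gr_sum = sum(values))
)

# ungroup() afterwards runs too late to affect what summarise() reports
then_ungroup <- emits_message(
  data |> group_by(gr1, gr2) |> summarise(gr_sum = sum(values)) |> ungroup()
)

c(two_vars, one_var, then_ungroup)  # TRUE FALSE TRUE
```

This suggests "the result" in that paragraph means the value returned by summarise() itself. A later ungroup() in the same pipeline does not count, because the message has already been emitted by the time it runs.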
Emphasis mine, on "unless the result is ungrouped". I did actually ungroup() immediately after summarise(). Are the docs saying that I should not have gotten the message?
Sorry if this sounds a bit complain-y. I really love tidyverse and dplyr. I use it a lot. Would just like to understand the underlying rationale behind this message and get some tips for better usage. Messages like that make me think that something is wrong. I would just like to write better code.
(*) To be honest, I don't remember why I almost always ungroup() at the end of a dplyr stanza where I've done a group_by(). Maybe once or twice I've actually needed the groups after a stanza (can't remember why now, but I do remember telling myself "hmm... that's interesting, here's a situation where I don't want to ungroup()"). Nor do I understand the consequences of leaving the groups in. Does it matter? Can I just call group_by(...) again as needed regardless of existing grouping?
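Both footnote questions can be checked on the example data. This is a sketch assuming dplyr 1.0.0 or later; the names `leftover`, `per_group`, `overall`, `replaced`, and `added` are only for illustration.

```r
library(dplyr)

data <- tibble(gr1 = rep(LETTERS[1:4], each = 3),
               gr2 = rep(letters[1:2], times = 6),
               values = 101:112)

# Summary that is still grouped by gr1 (same as the default, minus the message)
leftover <- data |>
  group_by(gr1, gr2) |>
  summarise(gr_sum = sum(values), .groups = "drop_last")

# Consequence of leftover groups: later verbs keep working per group
per_group <- leftover |>
  mutate(total = sum(gr_sum)) |>
  pull(total) |>
  unique()
per_group  # 306 315 324 333 (one total per gr1)

overall <- leftover |>
  ungroup() |>
  mutate(total = sum(gr_sum)) |>
  pull(total) |>
  unique()
overall    # 1278 (a single grand total)

# group_by() replaces existing grouping unless .add = TRUE is given
replaced <- group_vars(leftover |> group_by(gr2))
added    <- group_vars(leftover |> group_by(gr2, .add = TRUE))
replaced   # "gr2"
added      # "gr1" "gr2"
```

So leftover groups do matter for whatever comes next, which is presumably why the habit of ending with ungroup() is common. Calling group_by() again does work regardless of existing grouping, since by default it overrides whatever was there.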