Understanding group_by: order matters?

irene · January 30, 2019, 7:06am

In the process of trying to pinpoint when ungroup() is important, I realized I may not understand how group_by() works as well as I had thought, especially when it is used in combination with summarize().

In particular, I didn't realize that the order of variables within group_by() matters. It appears that after a summarize(), only the first grouping variable remains grouped. This result was totally unintuitive to me, so I figured I'd ask: does this make sense to everyone, or is it a bug?

For example, if I adapt the example from this thread:

library(tidyverse)

data.frame(Titanic) %>% 
    group_by(Class, Age) %>%  
    summarize(Freq = sum(Freq)) %>% 
    mutate(Class = reorder(Class, Freq))
#> Error in mutate_impl(.data, dots): Column `Class` can't be modified because it's a grouping variable

# When I switch the order within group_by(), it works
data.frame(Titanic) %>% 
    group_by(Age, Class) %>% 
    summarize(Freq = sum(Freq)) %>% 
    mutate(Class = reorder(Class, Freq))
#> # A tibble: 8 x 3
#> # Groups:   Age [2]
#>   Age   Class  Freq
#>   <fct> <fct> <dbl>
#> 1 Child 1st       6
#> 2 Child 2nd      24
#> 3 Child 3rd      79
#> 4 Child Crew      0
#> 5 Adult 1st     319
#> 6 Adult 2nd     261
#> 7 Adult 3rd     627
#> 8 Adult Crew    885

martin.R · January 30, 2019, 7:38am

The last item in the list gets stripped from the group after summarise().

You should generally ungroup() unless you need to operate on the reduced group at the next step.

jdlong · January 30, 2019, 12:31pm

@martin.R is spot on. Any summarize drops the last grouping variable because otherwise any following operations would be row-wise operations, as there is, by definition, only one row per original group. But don't feel like you should know this; I agree it's not intuitive. Every few months someone submits this as a bug on the dplyr GitHub site. I had submitted a suggestion that dplyr should print an informational message explaining the dropped grouping. But the executive decision is "this is by design".
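To make the dropped grouping visible, here is a minimal sketch using the same Titanic data as the original post. The object names (by_class_age, by_age_class) are made up for illustration; group_vars() is dplyr's helper for listing the current grouping variables. Newer dplyr versions may also print a message about the resulting grouping when summarize() runs.

```r
library(dplyr)

titanic <- data.frame(Titanic)

# Grouped by Class, then Age: summarize() peels off the last variable (Age)
by_class_age <- titanic %>%
    group_by(Class, Age) %>%
    summarize(Freq = sum(Freq))
group_vars(by_class_age)
#> [1] "Class"

# Grouped by Age, then Class: now Class is the one that gets dropped
by_age_class <- titanic %>%
    group_by(Age, Class) %>%
    summarize(Freq = sum(Freq))
group_vars(by_age_class)
#> [1] "Age"
```

This is why the mutate() on Class fails in the first example of the thread (Class is still a grouping variable) and succeeds in the second (only Age is).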

When to ungroup is sort of a different question. You can do all sorts of things with grouped data and not have any issue. Until you do. So my really unsophisticated approach is that I ungroup when grouped data does not do what I need it to. Yeah, that's not a very informative heuristic, I realize. My single most common reason for doing an ungroup is that I want to drop a variable that's in the grouping. dplyr will not let us drop a variable that's in a grouping.
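As a rough illustration of the "can't drop a grouping variable" case, the sketch below tries to select() away Age while the data is still grouped by it, then does the same after ungroup(). The names by_age, kept, and dropped are hypothetical, and the exact message dplyr prints may differ between versions.

```r
library(dplyr)

# After summarize(), the result is still grouped by Age
by_age <- data.frame(Titanic) %>%
    group_by(Age, Class) %>%
    summarize(Freq = sum(Freq))

# While grouped, select() keeps the grouping column even though it was not asked for
kept <- by_age %>% select(Class, Freq)
names(kept)

# After ungroup(), the column can be dropped like any other
dropped <- by_age %>% ungroup() %>% select(Class, Freq)
names(dropped)
```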

Some people get rather prescriptive and make sure that for every group_by there's an associated ungroup as soon as possible. I can't live with that level of fascism, myself. But if others need heavy rules in order to sleep at night, then maybe it's a good idea.

martin.R · January 30, 2019, 12:38pm

I didn't realise we'd skirt this close to Godwin's Law over ungroup()!

My experience is that not ungrouping has bitten me and leaving data grouped has no advantage unless you explicitly need that grouping in a chain. Other things keep me awake at night instead ...

irene · January 30, 2019, 3:00pm

^ This helps! I hadn't thought of it this way before. I like thinking of group_by(x, y) and ungroup() as markers that say "everything between these two functions is grouped by x and y," but it looks like I need to adjust my mental model in the case of summarize().

At this point, I still think the automated dropping of a group (didn't realize it was specifically the last item, thanks @martin.R) is easy to forget if you're not paying close attention. For whatever reason, keeping all groupings (or even dropping all groupings) after summarize() feels more intuitive.

benj · February 1, 2019, 1:18am

Would this be helpful? https://github.com/elbersb/tidylog/issues/1

The tidylog package could also print a message after a summarize command to let the user know which groups are remaining, for instance:

data.frame(Titanic) %>% 
    group_by(Age, Class) %>% 
    summarize(Freq = sum(Freq)) %>% 
    mutate(Class = reorder(Class, Freq))
#> group_by: 8 groups [Age, Class] 
#> mutate: changed 8 values (100%) of 'Class', factor levels updated 
#> summarize: 2 groups remaining [Class]

irene · February 2, 2019, 9:26am

Yeah, I think that would be neat, @benj ! Though in your example, [Age] would remain after summarizing.
