I once had a problem that was solved with ungroup(), so I started using it all the time, but I'm wondering whether it's really necessary. Would love to hear what others do.
I tend to use ungroup() after every group_by() for a few reasons:
- Avoid potential unintended errors due to the grouping.
- Makes pipes more readable by explicitly pointing out places where the data is being operated on according to groups.
- I like to save transformed datasets as .Rdata objects to speed up loading times for scripts I run often and Shiny apps. The groupings are retained in such objects. By ensuring that I always ungroup, I avoid situations where I load an .Rdata object a year later and struggle with a problem, not realizing a grouping has been applied.
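The point about saved objects can be checked with a small sketch. It assumes a temporary file path and the object names grouped and ungrouped, which are made up for illustration; is_grouped_df() and group_vars() are dplyr helpers for inspecting grouping.

```r
suppressPackageStartupMessages(library(dplyr))

# a grouped dataset, as it might exist at the end of a pipeline
grouped <- data.frame(Titanic) %>%
  group_by(Class, Age)

# save it to a temporary .Rdata file, drop it, and load it back
path <- tempfile(fileext = ".Rdata")
save(grouped, file = path)
rm(grouped)
load(path)

# the grouping survives the round trip
is_grouped_df(grouped)
group_vars(grouped)

# ungrouping before saving (or after loading) removes it
ungrouped <- ungroup(grouped)
is_grouped_df(ungrouped)
```

Running group_vars() on a freshly loaded object is a quick way to find out whether a forgotten grouping is still attached.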
Hi,
That's very helpful... I will continue to ungroup. I was going to ask if you had an example in which not doing so caused a problem but I answered my own question (by chance). Here's a MWE that reproduces the error I got:
> data.frame(Titanic) %>%
group_by(Class, Age) %>%
summarize(Freq = sum(Freq)) %>%
mutate(Class = reorder(Class, Freq))
Error in mutate_impl(.data, dots) :
Column `Class` can't be modified because it's a grouping variable
Note that it doesn't happen with just one group_by variable since summarize() removes the last grouping variable:
> data.frame(Titanic) %>%
group_by(Class) %>%
summarize(Freq = sum(Freq)) %>%
mutate(Class = reorder(Class, Freq))
# A tibble: 4 x 2
Class Freq
<fct> <dbl>
1 1st 325
2 2nd 285
3 3rd 706
4 Crew 885
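A quick way to see which grouping variables are left after summarize() is group_vars(). The following is only a sketch of that check; the names two_keys and one_key are made up for illustration, and it relies on the default behaviour of summarize(), which drops the last grouping variable.

```r
suppressPackageStartupMessages(library(dplyr))

# two grouping variables: summarize() drops only the last one (Age)
two_keys <- data.frame(Titanic) %>%
  group_by(Class, Age) %>%
  summarize(Freq = sum(Freq))
group_vars(two_keys)
#> [1] "Class"

# one grouping variable: nothing is left after summarize()
one_key <- data.frame(Titanic) %>%
  group_by(Class) %>%
  summarize(Freq = sum(Freq))
group_vars(one_key)
#> character(0)
```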
EDIT (in response to @danr's post): To sum up the context: I am not asking for help debugging this code. I know that the problem is that I didn't ungroup(). The point is to illustrate why it's important to use ungroup().
group_by adds metadata to a data.frame that marks how rows should be grouped. As long as that metadata is there, you won't be able to change the factor levels of the columns involved in the grouping. See the following examples.
You should use a reproducible example for your code. See:
https://www.jessemaegan.com/post/so-you-ve-been-asked-to-make-a-reprex
As it stands, it isn't possible to tell from your code whether you meant to use plyr::summarize or dplyr::summarize.
Also, a reprex makes it possible for us to just copy and paste your code and run it in the same environment that you did. Everyone here is answering questions on their own time, so we ask that you do what you can to minimize that time... a reprex is the best way to do that.
suppressPackageStartupMessages(library(dplyr))
# first of all dplyr::group_by adds meta-data to
# the data.frame that other functions, like
# dplyr::summarize, use when they do calculations
t1 <- data.frame(Titanic) %>%
group_by(Class, Age)
# notice that the meta-data show how rows
# should be grouped
str(t1)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 5 variables:
#> $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#> $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#> $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#> $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
#> - attr(*, "vars")= chr "Class" "Age"
#> - attr(*, "drop")= logi TRUE
#> - attr(*, "indices")=List of 8
#> ..$ : int 0 4 16 20
#> ..$ : int 8 12 24 28
#> ..$ : int 1 5 17 21
#> ..$ : int 9 13 25 29
#> ..$ : int 2 6 18 22
#> ..$ : int 10 14 26 30
#> ..$ : int 3 7 19 23
#> ..$ : int 11 15 27 31
#> - attr(*, "group_sizes")= int 4 4 4 4 4 4 4 4
#> - attr(*, "biggest_group_size")= int 4
#> - attr(*, "labels")='data.frame': 8 obs. of 2 variables:
#> ..$ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#> ..$ Age : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#> ..- attr(*, "vars")= chr "Class" "Age"
#> ..- attr(*, "drop")= logi TRUE
Created on 2018-02-16 by the reprex package (v0.2.0).
suppressPackageStartupMessages(library(dplyr))
# dplyr::summarize passes along that information
t2 <- data.frame(Titanic) %>%
group_by(Class, Age) %>%
summarize(Freq = sum(Freq))
t2
#> # A tibble: 8 x 3
#> # Groups: Class [?]
#> Class Age Freq
#> <fct> <fct> <dbl>
#> 1 1st Child 6.00
#> 2 1st Adult 319
#> 3 2nd Child 24.0
#> 4 2nd Adult 261
#> 5 3rd Child 79.0
#> 6 3rd Adult 627
#> 7 Crew Child 0
#> 8 Crew Adult 885
str(t2)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 8 obs. of 3 variables:
#> $ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#> $ Age : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#> $ Freq : num 6 319 24 261 79 627 0 885
#> - attr(*, "vars")= chr "Class"
#> - attr(*, "drop")= logi TRUE
# the following fails because mutate is trying
# to change one of the columns used by group_by
# and it can see that because of the meta-data
# passed through by dplyr::summarize
suppressPackageStartupMessages(library(dplyr))
t3 <- data.frame(Titanic) %>%
group_by(Class, Age) %>%
summarize(Freq = sum(Freq)) %>%
mutate(Class = reorder(Class, Freq))
#> Error in mutate_impl(.data, dots): Column `Class` can't be modified because it's a grouping variable
# ungroup removes any grouping meta-data
suppressPackageStartupMessages(library(dplyr))
t4 <- data.frame(Titanic) %>%
group_by(Class, Age) %>%
ungroup()
# notice there is no grouping meta-data in t4
str(t4)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 5 variables:
#> $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#> $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#> $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#> $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
# so calling ungroup() before running mutate
# lets the factors be changed
suppressPackageStartupMessages(library(dplyr))
t5 <- data.frame(Titanic) %>%
group_by(Class, Age) %>%
summarize(Freq = sum(Freq)) %>%
ungroup() %>%
mutate(Class = reorder(Class, Freq))
str(t5)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of 3 variables:
#> $ Class: Factor w/ 4 levels "2nd","1st","3rd",..: 2 2 1 1 3 3 4 4
#> ..- attr(*, "scores")= num [1:4(1d)] 162 142 353 442
#> .. ..- attr(*, "dimnames")=List of 1
#> .. .. ..$ : chr "1st" "2nd" "3rd" "Crew"
#> $ Age : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#> $ Freq : num 6 319 24 261 79 627 0 885
In my experience, the most common error that results from grouping is the one you showed. You can't perform any operations on the grouping variables, meaning they can't be mutated or summarized. I tend to run into this issue when I'm using ggplot2 to visualize a dataset. I often struggle to get the labels just right using calculations, so I'll typically create a summarized view of the dataset to calculate the labels I'll need in my visual (the values for averages, medians, etc.). Whenever I have problems getting the summary view to work, it's typically because I applied some sort of grouping to the main dataset and forgot to remove it.
That's exactly what I was doing: changing factor levels for a plot. Ungroup to the rescue.
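For anyone reading later, here is a minimal sketch of the leftover-grouping problem with summary views described above. It assumes the main dataset was grouped in an earlier step and never ungrouped; the names main, label_grouped and label_overall are made up for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# main dataset with a grouping left over from an earlier step
main <- data.frame(Titanic) %>%
  group_by(Class)

# intended: one overall label value; actual: one row per Class
label_grouped <- main %>%
  summarize(mean_freq = mean(Freq))
nrow(label_grouped)
#> [1] 4

# ungrouping first gives the single overall value
label_overall <- main %>%
  ungroup() %>%
  summarize(mean_freq = mean(Freq))
nrow(label_overall)
#> [1] 1
```

Nothing errors in the grouped version, which is what makes this kind of mistake easy to miss when the result feeds a plot label.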