Data wrangling via group_by and merge effectively

AbhishekHP · June 28, 2019, 11:46am

Is there a better way to obtain z without using separate x,y variable.
They are used for 2 different ways of grouping variables.

library(tidyverse)
x = iris %>%
  group_by(Species) %>% 
  summarize(p = sum(Petal.Width))

y = iris %>%
  group_by(Species, Petal.Length) %>% 
  summarize(p = sum(Petal.Width))

z = merge(x,y, by = "Species")

Yarnabrina · June 28, 2019, 1:10pm

Are you after something like this?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  group_by(Species, Petal.Length) %>% 
  summarize(p.y = sum(Petal.Width)) %>%
  mutate(p.x = sum(p.y)) %>%
  ungroup()
#> # A tibble: 48 x 4
#>    Species    Petal.Length   p.y   p.x
#>    <fct>             <dbl> <dbl> <dbl>
#>  1 setosa              1     0.2  12.3
#>  2 setosa              1.1   0.1  12.3
#>  3 setosa              1.2   0.4  12.3
#>  4 setosa              1.3   1.8  12.3
#>  5 setosa              1.4   2.7  12.3
#>  6 setosa              1.5   3.1  12.3
#>  7 setosa              1.6   2    12.3
#>  8 setosa              1.7   1.4  12.3
#>  9 setosa              1.9   0.6  12.3
#> 10 versicolor          3     1.1  66.3
#> # … with 38 more rows

^{Created on 2019-06-28 by the reprex package (v0.3.0)}

AbhishekHP · June 28, 2019, 2:09pm

Thanks for quick help.
using mutate, can we summarize for different grouping variable as well ?
Meaning...
in the below code, p.x summed for every Species but
can we use this approach for while summarizing for every Petal.Length ? since we don't mention the grouping variable, it was confusing for me.

iris %>%
  group_by(Species, Petal.Length) %>% 
  summarize(p.y = sum(Petal.Width)) %>%
  mutate(p.x = sum(p.y)) %>%
  ungroup()

Yarnabrina · June 28, 2019, 2:49pm

Sorry, I don't follow you. Doesn't it generate the same result as your z? The columns are reordered, but I don't think that's a big deal.

I used the names p.x and p.y to simplify comparison between this and z. There, p.y was calculated after grouping by both Species and Petal.Length, and p.x was calculated after grouping only by Species. Why do you want to summarise with grouping by Petal.Length now?

I get confused too, and to keep track of grouping, I just print out the intermediate results to help myself. %T>% is very helpful for this. Based on what I've noticed, if you group by multiple variables, one summarise will drop the last group variable, another will drop the second last, and so on.

If my solution doesn't work, can you please tell me what do you want to do and share your expected output?

AbhishekHP · June 28, 2019, 3:02pm

Actually, I meant instead of summarizing for every Species, can we choose to summarize for Petal.Length.

For example: What changes we should make in your code (mutate) to obtain xy and zy results ?

x = iris %>%
  group_by(Species) %>% 
  summarize(p = sum(Petal.Width))

y = iris %>%
  group_by(Species, Petal.Length) %>% 
  summarize(p = sum(Petal.Width))

xy = merge(x,y, by = "Species")

z = iris %>%
  group_by(Petal.Length) %>% 
  summarize(p = sum(Petal.Width))
  
zy = merge(z,y, by = "Petal.Length")

AbhishekHP · July 1, 2019, 2:33pm

Thanks Yarnabrina for quick code...another group_by would produce expected result.

iris %>%
group_by(Species, Petal.Length) %>%
summarize(p.y = sum(Petal.Width)) %>%
group_by(Petal.Width) %>%
mutate(p.x = sum(p.y)) %>%
ungroup()

This would yield zy result

Yarnabrina · July 1, 2019, 4:17pm

This is weird. I posted a code for zy result, but it's missing now and I've no idea why!!

Well, I can't run your code, and it shows:

Error: Column Petal.Width is unknown

It's not surprising, since there's no Petal.Width column after the first summarise. Please check.

A way to get zy can be the following, but may be others can make it better.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

y <- iris %>%
  group_by(Species, Petal.Length) %>% 
  summarize(p = sum(Petal.Width))
z <- iris %>%
  group_by(Petal.Length) %>% 
  summarize(p = sum(Petal.Width))

expected <- merge(x = z,
                  y = y,
                  by = "Petal.Length")

iris %>%
  mutate(Species = as.character(x = Species)) %>%
  bind_rows(mutate(.data = iris,
                   Species = "Overall")) %>%
  group_by(Species, Petal.Length) %>%
  summarise(p = sum(Petal.Width)) %>%
  ungroup() %>%
  group_split((Species != "Overall"),
              keep = FALSE) %>%
  inner_join(x = select(.data = .[[1]],
                        -Species),
             y = mutate(.data = .[[2]],
                        Species = as.factor(x = Species)),
             by = "Petal.Length") %>%
  as.data.frame() %>%
  all.equal(target = arrange(.data = expected,
                             Petal.Length, Species))
#> [1] TRUE

^{Created on 2019-07-01 by the reprex package (v0.3.0)}

joels · July 2, 2019, 4:31pm

Another option is to use map to iterate over the sets of grouping columns and then reduce for merging. For example:

library(tidyverse)

list("Species", c("Species", "Petal.Length")) %>%
  set_names(c("_Species", "_Species_Petal.Length")) %>% 
  map(~iris %>%
        group_by_at(.x) %>% 
        summarize(p = sum(Petal.Width))) %>% 
  reduce(left_join, by="Species", suffix=names(.)) %>% 
  select(Species, Petal.Length, everything())

# A tibble: 48 x 4
   Species    Petal.Length p_Species p_Species_Petal.Length
   <fct>             <dbl>     <dbl>                  <dbl>
 1 setosa              1        12.3                    0.2
 2 setosa              1.1      12.3                    0.1
 3 setosa              1.2      12.3                    0.4
 4 setosa              1.3      12.3                    1.8
 5 setosa              1.4      12.3                    2.7
 6 setosa              1.5      12.3                    3.1
 7 setosa              1.6      12.3                    2  
 8 setosa              1.7      12.3                    1.4
 9 setosa              1.9      12.3                    0.6
10 versicolor          3        66.3                    1.1
# … with 38 more rows

You can use the map approach to summarize over multiple sets of grouping columns and return a long summary data frame. For example:

# Add a couple of grouping columns to iris
set.seed(2)
dat = iris %>% 
  mutate(Group1=sample(c("A","B"), 150, replace=TRUE),
         Group2=sample(c("d","e"), 150, replace=TRUE))

# Get all combinations of 0 through 2 groups
map(0:2, ~combn(c("Species", "Group1","Group2"), .x, simplify=FALSE)) %>%
  flatten() %>% 
  # Run summarise on each of the group sets created above
  map_df(~dat %>% 
        group_by_at(.x) %>% 
        summarise(N=n(),
                  Petal.Width=mean(Petal.Width))) %>% 
  # Replace NA with "All" (representing marginalizing over that column)
  mutate_if(is.factor, as.character) %>% 
  map_if(~!is.numeric(.), ~replace_na(., replace="All")) %>% 
  bind_rows()

     N Petal.Width    Species Group1 Group2
1  150   1.1993333        All    All    All
2   50   0.2460000     setosa    All    All
3   50   1.3260000 versicolor    All    All
4   50   2.0260000  virginica    All    All
5   81   1.2444444        All      A    All
6   69   1.1463768        All      B    All
7   76   1.1447368        All    All      d
8   74   1.2554054        All    All      e
9   25   0.2640000     setosa      A    All
10  25   0.2280000     setosa      B    All
11  30   1.3466667 versicolor      A    All
12  20   1.2950000 versicolor      B    All
13  26   2.0692308  virginica      A    All
14  24   1.9791667  virginica      B    All
15  28   0.2500000     setosa    All      d
16  22   0.2409091     setosa    All      e
17  23   1.3521739 versicolor    All      d
18  27   1.3037037 versicolor    All      e
19  25   1.9560000  virginica    All      d
20  25   2.0960000  virginica    All      e
21  41   1.2048780        All      A      d
22  40   1.2850000        All      A      e
23  35   1.0742857        All      B      d
24  34   1.2205882        All      B      e

joels · July 2, 2019, 6:15pm

Sorry, I edited my answer earlier and accidentally deleted the code to create the new grouping columns. I've fixed it now. Thanks for pointing that out.

system · July 9, 2019, 6:15pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.