Mutate versus Summarize in a rowwise workflow

laderast · October 10, 2020, 8:30pm

Hi All,

I'm currently working on a rowwise() tutorial (https://github.com/laderast/tidyowl) and I realized that one thing I don't understand is the behavior of mutate() versus summarize() in a rowwise workflow.

They seem to do something similar, except that summarize() lets you select the output columns. For example:

penguins %>%
  rowwise() %>%
  summarize(species, island,
            sum_mm = bill_length_mm + flipper_length_mm)

will only return the species, island, and sum_mm columns, but

penguins %>%
  rowwise() %>%
  mutate(sum_mm = bill_length_mm + flipper_length_mm)

will add the sum_mm column to the original data.frame.

Am I missing something here, or is that the basic difference between the two in a rowwise() workflow?

Thanks,
Ted

siddharthprabhu · October 11, 2020, 7:28am

Using summarise() along with rowwise() doesn't make much sense since summarise() performs aggregation and you can't aggregate single-row groups any further.

Supplying variables to summarise() without an aggregation method is essentially asking for them to be returned unchanged (i.e. summarise(species, island) is equivalent to summarise(species = species, island = island)).
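A minimal sketch of that equivalence, assuming a small made-up tibble (`toy` is a hypothetical stand-in for penguins, not data from this thread). It compares the unnamed and named forms of summarise() on rowwise data:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe")
)

# Bare column names are auto-named after the expression
unnamed <- toy %>%
  rowwise() %>%
  summarise(species, island)

# Spelling the names out explicitly
named <- toy %>%
  rowwise() %>%
  summarise(species = species, island = island)

# Compare column names and contents of the two results
identical(names(unnamed), names(named))
all.equal(as.data.frame(unnamed), as.data.frame(named))
```

Because every rowwise group holds exactly one row, each expression returns one value per group and the columns pass through untouched.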

By the way, arithmetic operators like + are vectorized, so they already work element by element down the columns and you don't need rowwise() here.

library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)

penguins %>%
  mutate(sum_mm = bill_length_mm + flipper_length_mm, .keep = "used")
#> # A tibble: 344 x 3
#>    bill_length_mm flipper_length_mm sum_mm
#>             <dbl>             <int>  <dbl>
#>  1           39.1               181   220.
#>  2           39.5               186   226.
#>  3           40.3               195   235.
#>  4           NA                  NA    NA 
#>  5           36.7               193   230.
#>  6           39.3               190   229.
#>  7           38.9               181   220.
#>  8           39.2               195   234.
#>  9           34.1               193   227.
#> 10           42                 190   232 
#> # ... with 334 more rows

Created on 2020-10-11 by the reprex package (v0.3.0)
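For contrast, here is a rough sketch of a case where rowwise() does change the result: a function such as sum() that collapses its inputs to a single value. The tibble `toy` and its columns `a` and `b` are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# Without rowwise(), sum() sees the whole columns and returns one
# grand total, which is then recycled into every row
no_rowwise <- toy %>%
  mutate(total = sum(a, b))

# With rowwise(), each row is its own group, so sum() sees one value
# of a and one value of b at a time
with_rowwise <- toy %>%
  rowwise() %>%
  mutate(total = sum(a, b)) %>%
  ungroup()

no_rowwise$total
#> [1] 66 66 66
with_rowwise$total
#> [1] 11 22 33
```

The ungroup() at the end drops the rowwise grouping so that later verbs behave as usual.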

laderast · October 11, 2020, 3:15pm

Thanks for the response.

I am going off the rowwise() vignette, which does cover using rowwise() %>% summarize(), so it's not like it's an unseen pattern. https://dplyr.tidyverse.org/articles/rowwise.html

joels · October 12, 2020, 8:31pm

It's in the rowwise vignette, but it's an unusual approach, and, at least for me, a confusing one. Normally, one uses summarise to collapse a data frame down to a smaller number of rows while generating summary values (counts, averages, etc.) and mutate to add columns to a data frame without changing the rest of the data frame.

In the code below, the first example is from your question and parallels the rowwise vignette. The second example is what I think would be considered the "typical" way of achieving the same goal.

library(palmerpenguins)
library(tidyverse)

penguins %>%
  rowwise() %>%
  summarize(species, island,
            sum_mm = bill_length_mm + flipper_length_mm) 

penguins %>% 
  mutate(sum_mm = bill_length_mm + flipper_length_mm) %>% 
  select(species, island, sum_mm)
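One way to check that the two approaches agree is to run both on a small tibble and compare the results. This is only a sketch: `toy` is a made-up stand-in that borrows the penguins column names, so it runs without palmerpenguins installed.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for penguins with the same column names
toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, NA),
  flipper_length_mm = c(181L, 211L, 192L)
)

# rowwise() + summarise(), as in the vignette-style example
via_summarise <- toy %>%
  rowwise() %>%
  summarise(species, island,
            sum_mm = bill_length_mm + flipper_length_mm)

# mutate() + select(), the more typical route
via_mutate <- toy %>%
  mutate(sum_mm = bill_length_mm + flipper_length_mm) %>%
  select(species, island, sum_mm)

# Compare the contents as plain data frames
all.equal(as.data.frame(via_summarise), as.data.frame(via_mutate))
```

Both pipelines should yield the same three columns, with the missing bill length propagating to an NA in sum_mm either way.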

laderast · October 12, 2020, 10:13pm

Yes, thanks for your feedback. Given both of these responses, I think it's probably easier to not talk about summarize() and just talk about mutate() in a rowwise context, with select() as an option.
