Multi-column-type Summarise

bhive01 · June 13, 2018, 2:19pm

Similar to here:

I'm wanting to summarize multiple classes of columns in a single summarise_all() call, with the caveat that sometimes I want even more control than that, to be able to summarise all numerics one way, most character columns with paste(collapse) and some with first.

In this example,

library(dplyr)

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column

iris %>% 
  group_by(Species, year) %>% 
  summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Now, what if I want to take the first element of Species, but I want to keep all instances of color?

I have to do them one at a time and them combine them with a join.

colors <- 
iris %>%
    group_by(Species, year) %>%
    summarise_at(vars(color), ~paste(.x, collapse="; "))

others <-
iris %>%
    select(-color) %>% group_by(Species, year) %>%
    summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

full_join(colors, others)

Is there a way to do numerics and a mixture of functions for specific columns?

tbradley · June 13, 2018, 2:26pm

Hi, it looks like your code was not formatted correctly to make it easy to read for people trying to help you. Formatting code allows for people to more easily identify where issues may be occuring, and makes it easier to read, in general. I have edited you post to format the code properly.

In the future please put code that is inline (such as a function name, like mutate or filter) inside of backticks (`mutate`) and chunks of code can be put between sets of three backticks:

```
example <- foo %>%
  filter(a == 1)
```

This process can be done automatically by highlighting your code, either inline or in a chunk, ad clicking the </> button on the toolbar of the reply window!

This will help keep our community tidy and help you get the help you are looking for!

For more information, please take a look at the community's FAQ on formating code

bhive01 · June 13, 2018, 2:28pm

Yeah... hit CMD + enter and submitted before I was done putting it together. Thought I was in R...

jcblum · June 13, 2018, 2:55pm

won’t say that hasn’t happened to me....

Thankfully, you can edit your own posts using the little grey pencil button at the bottom!

markdly · June 14, 2018, 4:25am

I recently asked a similar question so it's good to know I'm not the only one puzzling over this.

I think we can avoid using a join by slightly tweaking your existing code

Expected output

df1 <- full_join(colors, others) 
#> Joining, by = c("Species", "year")

Solution

df2 <- iris %>%
  group_by(Species, year) %>%
  summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else paste(., collapse="; "))) %>% 
  select(Species, year, color, everything())

identical(ungroup(df1), ungroup(df2))
#> [1] TRUE

I can avoid using the join in this case as I don't think the first function is actually ever used. This is because group_by variables are not included as variables to summarise when using summarise_all.

Is the real use case more complex though? If so, I don't know of an 'off-the-shelf' tidyverse solution for this. One workaround I've been using is to replace the use of group_by with nest and work on the nested list column.

For example, to avoid using joins you could nest everything except the groups you'd like to summarise and then bind columns for each type of summary needed. E.g.

df3 <- iris %>% 
  as_tibble() %>% 
  nest(-Species, -year) %>% 
  mutate(data = map(data, ~ bind_cols(
      color = paste(.$color, collapse = "; "),
      summarise_if(., is.numeric, funs(mean(., na.rm = TRUE)))))) %>% 
  unnest(data)

identical(ungroup(df1), df3)
#> [1] TRUE

The trick here is each type of summary either needs to return a named single value or a one row tibble otherwise bind_cols will complain. This approach avoids using a join and gives access to variable names if needed. You can also include as many summarise_* calls as you like. But TBH, I don't find this as readable as your original code using a join.

Jenny Bryan's row-oriented workflows repository might also be another place to look - there could be some unearthed pearls of wisdom in there related to this...

bhive01 · June 25, 2018, 4:27pm

@markdly, you're absolutely right for the example I used I could group by the variables that I just want first(.) from. I was overthinking it in this case.
I guess more generally if you wanted to summarise using different functions for different sets of columns then nesting and supplying functions for each column would work. It's much more expressive than I had hoped for. Perhaps some further magic with map could be done to simply supply the function name per column. I'll have to think on it some more. I do this often enough that I should make it easier on myself by functionalizing it.