How to use map() with group_by()

Hi,

I need help using map() when I want to vary the group_by() columns across an existing list of data frames. How does map() work in such cases? Thanks!

# Currently adjusting the group_by parameters manually, as in the sample below
vol_fn <- function(df){
  df <- df %>%
    group_by("Column A", "Column B")%>%
    summarize(Vol = sum(Vol))

  return(df)
}

vol1 <- map(data, ~vol_fn(.))

# Looking for a solution where I can change the group_by options, e.g. as below, where I only need one grouping column
vol_fn <- function(df, x){
  df <- df %>%
    group_by(x)%>%
    summarize(Vol = sum(Vol))

  return(df)
}

vol2 <- map(data, ~vol_fn(., "Column A"))
# The attempt above with x and map() doesn't seem to work. What would be the right way to do it?
# Thanks!

First, I don't think your current code works: if you write "Column A" in quotes, group_by() treats it as a constant string, so you get a single group labelled "Column A" instead of grouping by the data frame column named `Column A`. Here is a reprex that I believe matches what you have:

library(tidyverse)

set.seed(1)
data <- list(
  tibble(`Column A` = letters[1:2] |> rep(3) |> rep(each = 2),
         `Column B` = LETTERS[1:3] |> rep(each = 4),
         Vol = rnorm(12)),
  tibble(`Column A` = letters[1:2] |> rep(3) |> rep(each = 2),
         `Column B` = LETTERS[1:3] |> rep(each = 4),
         Vol = rnorm(12))
)

# Currently adjusting group by parameters manually in below sample
vol_fn <- function(df){
  df <- df %>%
    group_by(`Column A`, `Column B`)%>%
    summarize(Vol = sum(Vol))
  
  return(df)
}

vol1 <- map(data, ~vol_fn(.))
#> `summarise()` has grouped output by 'Column A'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'Column A'. You can override using the
#> `.groups` argument.
vol1
#> [[1]]
#> # A tibble: 6 × 3
#> # Groups:   Column A [2]
#>   `Column A` `Column B`    Vol
#>   <chr>      <chr>       <dbl>
#> 1 a          A          -0.443
#> 2 a          B          -0.491
#> 3 a          C           0.270
#> 4 b          A           0.760
#> 5 b          B           1.23 
#> 6 b          C           1.90 
#> 
#> [[2]]
#> # A tibble: 6 × 3
#> # Groups:   Column A [2]
#>   `Column A` `Column B`    Vol
#>   <chr>      <chr>       <dbl>
#> 1 a          A          -2.84 
#> 2 a          B           0.928
#> 3 a          C           1.70 
#> 4 b          A           1.08 
#> 5 b          B           1.42 
#> 6 b          C          -1.91

Created on 2024-03-08 with reprex v2.0.2
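To see the quoted-string behaviour described above directly, here is a minimal sketch on made-up data (the `toy` tibble is hypothetical, not from the original post). With a string, every row falls into the same group; with backticks, the rows are grouped by the column's values:

```r
library(dplyr)
library(tibble)

# Made-up data for illustration only
toy <- tibble(`Column A` = c("a", "a", "b"),
              Vol = c(1, 2, 3))

# A quoted string is a constant, so all rows end up in one group
quoted <- toy %>%
  group_by("Column A") %>%
  summarize(Vol = sum(Vol))

# Backticks refer to the actual column, so there is one row per value
backticked <- toy %>%
  group_by(`Column A`) %>%
  summarize(Vol = sum(Vol))

nrow(quoted)      # one row holding the overall total
nrow(backticked)  # one row per distinct value of `Column A`
```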

Now, on to your question: have a look at the "Programming with dplyr" vignette. group_by() uses data masking, and here x is an env-variable that refers to a data-variable, so you need to embrace it with {{ }}:

vol_fn <- function(df, x){
  df <- df %>%
    group_by({{x}})%>%
    summarize(Vol = sum(Vol))
  
  return(df)
}

vol2 <- map(data, ~vol_fn(., `Column A`))
vol2

or to stick with quoted variables:

vol_fn <- function(df, x){
  df <- df %>%
    group_by(.data[[x]])%>%
    summarize(Vol = sum(Vol))
  
  return(df)
}

vol2 <- map(data, ~vol_fn(., "Column A"))

Finally, this works fine for one column, but you will run into a problem with two columns, e.g. if you want to call map(data, ~vol_fn(., `Column A`, `Column B`)). In that case you can use the dots (...):

vol_fn <- function(df, ...){
  df <- df %>%
    group_by(...)%>%
    summarize(Vol = sum(Vol))
  
  return(df)
}

vol2 <- map(data, ~vol_fn(., `Column A`, `Column B`))
vol2
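If you prefer to keep the quoted-name style and want several columns, one option is to pass a character vector and select the columns with across(all_of()). This is only a sketch: the helper name `vol_by_cols()` and the `toy` data are hypothetical, and I added `.groups = "drop"` to silence the summarise() message.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up stand-in for the list of data frames
toy <- list(
  tibble(`Column A` = c("a", "a", "b", "b"),
         `Column B` = c("A", "B", "A", "B"),
         Vol = c(1, 2, 3, 4)),
  tibble(`Column A` = c("a", "b", "a", "b"),
         `Column B` = c("A", "A", "B", "B"),
         Vol = c(10, 20, 30, 40))
)

# Hypothetical helper: `cols` is a character vector of column names
vol_by_cols <- function(df, cols){
  df %>%
    group_by(across(all_of(cols))) %>%
    summarize(Vol = sum(Vol), .groups = "drop")
}

# The same function handles one or several grouping columns
by_a  <- map(toy, ~ vol_by_cols(.x, "Column A"))
by_ab <- map(toy, ~ vol_by_cols(.x, c("Column A", "Column B")))
```

all_of() will throw an error if one of the names is not a column of the data frame, which is usually what you want when iterating over many data frames.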

Thanks @AlexisW !

I have never used a variable as an env-variable before. This link is helpful.
So, can "..." in place of "x" be applied to any number of columns, or would it only work for 2 columns as in this case?
Last, do you think map2 might be used here? I have always struggled to use map2.

Thanks!

Just try it! I expect it to work for any number of columns (when the ... gets unpacked, it is equivalent to writing several columns in the group_by()).
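A quick way to check this is the sketch below, on made-up data (`toy`, `g1`, `g2`, `g3` are hypothetical names). It uses the same `...` version of vol_fn(), with `.groups = "drop"` added so the result comes back ungrouped:

```r
library(dplyr)
library(tibble)

vol_fn <- function(df, ...){
  df %>%
    group_by(...) %>%
    summarize(Vol = sum(Vol), .groups = "drop")
}

# Made-up data with three candidate grouping columns
toy <- tibble(g1 = c("a", "a", "b", "b"),
              g2 = c("x", "y", "x", "y"),
              g3 = c("u", "u", "u", "v"),
              Vol = c(1, 2, 3, 4))

one   <- vol_fn(toy, g1)          # one grouping column
three <- vol_fn(toy, g1, g2, g3)  # three grouping columns
none  <- vol_fn(toy)              # no grouping column: a single overall total
```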

Not really: map2() takes two lists of the same length and processes them in parallel. If that helps, these are basically equivalent:

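mylist1 <- list("a", "b", "c")
mylist2 <- list(1, 2, 3)
myfunction <- function(x, y) paste(x, y)  # placeholder, any 2-argument function

# version with map2
res <- map2(mylist1, mylist2, myfunction)

# version with for loop
stopifnot( length(mylist1) == length(mylist2) )

res <- vector("list", length(mylist1))  # pre-allocate the output list
for(i in seq_along(mylist1)){
  res[[i]] <- myfunction( mylist1[[i]], mylist2[[i]] )
}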

Here I don't see you having two lists, so I'm not sure how map2() would help.
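For completeness, map2() would become useful if you also had a second list, for example a different grouping column for each data frame. A minimal sketch, assuming a hypothetical `group_cols` list and made-up `toy` data, with the quoted-name version of vol_fn():

```r
library(dplyr)
library(purrr)
library(tibble)

vol_fn <- function(df, x){
  df %>%
    group_by(.data[[x]]) %>%
    summarize(Vol = sum(Vol))
}

# Made-up stand-in for the list of data frames
toy <- list(
  tibble(`Column A` = c("a", "a", "b"),
         `Column B` = c("A", "B", "B"),
         Vol = c(1, 2, 3)),
  tibble(`Column A` = c("a", "b", "b"),
         `Column B` = c("A", "A", "B"),
         Vol = c(4, 5, 6))
)

# Hypothetical second list: one grouping column per data frame
group_cols <- list("Column A", "Column B")

# First data frame is grouped by `Column A`, second by `Column B`
res <- map2(toy, group_cols, vol_fn)
```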


Thank you @AlexisW !
Very Helpful!

I'd like to draw attention to the currently experimental functions group_modify() and group_walk() for this kind of task, especially if you don't need a reusable function defined outside of the pipe.

I recently had an application where I needed to group on a certain variable and iterate over those groups in specific order, where each group was further (sub)grouped and some mutate() function calls were performed. Group IDs were being utilized and doing this all within the pipe made keeping track of counters and names very simple.

The sequence of group_by > group_modify( group_by > manipulation ) > ungroup worked very well and was very readable in this context!
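As a rough illustration of that sequence, here is a sketch on made-up data (the `toy` tibble is hypothetical, and the inner step is just a sum rather than the mutate() calls from my application):

```r
library(dplyr)
library(tibble)

# Made-up data for illustration only
toy <- tibble(`Column A` = c("a", "a", "a", "b", "b"),
              `Column B` = c("A", "A", "B", "A", "B"),
              Vol = c(1, 2, 3, 4, 5))

res <- toy %>%
  group_by(`Column A`) %>%
  # .x is the data for one `Column A` group, without the grouping column
  group_modify(~ .x %>%
                 group_by(`Column B`) %>%
                 summarize(Vol = sum(Vol), .groups = "drop")) %>%
  ungroup()
```

The function passed to group_modify() has to return a data frame, and dplyr adds the outer grouping column back onto the result.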

