summarise_at for grouped data: apply the same functions but different arguments to columns

Dong · March 31, 2019, 7:49am

Hard for me to generalize the questions. Below is an example: picking out the first row that is less than half of the column maxima for column A, B, C. The outlier is column X, where the maximum should be the average of the maxima of the normal columns (A, B, & C). The code below works for non-grouped data:

library(tidyverse)

df <- tribble(
  ~A, ~B, ~C, ~X,
  10, 12,  8,  5,
   8,  5,  7,  4,
   4,  2,  4,  3,
   2,  1,  2,  2,
   0,  0,  0,  1
  )

df %>% 
  {
    X_max <- summarise_at(., vars(A:C), max) %>% rowMeans()
    bind_cols(summarise_at(., vars(A:C), list(~match(TRUE, .< 0.5 * max(.)))),
             summarise_at(., vars(X), list(~match(TRUE, .< 0.5 * X_max)))
)}
#> # A tibble: 1 x 4
#>       A     B     C     X
#>   <int> <int> <int> <int>
#> 1     3     2     4     2

^{Created on 2019-03-30 by the reprex package (v0.2.1)}

The results for column A is 3 because A[3] = 4 < 0.5 * 10;
the results for column X is 2 because X[2] = 4 < mean(10, 12, 8) * 0.5.

My question is how to do the same for grouped data? For example df2 <- bind_rows(df, df +2, .id = "grp") %>% group_by(grp)

andresrcs · March 31, 2019, 9:11pm

This would work but for group relative positions

library(tidyverse)

df <- tribble(
    ~A, ~B, ~C, ~X,
    10, 12,  8,  5,
    8,  5,  7,  4,
    4,  2,  4,  3,
    2,  1,  2,  2,
    0,  0,  0,  1
)
df2 <- bind_rows(df, df + 2, .id = "grp")

df2 %>% 
    group_by(grp) %>% 
    mutate_at(vars(X), list(~. < 0.5 * mean(max(A), max(B), max(C)))) %>%
    mutate_at(vars(A, B, C), list(~. < 0.5 * max(.))) %>%
    summarise_at(vars(A,B,C,X), ~match(TRUE, .))
#> # A tibble: 2 x 5
#>   grp       A     B     C     X
#>   <chr> <int> <int> <int> <int>
#> 1 1         3     2     4     2
#> 2 2         4     3     4     3

^{Created on 2019-03-31 by the reprex package (v0.2.1.9000)}

And this is a little hacky but would work for absolute positions

df2 <- bind_rows(df, df + 2, df + 4, .id = "grp")

df2 %>% 
    group_by(grp) %>% 
    mutate_at(vars(X), list(~. < 0.5 * mean(max(A), max(B), max(C)))) %>%
    mutate_at(vars(A, B, C), list(~. < 0.5 * max(.))) %>%
    summarise_at(vars(A,B,C,X), ~match(TRUE, .)) %>% 
    cbind(df2 %>% 
              count(grp) %>% 
              mutate(n = cumsum(n), n = lag(n)) %>% 
              select(-grp)) %>%
    rowwise() %>% 
    mutate_at(vars(A,B,C,X), ~sum(c(.,n), na.rm = TRUE)) %>% 
    select(-n)
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 5
#>   grp       A     B     C     X
#>   <chr> <int> <int> <int> <int>
#> 1 1         3     2     4     2
#> 2 2         9     8     9     8
#> 3 3        14    13    15    14

Col · April 1, 2019, 11:05am

The results between the two are different (see second group). Is the hacky solution necessary? I really like the first solution.

I've had trouble wrapping my head around this one. Is there a way to have a lookup table that includes which functions each one is summarized by? Something like:

lookup_funs <- tribble(
  ~letter_grp, ~functions,
  'A',   'prod(max(A), 0.5)',
  'B',   'prod(max(B), 0.5)',
  'C',   'prod(max(C), 0.5)',
  'X',   'prod(mean(max(A),max(B),max(C)), 0.5)'
)

andresrcs · April 1, 2019, 11:23am

Yes, they are different, because the hacky solution is giving absolute positions (row numbers are relative to the whole table) instead of relative ones (row numbers restar for each category) , if you just need positions relative to the groups then you can use the first solution.

Col · April 1, 2019, 12:07pm

I misread the prompt. I kept reading the solutions as the raw values as opposed to the location within the matrix.

If you know that you would always have the same number of rows per grp, you could multiple the row_number() by 5. Still, I'd like it if the data were tidy because then you could do a join and then slice.

df2 %>% 
  group_by(grp) %>% 
  mutate_at(vars(X), list(~. < 0.5 * mean(max(A), max(B), max(C)))) %>%
  mutate_at(vars(A, B, C), list(~. < 0.5 * max(.))) %>%
  summarise_at(vars(A,B,C,X), ~match(TRUE,.))%>%
  mutate_at(vars(A,B,C,X), ~ . + (row_number()-1)*5 )

edit: formatting

Dong · April 2, 2019, 6:55am

Great answer. I don't need the absolute position. So the first one works.

However, what I am really after is the way to pass arguments (here the summary max of a grouped dataframe) to the same grouped data frame. The mutate_at basically decomposes the summarise function. Is there a way to keep the function self-contained with respect to a vector? It is also should be faster.

Now I think something like map2 (example) may be needed.

reddyr · April 2, 2019, 2:52pm

you can take advantage of do()

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.5.2
#> Warning: package 'ggplot2' was built under R version 3.5.2
library("reprex")
#> Warning: package 'reprex' was built under R version 3.5.3

df <- tribble(
  ~A, ~B, ~C, ~X,
  10, 12,  8,  5,
  8,  5,  7,  4,
  4,  2,  4,  3,
  2,  1,  2,  2,
  0,  0,  0,  1
)

df2 <- bind_rows(df, df +2, .id = "grp")

df2 %>% 
  group_by(grp) %>% 
  do(
    { X_max <- summarise_at(., vars(A:C), max) %>% rowMeans();
      bind_cols(summarise_at(., vars(A:C), list(~match(TRUE, .< 0.5 * max(.)))),
                summarise_at(., vars(X), list(~match(TRUE, .< 0.5 * X_max)))
      )
    }
  )
#> # A tibble: 2 x 5
#> # Groups:   grp [2]
#>   grp       A     B     C     X
#>   <chr> <int> <int> <int> <int>
#> 1 1         3     2     4     2
#> 2 2         4     3     4     3

Col · April 3, 2019, 1:30am

Here is my lookup table solution. It seemed like there was a lot of typing of A, B, C, and X, so if you needed to update it down the road, it'd be a pain.

Step 1. Make lookup table with summary functions

Step 2. Run the summary functions on the dataset

Step 3. Gather (i.e., unpivot) the data to long format

Step 4. Join, filter, and slice.

I added a lot of stuff to steps 3 and 4 since positions seemed of interest.

library(tidyverse)

df <- tribble(
  ~A, ~B, ~C, ~X,
  10, 12,  8,  5,
  8,  5,  7,  4,
  4,  2,  4,  3,
  2,  1,  2,  2,
  0,  0,  0,  1
)

df2 <- bind_rows(df, df + 2, df + 4, .id = "grp")

lookup_funs <- tribble(
  ~letter_grp, ~functions,
  'A',   'prod(max(A), 0.5)',
  'B',   'prod(max(B), 0.5)',
  'C',   'prod(max(C), 0.5)',
  'X',   'prod(mean(max(A),max(B),max(C)), 0.5)'
)

summary_data <- lookup_funs%>%
  group_by(letter_grp)%>%
  do(summarize(group_by(df2,grp), lookup_val = eval(parse(text = .$functions))))

df2%>%
  gather(key = 'letter_grp', value = 'value', A:X, -grp)%>%
  group_by(letter_grp)%>%
  mutate(absolute_position = row_number())%>%
  group_by(grp, letter_grp)%>%
  mutate(relative_position = row_number())%>%
  inner_join(summary_data)%>%
  filter(value < lookup_val)%>%
  slice(1)

# A tibble: 12 x 6
# Groups:   grp, letter_grp [12]
   grp   letter_grp value absolute_position relative_position lookup_val
   <chr> <chr>      <dbl>             <int>             <int>      <dbl>
 1 1     A              4                 3                 3          5
 2 1     B              5                 2                 2          6
 3 1     C              2                 4                 4          4
 4 1     X              4                 2                 2          5
 5 2     A              4                 9                 4          6
 6 2     B              4                 8                 3          7
 7 2     C              4                 9                 4          5
 8 2     X              5                 8                 3          6
 9 3     A              6                14                 4          7
10 3     B              6                13                 3          8
11 3     C              4                15                 5          6
12 3     X              6                14                 4          7

Dong · April 3, 2019, 11:45pm

Thx. do is nice. Thanks for sharing. I did some digging and comes up with the following using group_map & map2 (somehow do carries the grouping variable and does not work).

library(tidyverse)

df <- tribble(
  ~A, ~B, ~C, ~X,
  10, 12,  8,  5,
  8,  5,  7,  4,
  4,  2,  4,  3,
  2,  1,  2,  2,
  0,  0,  0,  1
)

df2 <- bind_rows(df, df +2, .id = "grp")

df2 %>% 
  group_by(grp) %>% 
  group_map(~
    { 
      all_max <- summarise_at(., vars(A:C), max) %>% mutate(X = rowMeans(.))
      map2_dfc(., all_max, ~match(TRUE, .x < 0.5 * .y))
    }
 )
#> # A tibble: 2 x 5
#> # Groups:   grp [2]
#>   grp       A     B     C     X
#>   <chr> <int> <int> <int> <int>
#> 1 1         3     2     4     2
#> 2 2         4     3     4     3

^{Created on 2019-04-03 by the reprex package (v0.2.1)}
I think this is pretty close to what I want in the OP. One remaining question, is there a way to pass arguments (here the max values) as a vector to corresponding column in summarise_all? It would be really convenient when two or more functions are applied.

Col · April 4, 2019, 12:11am

To plug my solution, this allows you to choose any function for each letter. I added spread() so that it matches the formatting you want at the end. The downside is that it adds a lot of extra typing for columns A:C because they are simply maximums.

library(tidyverse)

df <- tribble(
  ~A, ~B, ~C, ~X,
  10, 12,  8,  5,
  8,  5,  7,  4,
  4,  2,  4,  3,
  2,  1,  2,  2,
  0,  0,  0,  1
)

df2 <- bind_rows(df, df + 2, df + 4, .id = "grp")

lookup_funs <- tribble(
  ~letter_grp, ~functions,
  'A',   'prod(max(A), 0.5)',
  'B',   'prod(max(B), 0.5)',
  'C',   'prod(max(C), 0.5)',
  'X',   'prod(mean(max(A),max(B),max(C)), 0.5)'
)

summary_data <- lookup_funs%>%
  group_by(letter_grp)%>%
  do(summarize(group_by(df2,grp), lookup_val = eval(parse(text = .$functions))))

df2%>%
  gather(key = 'letter_grp', value = 'value', -grp)%>%
  inner_join(summary_data)%>%
  group_by(grp, letter_grp)%>%
  mutate(relative_position = row_number())%>%
  filter(value < lookup_val)%>%
  slice(1)%>%
  select(grp, letter_grp, relative_position)%>%
  spread(letter_grp, relative_position)

  grp       A     B     C     X
  <chr> <int> <int> <int> <int>
1 1         3     2     4     2
2 2         4     3     4     3
3 3         4     3     5     4

system · April 11, 2019, 12:12am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.