group_max performance

Dong · April 6, 2019, 12:12am

I came up with a solution using group_map for my question on grouped caculation. I understand that there must be some overhead in splitting the data frame, but the performance is much worse (see below).

Any idea to improve the performance? The basic situation is that the calculations are totally independent between groups, but each group need to be fed some arguments (all_max here).

library(tidyverse)

times <- 1e5
cols <- 4
df3 <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols)) %>% rename(A = V1, B = V2, C = V3, X = V4)

df3 <- cbind(grp = rep(seq_len(1e3), each = 100), df3) %>% 
   group_by(grp)

system.time(
  df3 %>% 
    group_map(~
    { 
      all_max <- summarise_at(., vars(A:C), max) %>% mutate(X = rowMeans(.))
      map2_df(., all_max, ~match(TRUE, .x < 0.5 * .y))
    }
    )
)
#>    user  system elapsed 
#>    3.87    0.00    3.98

system.time(
  df3 %>% summarise_at(vars(A:C), max) %>% mutate(X = rowMeans(.))
)
#>    user  system elapsed 
#>    0.02    0.00    0.01

^{Created on 2019-04-05 by the reprex package (v0.2.1)}

Col · April 7, 2019, 11:17am

At first I thought it wasn't apples-to-apples as the group_map includes the other part of your original question. However, even if you take out the map2_df it's almost just as slow.

The problem seems to be split up as:

Get summary table
Determine first position of group less than the summary value of that column.

#Step 0: generate data

library(tidyverse)

times <- 1e5
cols <- 4

#changed rnorm average to 0 so some values are less than 0
df3 <- as.data.frame(x = matrix(rnorm(times * cols), ncol = cols)) %>% rename(A = V1, B = V2, C = V3, X = V4)
df3 <- cbind(grp = rep(seq_len(1e3), each = 100), df3)

#Step 1: general summary table

 all_max <- df3 %>% group_by(grp) %>%
   summarise_at(vars(A:C), max) %>% 
   mutate(X = rowMeans(.))

#Step 2: Compare

inner_join(df3, all_max, by = 'grp')%>%
  group_by(grp)%>%
  # transmute_at(paste0(letter_grps, '.x'), funs(0.5 * !!str_replace(names(.), 'x', 'y') - .))
  transmute(A = 0.5* A.y - A.x  ,
            B = 0.5* B.y - B.x  ,
            C = 0.5* C.y - C.x  ,
            X = 0.5* X.y - X.x )%>%
  summarize_at(vars(A:X), ~match(TRUE, . < 0))

The system time on my machine is 0.47 seconds for the join conditions. To make the lookup table, 0 seconds.

I think you could make it faster with data.table.

Dong · April 9, 2019, 7:15am

Thanks. Yes, it is a clear improvement. data.table is even faster using the same approach (SO answer)

system.time(
   df3 %>% summarise_at(vars(A:C), max) %>% mutate(X = rowMeans(.)) %>% 
     left_join(df3, by = "grp" ) %>% 
     group_by(grp) %>% 
     transmute(A = 0.5 * A.x - A.y,
               B = 0.5 * B.x - B.y,
               C = 0.5 * C.x - C.y,
               X = 0.5 * X.x - X.y ) %>%
     summarise_all(~ match(TRUE, .>=0 ))

#>   user  system elapsed 
#>   1.49    0.19    1.69

Col · April 9, 2019, 10:05am

That solution may have issues. Since you lose the grouping after the summarize_at() function, the rowMeans(.) includes the grp in the calculation.

If your grp is a character, you would be good to go. Otherwise, grp is included in the calculation. This is one way to fix it:

df3%>%
    group_by(grp) %>%
    summarise_at(vars(A:C), max) %>%
    mutate(X = rowMeans(.[, 2:4]))

I noticed then when trying to implement a data.table solution. It's about 3 times faster.

library(tidyverse)
library(data.table)

# Simulation setup --------------------------------------------------------

n_per_grp <- 100
grps <- 1000
cols <- 4

# Generate data -----------------------------------------------------------

DT <- as.data.table(matrix(rnorm(n_per_grp * grps * cols, 5), ncol = cols))
setnames(DT, c('A', 'B', 'C', 'X'))
DT[, grp := rep(seq_len(grps), each = 100)]

tib <- as_tibble(DT)

# data.Table Solution --------------------------------------------------------------

DT_Way <- function (DT1) {
  # Make Summary Table ------------------------------------------------------
  
  sum_DT <- DT1[ , lapply(.SD, max), .SDcols = c('A','B','C'), by = grp]
  sum_DT[ , X := rowMeans(.SD), by = grp]
  
  # Join back to original DT ------------------------------------------------
  
  DT1[sum_DT,on = 'grp', by=.EACHI,
      j = lapply(setNames(c('A','B','C','X'), c('A','B','C','X')),
                 function(x) get(x) - 0.5 * get(paste0("i.", x)))
      ][, lapply(.SD, function(x) (match(TRUE, x <0))), by = grp]
}


# dplyr solution ----------------------------------------------------------

dplyr_way <- function (df3) {
  
  all_max <- df3%>%
    group_by(grp) %>%
    summarise_at(vars(A:C), max) %>%
    mutate(X = rowMeans(.[, -1]))
  
  left_join(df3, all_max, by = 'grp')%>%
    group_by(grp)%>%
    transmute(A = A.x - 0.5* A.y ,
              B = B.x - 0.5* B.y  ,
              C = C.x - 0.5* C.y   ,
              X = X.x - 0.5* X.y  )%>%
    summarize_at(vars(A:X), ~match(TRUE, . < 0))
}

dplyr_way(tib)
DT_Way (DT)


identical(as.data.frame(dplyr_way(tib)), as.data.frame(DT_Way(DT)))

library(microbenchmark)

microbenchmark(dplyr_way(tib), DT_Way(DT), times = 20)

Performance:

> microbenchmark(dplyr_way(tib), DT_Way(DT), times = 20)
Unit: milliseconds
           expr      min       lq     mean   median       uq     max neval
 dplyr_way(tib) 456.4055 458.7190 464.7088 459.2215 460.4009 536.389    20
     DT_Way(DT) 128.5559 129.7499 131.8996 131.0867 134.2225 136.661    20

Dong · April 9, 2019, 6:28pm

Thanks. This is close to what I need. I notice that combining the two steps (see below) can reduce memory allocation by half, which would help when dealing with really large data set.

  DT1[sum_DT,on = 'grp', by=.EACHI,
     j = lapply(setNames(c('A','B','C','X'), c('A','B','C','X')),
                function(x) match(TRUE, get(x) < 0.5 * get(paste0("i.", x))))]

One remaining question: I have indeterminate number of A:C type of columns, besides the "X and multiple grp type columns; do you have a way to do setNames for all of them except the grp columns?

Col · April 10, 2019, 10:33am

If your data is set up with 3 columns needing maxes and every 4th column taking the average of the 3 maxes before it, this gets close.

It just calls all the variables 'A1', 'B1', etc. and does everything else as before. The only thing I can't figure out is how to assign X1, X2, etc.

library(tidyverse)
library(data.table)

# Simulation setup --------------------------------------------------------

n_per_grp <- 100
grps <- 1000
cols <- 4
n_sets <- 2

# column name setup -------------------------------------------------------

A_B_C <- LETTERS[1:3]

fields <- c(A_B_C, 'X')

All_ABC <- paste0(rep(A_B_C, n_sets), rep(1:n_sets, each = length(A_B_C)))
All_X <- paste0(rep('X', n_sets), rep(1:n_sets, each = length('X')))
All_Letters <- paste0(rep(fields, n_sets), rep(1:n_sets, each = length(fields)))

# Generate data -----------------------------------------------------------

DT <- as.data.table(matrix(rnorm(n_per_grp * grps * cols* n_sets, 5), ncol = cols * n_sets))
setnames(DT, All_Letters)
DT[, grp := rep(seq_len(grps), each = 100)]


# Summarize data ----------------------------------------------------------

sum_DT <- DT[ , lapply(.SD, max), .SDcols = All_ABC, by = grp]

##This line needs modified to be more generalized
x <- list(1:3 + 1, 4:6 + 1)

##this works but doesn't assign back to the DT as X1, X2
lapply(x, function (y) rowMeans(sum_DT[, ..y]))

##This is what needs fixed

#sum_DT[ , lapply(x, function (y) rowMeans(sum_DT[, ..y])), by = grp]

# Join back to original DT ------------------------------------------------

DT[sum_DT,on = 'grp', by=.EACHI,
    j = lapply(setNames(All_Letters, All_Letters),
               function(x) match(TRUE, get(x) < 0.5 * get(paste0("i.", x))))]

See also:

Col · April 10, 2019, 11:11am

I figured it out! I assigned the X to a separate dt, set the names of the dt, and the combined it back to the sum_DT.

I also found a way to do the list with a long split assignment.

It seems to work when I change any of the variables at the top.

library(data.table)

# Simulation setup --------------------------------------------------------

n_per_grp <- 100
grps <- 1000
cols <- 4
n_sets <- 2

# column name setup -------------------------------------------------------

A_B_C <- LETTERS[1:3]

fields <- c(A_B_C, 'X')

All_ABC <- paste0(rep(A_B_C, n_sets), rep(1:n_sets, each = length(A_B_C)))
All_X <- paste0(rep('X', n_sets), rep(1:n_sets, each = length('X')))
All_Letters <- paste0(rep(fields, n_sets), rep(1:n_sets, each = length(fields)))

# Generate data -----------------------------------------------------------

DT <- as.data.table(matrix(rnorm(n_per_grp * grps * cols* n_sets, 5), ncol = cols * n_sets))
setnames(DT, All_Letters)
DT[, grp := rep(seq_len(grps), each = 100)]


# Summarize data ----------------------------------------------------------

sum_DT <- DT[ , lapply(.SD, max), .SDcols = All_ABC, by = grp]

x<- split((1:(length(A_B_C)*n_sets))+1, rep(1:n_sets, each = length(A_B_C)))

sum_X <- as.data.table(
  do.call(cbind,lapply(x, function (y) rowMeans(sum_DT[, ..y])))
)
setnames(sum_X, All_X)

sum_DT <- cbind(sum_DT, sum_X)

# Join back to original DT ------------------------------------------------

DT[sum_DT,on = 'grp', by=.EACHI,
    j = lapply(setNames(All_Letters, All_Letters),
               function(x) match(TRUE, get(x) < 0.5 * get(paste0("i.", x))))]

andresrcs · April 10, 2019, 12:56pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Dong · April 10, 2019, 6:07pm

Sorry, this is not what I need (thankfully). The case is that it could be "A, B, C, X", or "A, X", or "A, B, C, D, E, X". I just find a solution to this and will post.

Dong · April 10, 2019, 6:31pm

Below is my solution to accommodate indeterminate number of data columns. Many thanks to @Col 's suggestions for the join approach and the example of setName, the latest satisfies all my requirement. While it is not as easy to read as the group_map solution, the performance gain is worth it for me.

library(data.table)

set.seed(1)
times <- 1e5
cols <- 4
df3 <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols)) 

df3 <- cbind(grp = rep(seq_len(1e3), each = 100), df3) 

DT <- setDT(df3)

setnames(DT, c("V1", "V2", "V3", 'V4'), c("A", "B", "C", "X"))

DT_way2 <- function(DT1) {
  data_cols <- setdiff(colnames(DT1), c("grp"))
  normal_data_cols <- setdiff(data_cols, c("X"))
  sum_DT <- DT1[ , lapply(.SD, max), .SDcols = normal_data_cols, by = grp]
  sum_DT[ , X := rowMeans(.SD), by = grp]
  DT1[sum_DT,on = 'grp', by = .EACHI,
     j = lapply(setNames(data_cols, data_cols),
                function(x) match(TRUE, get(x) < 0.5 * get(paste0("i.", x))))]
}

DT_way2(DT)
#>        grp  A  B  C  X
#>    1:    1 14 16  9 11
#>    2:    2 34  4 21  1
#>    3:    3  5 11  8  3
#>    4:    4 24 17  1  3
#>    5:    5  3 16 12 10
#>   ---                 
#>  996:  996  2  9  5  5
#>  997:  997  6 21 18 17
#>  998:  998  3  3 13 19
#>  999:  999  1  4  5  9
#> 1000: 1000  1  2  2 13

Col · April 12, 2019, 10:47am

Here's more-or-less the same solution except this embraces the default naming of V1, V2, V3.... This will always keep the 'X' column as the right-most column.

At the end, it sets the column names to letters but it is optional and soft fails if there are more columns than 24 because 'X' gets repeated and then you run out of letters.

 library(data.table)

# Simulation setup --------------------------------------------------------

n_per_grp <- 100
grps <- 1000
cols <- 7

# Generate data -----------------------------------------------------------

DT <- as.data.table(matrix(rnorm(n_per_grp * grps * cols, 5), ncol = cols))
DT[, grp := rep(seq_len(grps), each = 100)]

# determine how many columns ----------------------------------------------

dt_cols <- ncol(DT)
dt_names <- paste0('V',1:(dt_cols-1)) #-1 because of 'grp'

# Make Summary Table ------------------------------------------------------
  
sum_DT <- DT[ , lapply(.SD, max), .SDcols = dt_names[1:(dt_cols-2)], by = grp]
sum_DT[ , paste0('V', dt_cols-1) := rowMeans(.SD), by = grp]

# Join back to original DT ------------------------------------------------
  
result_dt <- DT[sum_DT,on = 'grp', by=.EACHI,
    lapply(setNames(dt_names, dt_names),
           function(x) match(TRUE, get(x) < 0.5 * get(paste0("i.", x))))]

# Optional if you want to keep the naming conventions ---------------------

dt_letters <- c(LETTERS[1:(dt_cols - 2)],'X')
setnames(result_dt, c('grp', dt_letters))
result_dt

Results:

       grp  A  B  C  D  E  F  X
   1:    1  6  9  1  4 14  1  4
   2:    2  3  1  8 15 33 26 11
   3:    3  3  3  1  8  7 11 22
   4:    4  4 10  3  6 18 11 30
   5:    5  4  1 10  4  9  5 18
  ---                          
 996:  996  5  7  3 46  1 14 27
 997:  997 28  6  6 32  7  2  8
 998:  998  2 17 18 12 17 12 37
 999:  999  4  2  9  1 68  2  2
1000: 1000 14 25 40  4 11  2 10

system · April 19, 2019, 10:47am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.