These operations are all well optimised in the latest dplyr; it's very competitive with data.table out of the box (it seems to me).
library(tidyverse)
library(dtplyr)
library(microbenchmark)
library(data.table)
library(waldo)
# build a ~1M-row data frame by resampling mtcars
bigdata <- sample_n(mtcars,
  size = 1 * 10^6,
  replace = TRUE
)
microbenchmark(
  # d1: old-style dplyr, summarise_at()
  d1 = d1 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ),
  # d2: the same pipeline translated to data.table by dtplyr
  d2 = d2 <- bigdata %>%
    lazy_dt() %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ) %>%
    as_tibble(),
  # d3: hand-written data.table (still calling dplyr::lag)
  d3 = d3 <- as.data.table(bigdata)[, .(
    disp = mean(disp - lag(disp), na.rm = TRUE),
    hp = mean(hp - lag(hp), na.rm = TRUE),
    drat = mean(drat - lag(drat), na.rm = TRUE)
  ), keyby = .(cyl, gear)],
  # d4: new-style dplyr, summarise(across())
  d4 = d4 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise(across(c("disp", "hp", "drat"),
      .fns = ~ mean(. - lag(.), na.rm = TRUE)
    ), .groups = "keep"),
  times = 50L
)
# confirm all four approaches return the same result
waldo::compare(as.data.frame(d1), as.data.frame(d2))
waldo::compare(as.data.frame(d2), as.data.frame(d3))
waldo::compare(as.data.frame(d3), as.data.frame(d4))
Results on my rig:
Unit: milliseconds
expr min lq mean median uq max neval cld
d1 107.4226 161.2159 200.7390 179.8761 213.4728 452.6526 50 a
d2 160.4514 219.1882 272.6245 259.6064 303.4121 473.9120 50 b
d3 157.8407 213.4066 257.6683 230.4772 277.6532 460.3204 50 b
d4 108.0887 153.7697 196.4661 175.0069 225.6072 347.4699 50 a
So the dplyr code, both the old-style summarise_at() and the new-style summarise(across()), is competitive with data.table (unless the data.table code generated by dtplyr is a particularly poor translation, which it might be, I don't know).
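You can inspect the data.table translation dtplyr actually generates by calling show_query() on the lazy pipeline instead of collecting it. A quick sketch (the exact code printed will depend on your dtplyr version):

bigdata %>%
  lazy_dt() %>%
  group_by(cyl, gear) %>%
  summarise_at(
    .vars = c("disp", "hp", "drat"),
    .funs = ~ mean(. - lag(.), na.rm = TRUE)
  ) %>%
  show_query()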
This example is slow for, I think, two reasons:
- the data is relatively large, and splitting it into processable groups is simply expensive on conventional hardware
- dplyr::lag() is a somewhat expensive low-level function. Of course it isn't summarising like mean(); it generates a lot of data (the original data offset by one), so there is probably a memory cost, or overhead in dplyr to make it generalise. Comparing it to a clumsy brute-force lag shows this:
microbenchmark(
  as.integer(lag(iris$Species)),
  c(NA_integer_, iris$Species[-150])
)
waldo::compare(
  as.integer(lag(iris$Species)),
  c(NA_integer_, iris$Species[-150])
)
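For what it's worth, a fairer lag for the hand-written data.table variant would be data.table's own shift(), its idiomatic lag. A sketch of how you could compare the two inside the same grouped summary (I haven't folded this into the benchmark above, so treat the relative timings as an open question):

dt <- as.data.table(bigdata)
microbenchmark(
  # current d3 style: dplyr::lag inside data.table's j
  dplyr_lag = dt[, .(disp = mean(disp - lag(disp), na.rm = TRUE)),
                 keyby = .(cyl, gear)],
  # idiomatic data.table: shift() with type = "lag" (the default)
  dt_shift = dt[, .(disp = mean(disp - shift(disp, type = "lag"), na.rm = TRUE)),
                keyby = .(cyl, gear)],
  times = 50L
)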