dtplyr and summarise_at

I have a

df %>% group_by() %>% summarise_at()

that I've identified as the bottleneck in my code with the profiler.

I was able to get dtplyr to work by

df %>% lazy_dt() %>% group_by() %>% summarise_at() %>% as_tibble()

But I don't see any speed increase.

I noticed that summarise_at is not listed in the dtplyr reference. Is that the issue? Or maybe that my data frame isn't large enough to show a benefit? I don't have a lot of data.table experience to know. Many thanks!

I might start by looking at what query dtplyr generates, using show_query().
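For example, something like this (using mtcars as a stand-in for your df; the grouping and columns are just placeholders):

library(dplyr)
library(dtplyr)

# Print the data.table code dtplyr will generate, without executing it
mtcars %>%
  lazy_dt() %>%
  group_by(cyl) %>%
  summarise_at(vars(disp, hp), mean) %>%
  show_query()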

Is your data particularly large? Are your summarisation functions particularly compute intensive? What magnitude of performance issue are you dealing with, i.e. how slow is it, and do you have reason to expect it could/should be faster?

P.S. I believe dtplyr 1.2 was released today, so you might benefit from an upgrade if you don't have it yet.
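For instance, a quick check (sketch):

# Print the installed dtplyr version; reinstall from CRAN if it is older than 1.2.0
packageVersion("dtplyr")
if (packageVersion("dtplyr") < "1.2.0") install.packages("dtplyr")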

No, my data is pretty small (~100 rows) but there are tibble manipulations within a function that's called to produce an objective function for a nonlinear optimization.

That function ends up getting called hundreds of times to produce the gradient.
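To illustrate the calling pattern, a minimal sketch with made-up names (objective, grp, x are placeholders, not my real code):

library(dplyr)

calls <- 0
df <- tibble(grp = rep(letters[1:4], each = 25), x = rnorm(100))  # ~100 rows, like my data

# Hypothetical objective: a small group_by/summarise pipeline evaluated on every call
objective <- function(par) {
  calls <<- calls + 1
  res <- df %>%
    group_by(grp) %>%
    summarise(v = mean(x * par[1] + par[2]), .groups = "drop")
  sum(res$v^2)
}

optim(c(1, 1), objective, method = "BFGS")  # BFGS approximates the gradient numerically, calling objective() many times
calls                                       # hundreds of evaluations, so per-call overhead dominates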

I would therefore focus on whether that function could be optimised for speed.

That's why I'm trying to use dtplyr. The only optimization left is a group by, summarise.

Are you saying that the function you are summarising with is one you would consider optimal and can't be made more efficient, and that you are now going further down to the calling code that calls it repeatedly?
I suppose I'm just surprised.

I'm summarizing only by first() and last() and I have a mutate that only calls lag().
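Roughly this shape of pipeline (placeholder data and column names, not my actual data):

library(dplyr)

df <- tibble(grp = rep(letters[1:5], each = 20), x = cumsum(rnorm(100)))

df %>%
  group_by(grp) %>%
  mutate(dx = x - lag(x)) %>%   # the mutate that only calls lag()
  summarise(
    start = first(x),           # summarising with first() ...
    end   = last(x),            # ... and last()
    .groups = "drop"
  )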

These things are all well optimised in the latest dplyr; it's very competitive with data.table out of the box (it seems to me).

library(tidyverse)
library(dtplyr)
library(microbenchmark)
library(data.table)
library(waldo)

bigdata <- sample_n(mtcars,
  size = 1 * 10^6,
  replace = TRUE
)


microbenchmark(
  # d1: dplyr, old-style scoped verb summarise_at()
  d1 = d1 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ),
  # d2: same pipeline routed through dtplyr via lazy_dt()
  d2 = d2 <- bigdata %>%
    lazy_dt() %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ) %>%
    as_tibble(),
  # d3: hand-written data.table equivalent
  d3 = d3 <- as.data.table(bigdata)[, .(
    disp = mean(disp - lag(disp), na.rm = TRUE),
    hp = mean(hp - lag(hp), na.rm = TRUE),
    drat = mean(drat - lag(drat), na.rm = TRUE)
  ), keyby = .(cyl, gear)],
  # d4: dplyr, new-style summarise(across())
  d4 = d4 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise(
      across(c("disp", "hp", "drat"), .fns = ~ mean(. - lag(.), na.rm = TRUE)),
      .groups = "keep"
    ),
  times = 50L
)

waldo::compare(as.data.frame(d1), as.data.frame(d2))
waldo::compare(as.data.frame(d2), as.data.frame(d3))
waldo::compare(as.data.frame(d3), as.data.frame(d4))

results on my rig:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval cld
   d1 107.4226 161.2159 200.7390 179.8761 213.4728 452.6526    50  a 
   d2 160.4514 219.1882 272.6245 259.6064 303.4121 473.9120    50   b
   d3 157.8407 213.4066 257.6683 230.4772 277.6532 460.3204    50   b
   d4 108.0887 153.7697 196.4661 175.0069 225.6072 347.4699    50  a 

So the dplyr code, both the old-style summarise_at() and the new-style summarise(across()), is competitive with data.table (unless the DT code is particularly poorly generated by dtplyr, which it might be, I don't know).

I think this example is slow for two reasons:

  1. the data is relatively large, and splitting it into processable groups is simply expensive on conventional hardware
  2. dplyr::lag() is a somewhat expensive low-level function. Of course it's not summarising like mean(), but it generates a lot of data (the original data offset by one), so there's probably a memory cost, or overhead in dplyr to make it generalise. If we compare it to a clumsy brute-force lag we can see:
microbenchmark(as.integer(lag(iris$Species)), c(NA_integer_, iris$Species[-150]))
waldo::compare(as.integer(lag(iris$Species)), c(NA_integer_, iris$Species[-150]))
Unit: microseconds
                               expr  min    lq    mean median     uq   max neval cld
      as.integer(lag(iris$Species)) 80.4 96.65 127.546 116.75 139.25 289.7   100   b
 c(NA_integer_, iris$Species[-150])  6.7  8.20  12.731  10.05  12.90  59.1   100  a 

√ No differences
