These operations are all well optimised in the latest dplyr; it's very competitive with data.table out of the box (it seems to me).
library(tidyverse)
library(dtplyr)
library(microbenchmark)
library(data.table)
library(waldo)
# build a ~1M-row data frame by resampling mtcars
bigdata <- sample_n(mtcars,
  size = 1 * 10^6,
  replace = TRUE
)
microbenchmark(
  # d1: old-style dplyr, summarise_at()
  d1 = d1 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ),
  # d2: the same pipeline translated to data.table by dtplyr
  d2 = d2 <- bigdata %>%
    lazy_dt() %>%
    group_by(cyl, gear) %>%
    summarise_at(
      .vars = c("disp", "hp", "drat"),
      .funs = ~ mean(. - lag(.), na.rm = TRUE)
    ) %>%
    as_tibble(),
  # d3: hand-written data.table (still calling dplyr::lag)
  d3 = d3 <- as.data.table(bigdata)[, .(
    disp = mean(disp - lag(disp), na.rm = TRUE),
    hp = mean(hp - lag(hp), na.rm = TRUE),
    drat = mean(drat - lag(drat), na.rm = TRUE)
  ), keyby = .(cyl, gear)],
  # d4: new-style dplyr, summarise(across())
  d4 = d4 <- bigdata %>%
    group_by(cyl, gear) %>%
    summarise(across(c("disp", "hp", "drat"),
      .fns = ~ mean(. - lag(.), na.rm = TRUE)
    ), .groups = "keep"),
  times = 50L
)
# confirm all four approaches return the same result
waldo::compare(as.data.frame(d1), as.data.frame(d2))
waldo::compare(as.data.frame(d2), as.data.frame(d3))
waldo::compare(as.data.frame(d3), as.data.frame(d4))
Results on my rig:
Unit: milliseconds
expr min lq mean median uq max neval cld
d1 107.4226 161.2159 200.7390 179.8761 213.4728 452.6526 50 a
d2 160.4514 219.1882 272.6245 259.6064 303.4121 473.9120 50 b
d3 157.8407 213.4066 257.6683 230.4772 277.6532 460.3204 50 b
d4 108.0887 153.7697 196.4661 175.0069 225.6072 347.4699 50 a
So the dplyr code, both the old-style summarise_at() and the new-style summarise(across()), is competitive with data.table (unless the data.table code generated by dtplyr is a particularly poor translation, which it might be, I don't know).
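You can inspect the data.table translation dtplyr actually generates by calling show_query() on the lazy pipeline instead of collecting it. A quick sketch (the exact code printed will depend on your dtplyr version):

bigdata %>%
  lazy_dt() %>%
  group_by(cyl, gear) %>%
  summarise_at(
    .vars = c("disp", "hp", "drat"),
    .funs = ~ mean(. - lag(.), na.rm = TRUE)
  ) %>%
  show_query()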
This example is slow for, I think, two reasons:
- the data is relatively large, and splitting it into processable groups is simply expensive on conventional hardware
- dplyr::lag() is a somewhat expensive low-level function. Of course it isn't summarising like mean(); it generates a lot of data (the original data offset by one), so there is probably a memory cost, or overhead in dplyr to make it generalise. Comparing it to a clumsy brute-force lag shows this:
microbenchmark(
  as.integer(lag(iris$Species)),
  c(NA_integer_, iris$Species[-150])
)
waldo::compare(
  as.integer(lag(iris$Species)),
  c(NA_integer_, iris$Species[-150])
)
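For what it's worth, a fairer lag for the hand-written data.table variant would be data.table's own shift(), its idiomatic lag. A sketch of how you could compare the two inside the same grouped summary (I haven't folded this into the benchmark above, so treat the relative timings as an open question):

dt <- as.data.table(bigdata)
microbenchmark(
  # current d3 style: dplyr::lag inside data.table's j
  dplyr_lag = dt[, .(disp = mean(disp - lag(disp), na.rm = TRUE)),
                 keyby = .(cyl, gear)],
  # idiomatic data.table: shift() with type = "lag" (the default)
  dt_shift = dt[, .(disp = mean(disp - shift(disp, type = "lag"), na.rm = TRUE)),
                keyby = .(cyl, gear)],
  times = 50L
)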