Note: The benchmarking below doesn't register positive execution times for ~75% of runs. If this is because I'm doing it wrong, please correct me!
tl;dr: No real difference. Toy examples are poor choices for comparing efficient packages which focus on different use-cases.
Equivalent code for dplyr and data.table:
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(microbenchmark)
data("diamonds", package = "ggplot2")
diamond_dt <- as.data.table(diamonds)
dplyr_process <- expression({
  result <- diamonds %>%
    group_by(color) %>%
    nest() %>%
    mutate(
      model = lapply(
        X = data,
        FUN = lm,
        formula = carat ~ cut + depth
      ),
      graph = lapply(
        X = data,
        FUN = function(d) {
          ggplot(d, aes(cut, clarity, fill = price)) + geom_tile()
        }
      ),
      rsquare = vapply(
        X = model,
        FUN = function(m) summary(m)[["r.squared"]],
        FUN.VALUE = numeric(1)
      )
    )
})
dt_process <- expression({
  result <- diamond_dt[
    ,
    list(
      data = list(.SD),
      model = list(lm(carat ~ cut + depth, data = .SD)),
      graph = list(
        ggplot(.SD, aes(cut, clarity, fill = price)) + geom_tile()
      )
    ),
    by = color
  ][
    ,
    rsquare := vapply(
      X = model,
      FUN = function(m) summary(m)[["r.squared"]],
      FUN.VALUE = numeric(1)
    )
  ]
})
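One way to check that the two styles really are equivalent is to compare the fitted R-squared values group by group. The following is a minimal sketch, not the code benchmarked here: it assumes the built-in mtcars data as a small stand-in for diamonds, an illustrative formula (mpg ~ wt), and a hypothetical helper named r2. The plotting step is left out to keep it short.

```r
library(data.table)
library(dplyr)
library(tidyr)

# Hypothetical helper: extract R-squared from a fitted lm object.
r2 <- function(m) summary(m)[["r.squared"]]

# dplyr/tidyr style: nest by group, fit one model per nested data frame.
dplyr_r2 <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    model = lapply(X = data, FUN = lm, formula = mpg ~ wt),
    rsquare = vapply(X = model, FUN = r2, FUN.VALUE = numeric(1))
  ) %>%
  ungroup() %>%
  arrange(cyl)

# data.table style: fit one model per group on .SD, then add rsquare.
dt_r2 <- as.data.table(mtcars)[
  ,
  list(model = list(lm(mpg ~ wt, data = .SD))),
  by = cyl
][
  ,
  rsquare := vapply(X = model, FUN = r2, FUN.VALUE = numeric(1))
][order(cyl)]

# Compare the per-group R-squared values from the two approaches.
same_fit <- all.equal(dplyr_r2$rsquare, dt_r2$rsquare)
same_fit
```

Both results are sorted by cyl before comparing, because the two approaches do not promise the same group order.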
Benchmarking the run times:
microbenchmark(
  dplyr = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0  0.15      0  0    1   100
#  data.table   0  0 21.33      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 155 evaluations.
microbenchmark(
  dplyr = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 14.23      0  0 1411   100
#  data.table   0  0  3.64      0  0  354   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 163 evaluations.
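In both cases, the execution time for over 3/4 of the runs could not be measured because the times were too short. One likely cause: dplyr_process and dt_process are stored expression objects, and microbenchmark() times the code it is handed, which here is just the name of each object, so the pipelines themselves may never run. Even if this is instead a "problem" with my laptop and operating system, I'd say nanosecond-level differences are likely unimportant.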
The max times mostly reflect whatever other processes I had running, and the means inherit that noise. The quartiles are more reliable.
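A possible explanation for the zero timings, offered as a sketch rather than a confirmed diagnosis of the runs above: microbenchmark() captures its arguments unevaluated, so passing the name of a stored expression object times only the lookup of that name. Wrapping the name in eval() runs the stored code. The example below uses a hypothetical stand-in, slow_process, that simply sleeps for 10 ms, so the difference is easy to see.

```r
library(microbenchmark)

# Hypothetical stand-in for the pipelines above: a stored expression
# that takes a measurable amount of time when it is actually run.
slow_process <- expression({
  Sys.sleep(0.01)
})

timings <- microbenchmark(
  name_only = slow_process,        # looks up the object; the code inside never runs
  evaluated = eval(slow_process),  # runs the stored code
  times = 10
)

# timings$time holds the raw measurements in nanoseconds.
print(timings)
```

If the same holds for the pipelines here, calling microbenchmark(dplyr = eval(dplyr_process), data.table = eval(dt_process)) should give non-zero timings. I have not rerun the benchmarks that way, so the numbers reported in this answer are unchanged.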
If your data is often the size of diamonds (53,940 rows), then any problems in run time are unlikely to be affected by whether you use dplyr or data.table. But we can simulate much larger datasets by making a bigger version of diamonds.
set.seed(100)
diamonds <- diamonds %>%
  lapply(FUN = rep, times = 100) %>%
  as_data_frame() %>%
  mutate_all(sample, replace = TRUE)
diamond_dt <- as.data.table(diamonds)
microbenchmark(
  dplyr = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 31.86      0  0 2116   100
#  data.table   0  0 21.34      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 151 evaluations.
With 5,394,000 rows, the 1st, 2nd, and 3rd quartiles are still 0. Unless you're doing a lot of munging, I don't think it matters what you use. And even if you do a lot of munging, the best choice is whatever's easiest to reason about and maintain.
Unless you hit RAM limits for copying data. Then try data.table.
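To illustrate the copying point, here is a minimal sketch of data.table adding a column by reference with `:=`. The small table and the carat_mg column are made up for illustration, and the check relies on data.table's address() helper. It shows only this one aspect of memory use, not a full comparison with dplyr.

```r
library(data.table)

# Illustrative table; the values are placeholders, not benchmark data.
dt <- data.table(id = 1:5, carat = c(0.23, 0.21, 0.23, 0.29, 0.31))

address_before <- address(dt)

# `:=` adds the column in place instead of building a modified copy.
dt[, carat_mg := carat * 200]  # 1 carat = 200 mg

address_after <- address(dt)

# TRUE when the table was updated in place rather than copied.
same_object <- identical(address_before, address_after)
same_object
```

With a table close to the size of available RAM, avoiding that extra copy can decide whether the operation fits in memory at all.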
Session info:
sessionInfo()
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United States.1252
# [2] LC_CTYPE=English_United States.1252
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United States.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods
# [7] base
#
# other attached packages:
# [1] bindrcpp_0.2.2 microbenchmark_1.4-4 ggplot2_3.0.0
# [4] tidyr_0.8.1 dplyr_0.7.6 data.table_1.11.4
#
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.17 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.4
# [5] munsell_0.4.3 colorspace_1.3-2 R6_2.2.2 rlang_0.2.0
# [9] plyr_1.8.4 tools_3.5.1 grid_3.5.1 gtable_0.2.0
# [13] withr_2.1.2 yaml_2.1.19 lazyeval_0.2.1 assertthat_0.2.0
# [17] tibble_1.4.2 purrr_0.2.5 glue_1.2.0 compiler_3.5.1
# [21] pillar_1.2.2 scales_0.5.0 pkgconfig_2.0.1