Using list columns in data.table

nwerth · January 2, 2019, 3:55pm

Note: The benchmarking below doesn't register positive execution times for ~75% of runs. If this is because I'm doing it wrong, please correct me!

tl;dr: No real difference. Toy examples are poor choices for comparing efficient packages which focus on different use-cases.

Equivalent code for dplyr and data.table:

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(microbenchmark)

data("diamonds", package = "ggplot2")
diamond_dt <- as.data.table(diamonds)

# dplyr/tidyr version: nest by color, then add model, graph, and R-squared list columns
dplyr_process <- expression({
  result <- diamonds %>%
    group_by(color) %>%
    nest() %>%
    mutate(
      model = lapply(
        X       = data,
        FUN     = lm,
        formula = carat ~ cut + depth
      ),
      graph = lapply(
        X   = data,
        FUN = function(d) {
          ggplot(d, aes(cut, clarity, fill = price)) + geom_tile()
        }
      ),
      rsquare = vapply(
        X         = model,
        FUN       = function(m) summary(m)[["r.squared"]],
        FUN.VALUE = numeric(1)
      )
    )
})

# data.table version: build the list columns per color group, then add R-squared by reference
dt_process <- expression({
  result <- diamond_dt[
    ,
    list(
      data  = list(.SD),
      model = list(lm(carat ~ cut + depth, data = .SD)),
      graph = list(
        ggplot(.SD, aes(cut, clarity, fill = price)) + geom_tile()
      )
    ),
    by = color
  ][
    ,
    rsquare := vapply(
      X         = model,
      FUN       = function(m) summary(m)[["r.squared"]],
      FUN.VALUE = numeric(1)
    )
  ]
})
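
To see the list-column pattern on something small enough to inspect by hand, here is a minimal sketch using the built-in mtcars data instead of diamonds. The choice of mtcars, the mpg ~ wt model, and the names cars_dt and nested are mine, purely for illustration:

```r
library(data.table)

# Hypothetical small example: mtcars stands in for diamonds.
cars_dt <- as.data.table(mtcars)

# One row per cyl group; wrapping each result in list() stores it as a list-column cell.
nested <- cars_dt[
  ,
  list(
    data  = list(.SD),
    model = list(lm(mpg ~ wt, data = .SD))
  ),
  by = cyl
]

# Add a plain numeric column computed from the list column of models.
nested[
  ,
  rsquare := vapply(
    X         = model,
    FUN       = function(m) summary(m)[["r.squared"]],
    FUN.VALUE = numeric(1)
  )
]

print(nested[, list(cyl, rsquare)])
```

Each cell of data holds a data.table and each cell of model holds a fitted lm object, so nested$model[[1]] can be inspected like any other model.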

Benchmarking the run times:

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0  0.15      0  0    1   100
#  data.table   0  0 21.33      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 155 evaluations.

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 14.23      0  0 1411   100
#  data.table   0  0  3.64      0  0  354   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 163 evaluations.

In both cases, the execution time for over 3/4 of the runs could not be measured. The times were too short. Even if this is more a "problem" with my laptop and operating system, I'd say nanosecond-level differences are likely unimportant.
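
On the warning itself: a likely cause, assuming I'm reading microbenchmark's behavior correctly, is that it captures each argument unevaluated. Passing the name of a stored expression object would then only time looking that object up, not running the code inside it. Wrapping each one in eval() should time the real work. A minimal sketch of the difference, where slow_step is a made-up stand-in for dplyr_process and dt_process:

```r
# Hypothetical stand-in for the stored expression objects above.
slow_step <- expression({
  total <- 0
  for (i in seq_len(1e5)) total <- total + i
  total
})

# Referring to the object by name only retrieves the unevaluated expression.
looked_up <- slow_step
print(is.expression(looked_up))

# eval() actually runs the code stored in the expression.
value <- eval(slow_step)
print(value)

# If microbenchmark is installed, compare timing the lookup with timing the run.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    lookup = slow_step,
    run    = eval(slow_step),
    times  = 10L
  ))
}
```

For the benchmark in this post, that would mean calling microbenchmark(dplyr = eval(dplyr_process), data.table = eval(dt_process), times = 10), which I would expect to report times far above the nanosecond range. I have not rerun it, so the numbers shown here are left as originally recorded.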

The max times mostly reflect whatever other processes I had running, and the means inherit that noise. The quartiles are the more reliable measures.

If your data is often the size of diamonds (53,940 rows), then any problems in run time are unlikely to be affected by whether you use dplyr or data.table. But we can simulate much larger datasets by making a bigger version of diamonds.

set.seed(100)
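# Repeat every column 100 times, then resample each column independently
diamonds <- diamonds %>%
  lapply(FUN = rep, times = 100) %>%
  as_tibble() %>%  # as_data_frame() is deprecated in newer tibble releases
  mutate_all(sample, replace = TRUE)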

diamond_dt <- as.data.table(diamonds)

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 31.86      0  0 2116   100
#  data.table   0  0 21.34      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 151 evaluations.

With 5,394,000 rows, the 1st, 2nd, and 3rd quartiles are still 0. Unless you're doing a lot of munging, I don't think it matters what you use. And even if you do a lot of munging, the best choice is whatever's easiest to reason about and maintain.

Unless you hit RAM limits for copying data. Then try data.table.
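
The reason data.table can help there, as I understand it, is that := adds or changes columns by reference instead of copying the whole table. A small sketch on a toy table (the names dt, x, and y are placeholders) using data.table's address() helper to compare the table's memory address before and after the update:

```r
library(data.table)

# Toy table; the column names are placeholders.
dt <- data.table(x = 1:5)

before <- address(dt)   # memory address of the table object
dt[, y := x * 2L]       # add a column by reference, no reassignment needed
after <- address(dt)

# Compare the addresses to check whether the table was reallocated.
print(identical(before, after))
print(dt)
```

Note that dt was never reassigned with <-, yet it now has the y column. With a large table, that is where the memory savings would come from.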

Session info:

sessionInfo()
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_United States.1252 
# [2] LC_CTYPE=English_United States.1252   
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C                          
# [5] LC_TIME=English_United States.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
# [1] bindrcpp_0.2.2       microbenchmark_1.4-4 ggplot2_3.0.0       
# [4] tidyr_0.8.1          dplyr_0.7.6          data.table_1.11.4   
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.17     bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4
#  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.2.0     
#  [9] plyr_1.8.4       tools_3.5.1      grid_3.5.1       gtable_0.2.0    
# [13] withr_2.1.2      yaml_2.1.19      lazyeval_0.2.1   assertthat_0.2.0
# [17] tibble_1.4.2     purrr_0.2.5      glue_1.2.0       compiler_3.5.1  
# [21] pillar_1.2.2     scales_0.5.0     pkgconfig_2.0.1