Sometimes I want to apply a function to a dataframe that hugely increases the number of rows, which are then immediately reduced by another operation (aggregation, filtering, etc.).
Often this situation arises when I'm trying to keep my data pipeline tidy, rather than using a wide format.
I've come to use a workaround: splitting the dataframe into chunks and applying the function to each chunk individually. However, this makes the code harder to read and overall feels dirty.
Has anyone come up with a better solution?
Slightly contrived reprex to show what I mean:
suppressPackageStartupMessages(library(tidyverse))
# Generate some data
n <- 500
data1 <- tibble(x = 1:n, f = runif(n))
reps <- 1000
data2 <- tibble(x = rep(1:n, reps), y = runif(n * reps))
# What I want to do:
# However, this can take a long time when both `n` and `reps` are large
# Or worse, it can use up all my available RAM during the join
data1 %>%
  left_join(data2, by = "x") %>%
  group_by(x) %>%
  summarise(z = sum(f * x))
#> # A tibble: 500 x 2
#>        x     z
#>    <int> <dbl>
#>  1     1  918.
#>  2     2  502.
#>  3     3 2726.
#>  4     4 3087.
#>  5     5  775.
#>  6     6 2440.
#>  7     7 5605.
#>  8     8 5928.
#>  9     9 1802.
#> 10    10 6045.
#> # ... with 490 more rows
# My workaround: Split data into groups and apply to groups separately,
# before combining again
n_groups <- 100
split_data <- split(data1, sample.int(n_groups, n, replace = TRUE))
map_dfr(
  split_data,
  . %>%
    left_join(data2, by = "x") %>%
    group_by(x) %>%
    summarise(z = sum(f * x))
)
#> # A tibble: 500 x 2
#>        x       z
#>    <int>   <dbl>
#>  1    26  10721.
#>  2    47  32598.
#>  3   193  27723.
#>  4    36   6901.
#>  5    66  22052.
#>  6   216 169838.
#>  7   367 292813.
#>  8   376  93363.
#>  9   427 113632.
#> 10    64  31491.
#> # ... with 490 more rows
In general, you may find data.table faster when working with data sets that have large numbers of groups. In the majority of scenarios I have encountered, the execution speed of data.table is not worth forgoing the other benefits of using the tidyverse. But in cases where the execution speed is worth pursuing, you can use data.table's `[` operator to join the tables:
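A minimal sketch of that approach, assuming the `data1` and `data2` tibbles from the reprex above:

library(data.table)

# Convert the tibbles and key both tables on the join column
d1 <- as.data.table(data1)
d2 <- as.data.table(data2)
setkey(d1, x)
setkey(d2, x)

# The join and the grouped aggregation happen in a single `[` call:
# d2's rows are matched to d1 on the key, and `j` is evaluated per group of `x`
d1[d2, .(z = sum(f * x)), by = x]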
Thanks, Mara. This looks really interesting (and some R/tidyverse history thrown in as well!). Seems like `split %>% map_dfr` is the way to go. It gets a bit messier if you want to add a progress bar, but I believe that's being worked on(?).
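In the meantime, here is a sketch of the messier version with a progress bar, assuming the progress package rather than anything built into purrr:

library(progress)

# One tick per chunk; assumes `split_data` and `data2` from the reprex above
pb <- progress_bar$new(total = length(split_data))

map_dfr(
  split_data,
  function(chunk) {
    pb$tick()
    chunk %>%
      left_join(data2, by = "x") %>%
      group_by(x) %>%
      summarise(z = sum(f * x))
  }
)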
Yes, I've looked at data.table before and come to the same conclusion as you. I think the design choices basically just aren't for me. Perhaps it's worth taking another look. Thanks.
Just using data.table for the join and then applying dplyr syntax is a bit strange in my eyes, particularly as it's the `group_by` operation that is very slow for large numbers of groups in dplyr.
The data.table code (assuming both tables are keyed on `x`, as above) would simply be: `d1[d2, .(z = sum(f * x)), x]`