I have a large dataset (~6.2 million observations) which I need to condense down by calculating the mean and several quantiles for ~25k groups.
The original dplyr code looks like this and takes about 25-30 seconds to run. (I have to run this code 400+ times, so yeah, 30 seconds is a big deal.)
edf_data = values_data %>%
  select(region, year, value) %>%
  group_by(region, year) %>%
  summarise(mean = mean(value),
            q95 = quantile(value, 0.95),
            q90 = quantile(value, 0.90),
            q75 = quantile(value, 0.75),
            q66 = quantile(value, 0.66),
            q50 = quantile(value, 0.50),
            q33 = quantile(value, 0.33),
            q25 = quantile(value, 0.25),
            q10 = quantile(value, 0.10),
            q5 = quantile(value, 0.05))
My plan was to use furrr to parallelize this call, but it actually takes longer.
I first tested just setting up the calculation with purrr and computing only the mean, which took about the same time to run as calculating just the mean in a normal summarise call (roughly 2 seconds).
edf_data = values_data %>%
  select(region, year, value) %>%
  group_nest(region, year) %>%
  mutate(mean = map_dbl(data, ~mean(.x$value)))
But when I changed map_dbl to future_map_dbl, it took over a minute, which is longer than the original code.
library(furrr)
plan(multisession)

edf_data = values_data %>%
  select(region, year, value) %>%
  group_nest(region, year) %>%
  mutate(mean = future_map_dbl(data, ~mean(.x$value)))
I am guessing that my code was slowed down by having to shunt ~6 million rows out to the different worker processes, but I don't know how to fix it. Does anyone have better ideas for implementing furrr to solve a summarise problem, or for how to parallelize this code in general?
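For concreteness, this is the kind of chunk-level approach I have been wondering about: split by region so each worker receives a handful of large blocks instead of ~25k tiny data frames, then run the normal summarise inside each chunk. This is only a rough, untested sketch (I only show the mean and two of the quantiles; the rest would follow the same pattern), and I don't know whether it actually avoids the data-transfer overhead:

library(dplyr)
library(furrr)
plan(multisession)

edf_data = values_data %>%
  select(region, year, value) %>%
  # one chunk per region, so each worker gets a few large data frames
  # instead of ~25k tiny ones
  split(.$region) %>%
  future_map_dfr(~ .x %>%
                   group_by(region, year) %>%
                   summarise(mean = mean(value),
                             q95 = quantile(value, 0.95),
                             # ...remaining quantiles as in the original...
                             q5 = quantile(value, 0.05),
                             .groups = "drop"))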
Abstractly, I recognize it would be great if I could get both the calculations within a group and the calculations between groups running in parallel, since they are completely independent of each other, but I have no clue how to implement that. Also, I like my original dplyr setup because the output is very clean. I have seen examples of parallelization where you have to reformat the output and combine it back into the main dataset, which increases the chances of bugs.
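To make the within-group part and the "clean output" part concrete: ideally the result stays one row per region/year with one column per statistic, exactly like the original summarise gives me, and within a group the nine separate quantile() calls would at least collapse into one. Something along these lines is a rough sketch of that (assuming dplyr >= 1.0, where a one-row tibble returned inside summarise() is unpacked into columns, plus the tibble package for as_tibble_row()); I have not tested whether it is actually any faster:

library(dplyr)
library(tibble)

# named vector of the probabilities I need; the names become the column names
probs <- c(q95 = 0.95, q90 = 0.90, q75 = 0.75, q66 = 0.66, q50 = 0.50,
           q33 = 0.33, q25 = 0.25, q10 = 0.10, q5 = 0.05)

edf_data = values_data %>%
  select(region, year, value) %>%
  group_by(region, year) %>%
  summarise(mean = mean(value),
            # one quantile() call per group instead of nine; as_tibble_row()
            # spreads the named vector into one column per probability
            as_tibble_row(setNames(quantile(value, probs), names(probs))),
            .groups = "drop")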