Part of an analysis script at work was running extremely slowly, and after trying a few things we found a huge speed boost from just moving a small piece of code (a unique() call) outside of the filter statement. The goal is just to filter a few brands out of that part of the analysis. I've used the midwest dataset below to make a reprex. The problem is solved in the sense that it now runs fast enough for us to use, but I'm very curious why there's such a dramatic speed boost from computing the same thing and saving it as a separate vector before the filter statement.
```r
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.3
library(bench)
#> Warning: package 'bench' was built under R version 3.5.3

# load midwest dataset from tidyverse
all_county_df <- midwest

# define some of the counties we want to exclude later
some_county_df <- all_county_df %>%
  filter(county %in% unique(all_county_df$county[1:40]))

# exclude counties with a unique call inside the filter statement
inside_filter <- function() {
  df_filtered <- all_county_df %>%
    filter(!(county %in% unique(some_county_df$county)))
}

# exclude counties with a unique call outside the filter statement
outside_filter <- function() {
  some_county_list <- unique(some_county_df$county)
  df_filtered <- all_county_df %>%
    filter(!(county %in% some_county_list))
}

times <- bench::mark(
  inside_filter(),
  outside_filter()
)
# total time for inside_filter is 1.01 s
# total time for outside_filter is 450.05 ms
```

<sup>Created on 2019-06-11 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
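One note on reading the numbers: the `total_time` column of a `bench::mark()` result covers all iterations that were run, so the per-iteration `median` is usually the more direct figure to compare. Below is a minimal sketch of pulling those columns out. The functions `version_a()` and `version_b()` and the vectors `x` and `lookup` are hypothetical stand-ins for illustration, not the functions from the reprex.

```r
library(bench)

# Hypothetical stand-in data: a character vector and a small exclusion set
x <- sample(letters, 1e4, replace = TRUE)
lookup <- c("a", "b", "c")

# Hypothetical stand-ins for inside_filter() and outside_filter()
version_a <- function() x[!(x %in% unique(rep(lookup, 100)))]
version_b <- function() x[!(x %in% lookup)]

# bench::mark() also checks by default that both expressions return equal results
times <- bench::mark(
  version_a(),
  version_b()
)

# Per-iteration median next to the iteration count and the total time
times[, c("expression", "median", "n_itr", "total_time")]
```

With the reprex itself, the same column selection could be applied to the `times` object created above.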
Thanks for running this. I installed the dev version of dplyr and also saw the time difference go away (if anything, inside_filter() may now be slightly faster).
Since I don't have a work-around for the group_by, I'll have to continue using 0.7.8 and the outside_filter() in my analysis code. But it is good to know that the speed difference is some sort of quirk of an older version of dplyr.
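For anyone in the same position, here is a minimal sketch of how the workaround could be sanity-checked. It confirms that the hoisted form returns the same rows as the inline form and reports the installed dplyr version. It assumes `midwest` comes from ggplot2 (which the tidyverse attaches) and, as a simplification of the reprex, takes the county names in the first 40 rows as the exclusion set. The names `excluded_counties`, `inline_result` and `hoisted_result` are made up for this example, and the sketch does not diagnose the slowdown itself.

```r
library(dplyr)

all_county_df <- ggplot2::midwest

# Counties to exclude (names found in the first 40 rows; a simplification)
excluded_counties <- unique(all_county_df$county[1:40])

# Inline form: the lookup vector is built inside filter()
inline_result <- all_county_df %>%
  filter(!(county %in% unique(all_county_df$county[1:40])))

# Hoisted form: the lookup vector is built once, before filter()
hoisted_result <- all_county_df %>%
  filter(!(county %in% excluded_counties))

# Both forms should select exactly the same rows
stopifnot(identical(inline_result, hoisted_result))

# Report which dplyr version is installed
packageVersion("dplyr")
```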