A project I'm working on requires me to work with times of the day. For this I'm using the hms
package and am facing very slow performance when using as_hms()
on a grouped tibble with ~10,000 groups.
Here's a reprex on a tibble with 2,600 groups. The objective is to create a new variable baseline_time
and fill it with the value in time
in rows where the values in compare_col
and col2
match (e.g. the first row where both variables contain the value a
). If they don't match, return NA
.
Since dplyr::if_else()
is type strict and there is no NA_hms_
constant, I coerce NA
into the appropriate type with as_hms()
. This code snippet returns the correct result but takes over 2 seconds to run.
library(dplyr, warn.conflicts = FALSE)
library(hms)
set.seed(42)
df <- tibble(col1 = rep(paste0("G", 1:100), times = 26),
col2 = rep(letters, each = 100),
compare_col = rep(letters, times = 100),
time = as_hms(paste(sample(0:23, 2600, replace = TRUE),
rep(0, 2600), rep(0, 2600), sep = ":")))
df
#> # A tibble: 2,600 x 4
#> col1 col2 compare_col time
#> <chr> <chr> <chr> <time>
#> 1 G1 a a 16:00
#> 2 G2 a b 04:00
#> 3 G3 a c 00:00
#> 4 G4 a d 09:00
#> 5 G5 a e 03:00
#> 6 G6 a f 17:00
#> 7 G7 a g 16:00
#> 8 G8 a h 14:00
#> 9 G9 a i 23:00
#> 10 G10 a j 06:00
#> # ... with 2,590 more rows
bench::system_time(
df %>%
group_by(col1, col2) %>%
mutate(baseline_time = if_else(compare_col == col2, time, as_hms(NA))) %>%
glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1 <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2 <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <time> 16:00:00, NA, NA, NA, NA, ...
#> process real
#> 2.33s 2.34s
With base::ifelse()
, I get an undesired double
vector instead (probably because ifelse()
doesn't know how to handle hms
objects) but it runs about 8 times faster.
bench::system_time(
df %>%
group_by(col1, col2) %>%
mutate(baseline_time = ifelse(compare_col == col2, time, as_hms(NA))) %>%
glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1 <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2 <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <dbl> 57600, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> process real
#> 312ms 303ms
Created on 2020-07-14 by the reprex package (v0.3.0)
The issue only gets more pronounced as the number of groups increase. On my actual data set, this section of code takes about as long to run as the rest of my 900 line script!
So here's my question: Is as_hms()
really that intensive an operation that it should be so much slower? Should I file a performance-related issue on the repo?