A project I'm working on requires me to work with times of the day. For this I'm using the hms package and am facing very slow performance when using as_hms() on a grouped tibble with ~10,000 groups.
Here's a reprex on a tibble with 2,600 groups. The objective is to create a new variable baseline_time and fill it with the value in time in rows where the values in compare_col and col2 match (e.g. the first row where both variables contain the value a). If they don't match, return NA.
Since dplyr::if_else() is type strict and there is no NA_hms_ constant, I coerce NA into the appropriate type with as_hms(). This code snippet returns the correct result but takes over 2 seconds to run.
library(dplyr, warn.conflicts = FALSE)
library(hms)
set.seed(42)
df <- tibble(col1 = rep(paste0("G", 1:100), times = 26),
col2 = rep(letters, each = 100),
compare_col = rep(letters, times = 100),
time = as_hms(paste(sample(0:23, 2600, replace = TRUE),
rep(0, 2600), rep(0, 2600), sep = ":")))
df
#> # A tibble: 2,600 x 4
#> col1 col2 compare_col time
#> <chr> <chr> <chr> <time>
#> 1 G1 a a 16:00
#> 2 G2 a b 04:00
#> 3 G3 a c 00:00
#> 4 G4 a d 09:00
#> 5 G5 a e 03:00
#> 6 G6 a f 17:00
#> 7 G7 a g 16:00
#> 8 G8 a h 14:00
#> 9 G9 a i 23:00
#> 10 G10 a j 06:00
#> # ... with 2,590 more rows
bench::system_time(
df %>%
group_by(col1, col2) %>%
mutate(baseline_time = if_else(compare_col == col2, time, as_hms(NA))) %>%
glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1 <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2 <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <time> 16:00:00, NA, NA, NA, NA, ...
#> process real
#> 2.33s 2.34s
With base::ifelse() , I get an undesired double vector instead (probably because ifelse() doesn't know how to handle hms objects) but it runs about 8 times faster.
bench::system_time(
df %>%
group_by(col1, col2) %>%
mutate(baseline_time = ifelse(compare_col == col2, time, as_hms(NA))) %>%
glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1 <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2 <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <dbl> 57600, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> process real
#> 312ms 303ms
Created on 2020-07-14 by the reprex package (v0.3.0)
The issue only gets more pronounced as the number of groups increase. On my actual data set, this section of code takes about as long to run as the rest of my 900 line script!
So here's my question: Is as_hms() really that intensive an operation that it should be so much slower? Should I file a performance-related issue on the repo?