as_hms() performance on grouped tibbles

siddharthprabhu · July 14, 2020, 12:30pm

A project I'm working on requires me to work with times of the day. For this I'm using the hms package and am facing very slow performance when using as_hms() on a grouped tibble with ~10,000 groups.

Here's a reprex on a tibble with 2,600 groups. The objective is to create a new variable baseline_time and fill it with the value in time in rows where the values in compare_col and col2 match (e.g. the first row where both variables contain the value a). If they don't match, return NA.

Since dplyr::if_else() is type strict and there is no NA_hms_ constant, I coerce NA into the appropriate type with as_hms(). This code snippet returns the correct result but takes over 2 seconds to run.

library(dplyr, warn.conflicts = FALSE)
library(hms)

set.seed(42)

df <- tibble(col1 = rep(paste0("G", 1:100), times = 26),
             col2 = rep(letters, each = 100),
             compare_col = rep(letters, times = 100),
             time = as_hms(paste(sample(0:23, 2600, replace = TRUE), 
                                 rep(0, 2600), rep(0, 2600), sep = ":")))

df
#> # A tibble: 2,600 x 4
#>    col1  col2  compare_col time  
#>    <chr> <chr> <chr>       <time>
#>  1 G1    a     a           16:00 
#>  2 G2    a     b           04:00 
#>  3 G3    a     c           00:00 
#>  4 G4    a     d           09:00 
#>  5 G5    a     e           03:00 
#>  6 G6    a     f           17:00 
#>  7 G7    a     g           16:00 
#>  8 G8    a     h           14:00 
#>  9 G9    a     i           23:00 
#> 10 G10   a     j           06:00 
#> # ... with 2,590 more rows

bench::system_time(
  df %>% 
    group_by(col1, col2) %>% 
    mutate(baseline_time = if_else(compare_col == col2, time, as_hms(NA))) %>% 
    glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1          <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2          <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col   <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time          <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <time> 16:00:00,       NA,       NA,       NA,       NA,   ...
#> process    real 
#>   2.33s   2.34s

With base::ifelse() , I get an undesired double vector instead (probably because ifelse() doesn't know how to handle hms objects) but it runs about 8 times faster.

bench::system_time(
  df %>% 
    group_by(col1, col2) %>% 
    mutate(baseline_time = ifelse(compare_col == col2, time, as_hms(NA))) %>% 
    glimpse()
)
#> Rows: 2,600
#> Columns: 5
#> Groups: col1, col2 [2,600]
#> $ col1          <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",...
#> $ col2          <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a"...
#> $ compare_col   <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"...
#> $ time          <time> 16:00:00, 04:00:00, 00:00:00, 09:00:00, 03:00:00, 17...
#> $ baseline_time <dbl> 57600, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
#> process    real 
#>   312ms   303ms

^{Created on 2020-07-14 by the reprex package (v0.3.0)}

The issue only gets more pronounced as the number of groups increase. On my actual data set, this section of code takes about as long to run as the rest of my 900 line script!

So here's my question: Is as_hms() really that intensive an operation that it should be so much slower? Should I file a performance-related issue on the repo?

nirgrahamuk · July 15, 2020, 9:21am

Here are some timinings. I can reproduce your experience that combination of group_by and as_hms is slow at 1.3 seconds on my machine.

My first solution, might be seen as a dodge, since it relates to the particulars of the calculation and my recognition that given the narrow scope as given, there is no benefit/difference to grouping by at all.

However, just in case it is necessary, perhaps it makes sense to only use as_hms on ongroup results after simply copying the needing timings over. this works very fast also.

library(dplyr)
library(hms)
library(microbenchmark)
set.seed(42)

raw_data <- tibble(col1 = rep(paste0("G", 1:100), times = 26),
                   col2 = rep(letters, each = 100),
                   compare_col = rep(letters, times = 100),
                   text_time = paste(sample(0:23, 2600, replace = TRUE), 
                                     rep(0, 2600), rep(0, 2600), sep = ":"),
                   time = as_hms(text_time))

microbenchmark(df_as_hms = df_as_hms <- group_by(raw_data,
                                                 col1,col2) %>%
  mutate(baseline_time = if_else(compare_col == col2, time, as_hms(NA))) %>% ungroup(),
  times = 5L)
  

df_as_hms


microbenchmark(df_as_hms_no_group = df_as_hms_no_group <-raw_data %>%
                 mutate(baseline_time = if_else(compare_col == col2, time, as_hms(NA))),
               times = 5L)

df_as_hms_no_group
identical(df_as_hms,df_as_hms_no_group)

microbenchmark(df_as_hms_nasout = df_as_hms_nasout <- group_by(raw_data,
                                                 col1,col2) %>%
                 mutate(baseline_time = if_else(compare_col == col2, as.numeric(time), NA_real_)) %>% ungroup %>%
                 mutate(baseline_time=as_hms(baseline_time)),
               times = 5L)

identical(df_as_hms,df_as_hms_nasout)

siddharthprabhu · July 15, 2020, 10:00am

@nirgrahamuk Sometimes it helps to just have another set of eyes look at your problem! You are absolutely correct that the grouping makes no difference (tested on my actual data set as well). I was able to improve performance by simply moving the mutate() statement out of the group_by(). The workaround for the grouped case is also pretty nice. Thanks a lot for your help!

nirgrahamuk · July 15, 2020, 10:41am

you're very welcome, I'm happy to help, and learning myself through practice.

system · July 22, 2020, 10:42am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.