Using any with rowwise() seems slow

uridavid.akavia · April 25, 2024, 2:52pm

I'm processing large datasets, where I flag each row that would be removed by a candidate filter. I then want a catch-all filter column that is TRUE if any of the individual filters is TRUE.

So, this is how I create the individual filters

filtered_tbl <- expression_tbl |>
  dplyr::mutate(
        f_LowDP = DP < 100,
        f_LowAltDP = AD_Alt < 5,
        f_LowAF = AF < 0.02,
        f_HighAF = AF >= 0.95)

Now, if I know what filters I'm using, I can explicitly state them

filtered_tbl <- expression_tbl |>
  dplyr::mutate(
        f_LowDP = DP < 100,
        f_LowAltDP = AD_Alt < 5,
        f_LowAF = AF < 0.02,
        f_HighAF = AF >= 0.95,
        filterEWES = f_LowDP | f_LowAltDP | f_LowAF | f_HighAF
)

However, that requires me to list every filter explicitly, so whenever I add or remove a filter I also have to update the catch-all expression.

So, I tried using rowwise(), and ended up with

filtered_tbl <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95
  ) |>
  dplyr::rowwise() |>
  dplyr::mutate(
    # Evaluated once per row, which is why this gets slow on large tables
    filterEWES = any(
      dplyr::c_across(dplyr::starts_with("f_", ignore.case = FALSE)),
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup()

This is super slow, so it seems I'm doing something wrong.

As a comparison, if I do this

filtered_tbl2 <- expression_tbl |>
  dplyr::mutate(
        f_LowDP = DP < 100,
        f_LowAltDP = AD_Alt < 5,
        f_LowAF = AF < 0.02,
        f_HighAF = AF >= 0.95)

filtered_tbl2 <-
  dplyr::bind_cols(filtered_tbl2,
                   filterEWES = rowSums(
                     filtered_tbl2 |>
                       dplyr::select(dplyr::starts_with("f_",
                                                        ignore.case = F)),
                     na.rm = T
                   ) > 0)

It is very fast.
Is there a tidyverse way to specify the columns by dplyr::select() and have it be processed fast?
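
The speed gap described above can be checked on a small synthetic table. This is only a rough sketch: demo_tbl, its column names, and the row count are made up for illustration, and the timings will depend on the machine and the dplyr version.

```r
library(dplyr)

set.seed(1)
n <- 20000

# Hypothetical table of logical filter columns (not the real expression_tbl)
demo_tbl <- tibble(
  f_a = runif(n) < 0.1,
  f_b = runif(n) < 0.1,
  f_c = runif(n) < 0.1
)

# Row-by-row version: any() is called once per row
slow_time <- system.time(
  slow <- demo_tbl |>
    rowwise() |>
    mutate(any_f = any(c_across(starts_with("f_")), na.rm = TRUE)) |>
    ungroup()
)

# Vectorised version: a single rowSums() call over the selected columns
fast_time <- system.time(
  fast <- demo_tbl |>
    mutate(
      any_f = rowSums(select(demo_tbl, starts_with("f_")), na.rm = TRUE) > 0
    )
)

# Compare elapsed times; both versions should flag the same rows
print(rbind(rowwise = slow_time, rowSums = fast_time))
```

Both versions produce the same column; only the number of function calls differs.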

Thank you,

Uri David

dromano · April 25, 2024, 3:05pm

If you change any() to sum(), does that make a difference?

uridavid.akavia · April 25, 2024, 3:14pm

It is still very slow. I can't say whether it is equally slow, because I stopped it before it finished. I also stopped the any() version before it finished, so both are too slow to be usable.

dromano · April 25, 2024, 3:30pm

I'm not sure what you meant by "a tidyverse way", but here is an alternative version of your fast code:

filtered_tbl2 <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95
  ) %>%
  dplyr::mutate( 
    filterEWES = 
      rowSums(
        . |> dplyr::select(dplyr::starts_with("f_", ignore.case = F)),
        na.rm = T
      ) > 0
  )

The important difference is the use of the magrittr pipe, %>%, before the second mutate() call. It binds the piped-in table to the . placeholder, so the select() nested inside rowSums() sees the newly made f_* columns.
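
Another option that stays within dplyr and avoids rowwise() might be dplyr::if_any(), which combines the selected columns with a vectorised OR. The sketch below is not from the thread: the small expression_tbl is made up for illustration, and it assumes dplyr 1.0.4 or later, where if_any() is available. The .x %in% TRUE step is one way to treat NA as FALSE, which mimics any(..., na.rm = TRUE).

```r
library(dplyr)

# Hypothetical stand-in for expression_tbl; the values are made up
expression_tbl <- tibble(
  DP     = c(150, 80, 200, NA),
  AD_Alt = c(10, 10, 3, 10),
  AF     = c(0.50, 0.50, 0.50, 0.50)
)

filtered_tbl <- expression_tbl |>
  mutate(
    f_LowDP    = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF    = AF < 0.02,
    f_HighAF   = AF >= 0.95
  ) |>
  mutate(
    # if_any() ORs the selected columns element-wise, so no rowwise() is needed.
    # `.x %in% TRUE` turns NA into FALSE before the columns are combined.
    filterEWES = if_any(
      starts_with("f_", ignore.case = FALSE),
      ~ .x %in% TRUE
    )
  )

print(filtered_tbl$filterEWES)
```

Because the columns are chosen with starts_with(), adding or removing an f_* filter should not require touching the catch-all line.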
