I'm processing large datasets in which I flag each row that a candidate filter would remove, and I then want a catch-all column that is TRUE if any of the individual filters is TRUE.
This is how I create the individual filters:
filtered_tbl <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95
  )
Now, if I know which filters I'm using, I can combine them explicitly:
filtered_tbl <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95,
    filterEWES = f_LowDP | f_LowAltDP | f_LowAF | f_HighAF
  )
However, that requires me to list every filter explicitly, so whenever I add or remove a filter I also have to edit the catch-all expression.
So I tried dplyr::rowwise() and ended up with:
filtered_tbl <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95
  ) |>
  dplyr::rowwise() |>
  dplyr::mutate(
    filterEWES = any(
      dplyr::c_across(dplyr::starts_with("f_", ignore.case = FALSE)),
      na.rm = TRUE
    )
  ) |>
  dplyr::ungroup()
This is very slow, so I suspect I'm doing something wrong.
For comparison, if I do this instead:
filtered_tbl2 <- expression_tbl |>
  dplyr::mutate(
    f_LowDP = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF = AF < 0.02,
    f_HighAF = AF >= 0.95
  )
filtered_tbl2 <- dplyr::bind_cols(
  filtered_tbl2,
  filterEWES = rowSums(
    filtered_tbl2 |>
      dplyr::select(dplyr::starts_with("f_", ignore.case = FALSE)),
    na.rm = TRUE
  ) > 0
)
it is very fast.
Is there a tidyverse way to specify the columns, as I would with dplyr::select(), and still have the computation run fast?
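For reference, here is a minimal sketch of the kind of vectorised approach I am after. It assumes dplyr >= 1.0.4 for if_any() and dplyr >= 1.1.0 for pick(); the toy expression_tbl and the column name filterEWES_alt are made up for illustration. The `%in% TRUE` wrapper is only my attempt to mimic na.rm = TRUE, and I have not benchmarked either variant on the real data.

```r
library(dplyr)

# Toy stand-in for the real expression_tbl (values invented for illustration)
expression_tbl <- tibble::tibble(
  DP     = c(50, 200, 300, NA),
  AD_Alt = c(10, 3, 20, 20),
  AF     = c(0.5, 0.5, 0.5, 0.5)
)

filtered_tbl <- expression_tbl |>
  mutate(
    f_LowDP    = DP < 100,
    f_LowAltDP = AD_Alt < 5,
    f_LowAF    = AF < 0.02,
    f_HighAF   = AF >= 0.95
  ) |>
  mutate(
    # if_any() works on whole columns, so no rowwise() is needed.
    # `.x %in% TRUE` turns NA into FALSE, similar to any(..., na.rm = TRUE).
    filterEWES = if_any(starts_with("f_", ignore.case = FALSE), ~ .x %in% TRUE),
    # Same idea as the rowSums() version, kept inside mutate() via pick().
    filterEWES_alt = rowSums(
      pick(starts_with("f_", ignore.case = FALSE)),
      na.rm = TRUE
    ) > 0
  )
```

Both columns pick up new f_ filters automatically, because the selection is done by name prefix rather than by listing each filter.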
Thank you,
Uri David