Hey folks,
I'm experiencing an error when using `str_split_1()` to try to do substring matching within a `mutate()` call. I've provided my real use-case and a reproducible example below. Any help would be much appreciated!
Real use-case
I have a tibble `mapstats` with 2 columns and 129735 rows. The first 5 rows are:
| card_short_name | drug_class |
|---|---|
| CblA-1 | cephalosporin |
| SHV-52 | carbapenem;cephalosporin;penam |
| dfrF | diaminopyrimidine antibiotic |
| CTX-M-130 | cephalosporin |
| NDM-6 | carbapenem;cephalosporin;cephamycin;penam |
And a character vector `classes_to_match = c("penam", "macrolide antibiotic", "diaminopyrimidine antibiotic")`.
I want to create a new logical column `matched` that is TRUE if at least one substring of `drug_class` (delimited by the `;` character) matches at least one string in `classes_to_match`, and FALSE otherwise.
First, I tried using `str_split_1()`:
mapstats |>
  mutate(matched = if_else(
    any(str_split_1(drug_class, ";") %in% classes_to_match),
    TRUE,
    FALSE
  ))
Error in `mutate()`:
ℹ In argument: `matched = if_else(...)`.
Caused by error in `str_split_1()`:
! `string` must be a single string, not a character vector.
Then, I tried with `str_split()`:
mapstats |>
  mutate(matched = if_else(
    any(unlist(str_split(drug_class, ";")) %in% classes_to_match),
    TRUE,
    FALSE
  )) |>
  count(matched)
# A tibble: 1 × 2
  matched      n
  <lgl>    <int>
1 TRUE    129735
When using `str_split()`, all values for `matched` are TRUE, so I'm wondering if the function isn't being run rowwise?
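
To illustrate my suspicion, here is a minimal sketch outside of `mutate()`, using a hypothetical two-row stand-in for the `drug_class` column (the values are copied from the first two rows above). As far as I can tell, `str_split()` returns a list with one element per row, `unlist()` pools all of those pieces together, and `any()` then reduces the pool to a single value, which would be recycled to every row:

```r
library(stringr)

# Hypothetical stand-in for the drug_class column (first two rows only)
drug_class <- c("cephalosporin", "carbapenem;cephalosporin;penam")
classes_to_match <- c("penam", "macrolide antibiotic", "diaminopyrimidine antibiotic")

# str_split() is vectorised: it returns a list with one character vector per row
pieces <- str_split(drug_class, ";")
n_pieces <- lengths(pieces)

# unlist() flattens the pieces from every row into one vector,
# so any() collapses the whole column into a single logical value
pooled <- any(unlist(pieces) %in% classes_to_match)

print(n_pieces)
print(pooled)
```

If that reading is right, the single value in `pooled` explains why every row ends up with the same result: the first row on its own contains none of the classes, yet the pooled check still succeeds because of the second row.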
Reproducible example
library(tibble)
library(stringr)
library(dplyr)
set.seed(123)
# Function for obtaining random ;-delimited strings of letters
get_letters = function(x) {
  sample(
    LETTERS,
    sample(1:10, 1)
  ) |>
    paste(collapse = ";")
}
# Create a tibble with 5 rows
df = tibble(
  "letter_str" = replicate(5, get_letters())
)
letters_to_match = c("B", "E", "R", "T")
> df
# A tibble: 5 × 1
  letter_str
  <chr>
1 N;C;J
2 V;K
3 T;N;V;E;S
4 Y;I;C;H;G;J;Z;S;D
5 K
> df |>
+   mutate(matched = if_else(
+     any(str_split_1(letter_str, ";") %in% letters_to_match),
+     TRUE,
+     FALSE
+   ))
Error in `mutate()`:
ℹ In argument: `matched = if_else(...)`.
Caused by error in `str_split_1()`:
! `string` must be a single string, not a character vector.
> df |>
+   mutate(matched = if_else(
+     any(unlist(str_split(letter_str, ";")) %in% letters_to_match),
+     TRUE,
+     FALSE
+   )) |>
+   count(matched)
# A tibble: 1 × 2
  matched     n
  <lgl>   <int>
1 TRUE        5
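
For comparison, here is a sketch of the per-row behaviour I'm after, written against the same five strings (hard-coded here so the block runs on its own). It keeps the list returned by `str_split()` and checks each element separately with base R's `vapply()`. This is only one possible approach and rests on my assumption that the check needs to happen per list element; I don't know whether it is the recommended way:

```r
library(tibble)
library(dplyr)
library(stringr)

# The five strings from the example above, hard-coded for reproducibility
df <- tibble(
  letter_str = c("N;C;J", "V;K", "T;N;V;E;S", "Y;I;C;H;G;J;Z;S;D", "K")
)
letters_to_match <- c("B", "E", "R", "T")

result <- df |>
  mutate(matched = vapply(
    str_split(letter_str, ";"),               # list: one character vector per row
    function(x) any(x %in% letters_to_match), # checked separately for each row
    logical(1)                                # each check returns one TRUE/FALSE
  ))

print(result)
```

With these inputs only the third row (`T;N;V;E;S`) contains any of the target letters, so I would expect `matched` to be TRUE there and FALSE elsewhere.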