Substring matching with `str_split_1()` within `mutate()` throws an error

Oxygen4985 · August 21, 2024, 3:56pm

Hey folks,

I'm experiencing an error when using str_split_1() to try to do substring matching within a mutate() function. I've provided my real use-case and a reproducible example below. Any help would be much appreciated!

Real use-case

I have a tibble mapstats with 2 columns and 129735 rows. The first 5 rows are:

card_short_name	drug_class
CblA-1	cephalosporin
SHV-52	carbapenem;cephalosporin;penam
dfrF	diaminopyrimidine antibiotic
CTX-M-130	cephalosporin
NDM-6	carbapenem;cephalosporin;cephamycin;penam

And a character vector classes_to_match = c("penam", "macrolide antibiotic", "diaminopyrimidine antibiotic").

I want to create a new column matched of type logical, where values are TRUE if at least one substring within drug_class (delimited by the ';' character) matches at least one string in classes_to_match, else FALSE.

First, I tried using str_split_1():

mapstats |>  
  mutate(matched = if_else(
    any(str_split_1(drug_class, ";") %in% classes_to_match),
    TRUE,
    FALSE
  ))

Error in `mutate()`:
ℹ In argument: `matched = if_else(...)`.
Caused by error in `str_split_1()`:
! `string` must be a single string, not a character vector.

Then, I tried with str_split():

mapstats |> 
  mutate(matched = if_else(
    any(unlist(str_split(drug_class, ";")) %in% classes_to_match),
    TRUE,
    FALSE
  )) |> 
  count(matched)

# A tibble: 1 × 2
  matched      n
  <lgl>    <int>
1 TRUE    129735

When using str_split(), all values for matched are TRUE, so I'm wondering if the function isn't being run rowwise?

Reproducible example

library(tibble)
library(stringr)
library(dplyr)

set.seed(123)

# Function for obtaining random ;-delimited strings of letters
get_letters = function(x) {
  sample(
    LETTERS,
    sample(1:10, 1)
    ) |> 
    paste(collapse = ";")
}

# Create a tibble with 5 rows
df = tibble(
  "letter_str" = replicate(5, get_letters())
)

letters_to_match = c("B", "E", "R", "T")

> df
# A tibble: 5 × 1
  letter_str       
  <chr>            
1 N;C;J            
2 V;K              
3 T;N;V;E;S        
4 Y;I;C;H;G;J;Z;S;D
5 K

> df |>
+   mutate(matched = if_else(
+     any(str_split_1(letter_str, ";") %in% letters_to_match),
+     TRUE,
+     FALSE
+   ))
Error in `mutate()`:
ℹ In argument: `matched = if_else(...)`.
Caused by error in `str_split_1()`:
! `string` must be a single string, not a character vector.

> df |>
+   mutate(matched = if_else(
+     any(unlist(str_split(letter_str, ";")) %in% letters_to_match),
+     TRUE,
+     FALSE
+   )) |> 
+   count(matched)
# A tibble: 1 × 2
  matched     n
  <lgl>   <int>
1 TRUE        5

AlexisW · August 21, 2024, 10:57pm

mutate() expects an expression that takes a column (or several) and returns a vector of the same length, or a single value that it will recycle. So it is only well suited for vectorized functions (noting that you can easily build a vectorized function with map_*() if you have a function that takes one value and returns one value).

With str_split_1(), you run into a first problem: it's a function that takes a single string as input (i.e. a character vector of length 1). So when you call str_split_1(drug_class, ";"), you are trying to pass the entire column drug_class, and it refuses.
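To see that behaviour in isolation, here is a minimal sketch outside of any data frame. The two input strings are made up for illustration, and it assumes stringr 1.5.0 or later, where str_split_1() is available:

```r
library(stringr)

# One string in, one character vector out
str_split_1("carbapenem;cephalosporin;penam", ";")
#> [1] "carbapenem"    "cephalosporin" "penam"

# Two strings in: str_split_1() raises an error, captured here with tryCatch()
res <- tryCatch(
  str_split_1(c("cephalosporin", "carbapenem;penam"), ";"),
  error = function(e) "error"
)
res
#> [1] "error"
```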

Indeed, the correct vectorized function is str_split(). And as you correctly noticed, it takes a vector (e.g. a column) and returns a list, where each list element is a new vector with the results of the split. The best way to see what it does is to try it outside of mutate():

tab
#> # A tibble: 5 × 2
#>   card_short_name drug_class                               
#>   <chr>           <chr>                                    
#> 1 CblA-1          cephalosporin                            
#> 2 SHV-52          carbapenem;cephalosporin;penam           
#> 3 dfrF            diaminopyrimidine antibiotic             
#> 4 CTX-M-130       cephalosporin                            
#> 5 NDM-6           carbapenem;cephalosporin;cephamycin;penam

str_split(tab$drug_class, ";")
#> [[1]]
#> [1] "cephalosporin"
#> 
#> [[2]]
#> [1] "carbapenem"    "cephalosporin" "penam"        
#> 
#> [[3]]
#> [1] "diaminopyrimidine antibiotic"
#> 
#> [[4]]
#> [1] "cephalosporin"
#> 
#> [[5]]
#> [1] "carbapenem"    "cephalosporin" "cephamycin"    "penam"

And when you unlist() this list, you get a single vector:

unlist(str_split(tab$drug_class, ";"))
#>  [1] "cephalosporin"                "carbapenem"                   "cephalosporin"               
#>  [4] "penam"                        "diaminopyrimidine antibiotic" "cephalosporin"               
#>  [7] "carbapenem"                   "cephalosporin"                "cephamycin"                  
#> [10] "penam"

But now there is a problem: by flattening everything into one vector, you lose the link to the rows of the original data frame! You now have a vector of length 10 from a data frame of 5 rows. The first element corresponds to the first row, elements 2-4 correspond to the second row, and so on, so there is no easy correspondence.

Even worse, when you feed that into if_else(any(... %in% ...)), you're asking "is at least one of these elements in classes_to_match?", and the answer is a single TRUE:

if_else(
  any(unlist(str_split(tab$drug_class, ";")) %in% classes_to_match),
  TRUE,
  FALSE
)
#> [1] TRUE

Because the expression inside mutate() returned a single value, mutate() helpfully recycled it to fill the whole column (often useful, but not what you want here).

Here, you want to ask, for each row of the data frame, whether one of the drugs is in classes_to_match. So we need to create our own vectorized function, for example with map_lgl():
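The recycling is easy to reproduce on a toy tibble. This is only a sketch: the tibble toy and its column x are made up for illustration and are not part of the original data.

```r
library(dplyr)
library(tibble)

# Toy data, made up for this sketch
toy <- tibble(x = c("a", "b", "c"))

# any() collapses the whole column to one logical value,
# and mutate() recycles that value into every row
recycled <- toy |>
  mutate(flag = any(x %in% c("a", "z")))

recycled$flag
#> [1] TRUE TRUE TRUE
```

Only the first row actually matches, but every row ends up TRUE, which is the same pattern as the count(matched) output in the question.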

map_lgl(tab$drug_class,
        \(one_row){
          any( str_split_1(one_row, ";") %in% classes_to_match )
        })
#> [1] FALSE  TRUE  TRUE FALSE  TRUE

And we can put that whole thing in the mutate:

tab |>
  mutate(matched = map_lgl(tab$drug_class,
                           \(one_row){
                             any( str_split_1(one_row, ";") %in% classes_to_match )
                           }))
#> # A tibble: 5 × 3
#>   card_short_name drug_class                                matched
#>   <chr>           <chr>                                     <lgl>  
#> 1 CblA-1          cephalosporin                             FALSE  
#> 2 SHV-52          carbapenem;cephalosporin;penam            TRUE   
#> 3 dfrF            diaminopyrimidine antibiotic              TRUE   
#> 4 CTX-M-130       cephalosporin                             FALSE  
#> 5 NDM-6           carbapenem;cephalosporin;cephamycin;penam TRUE

Note, you can do something equivalent by first splitting, then mapping on the list:

splitted_drugs <- str_split(tab$drug_class, ";")

map_lgl(splitted_drugs,
        \(one_set_of_drugs) any(one_set_of_drugs %in% classes_to_match))
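That split-then-map version should also drop straight into mutate(). The sketch below rebuilds tab from the five example rows in the question so it runs on its own; the name result is just a placeholder.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Rebuilt from the five example rows in the question
tab <- tibble(
  card_short_name = c("CblA-1", "SHV-52", "dfrF", "CTX-M-130", "NDM-6"),
  drug_class = c(
    "cephalosporin",
    "carbapenem;cephalosporin;penam",
    "diaminopyrimidine antibiotic",
    "cephalosporin",
    "carbapenem;cephalosporin;cephamycin;penam"
  )
)
classes_to_match <- c("penam", "macrolide antibiotic", "diaminopyrimidine antibiotic")

# str_split() returns one list element per row; map_lgl() reduces each to TRUE/FALSE
result <- tab |>
  mutate(matched = map_lgl(
    str_split(drug_class, ";"),
    \(one_set_of_drugs) any(one_set_of_drugs %in% classes_to_match)
  ))

result$matched
#> [1] FALSE  TRUE  TRUE FALSE  TRUE
```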

Oxygen4985 · August 22, 2024, 8:14pm

Hi Alexis,

Thank you so much for that detailed explanation! The use of map_lgl() is working perfectly now.

Below is my final code:

df |> 
  mutate(matched = if_else(
    map_lgl(drug_class, ~ any(str_split_1(.x, ";") %in% classes_to_match)),
    TRUE,
    FALSE
  ))

AlexisW · August 22, 2024, 9:36pm

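Oops, I forgot to remove the tab$ inside the mutate() call in my earlier reply. Within mutate() you can refer to drug_class directly, as you did.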

Just a note: in your case I don't think you need the if_else(). You're doing "if TRUE then TRUE, and if FALSE then FALSE", so does it serve any purpose?
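A quick way to check this, sketched on two of the example strings from the question (the standalone vector drug_class here is just for illustration, not the full column):

```r
library(dplyr)
library(purrr)
library(stringr)

classes_to_match <- c("penam", "macrolide antibiotic", "diaminopyrimidine antibiotic")
drug_class <- c("cephalosporin", "carbapenem;cephalosporin;penam")

matched <- map_lgl(drug_class, \(x) any(str_split_1(x, ";") %in% classes_to_match))

# Wrapping the logical vector in if_else() gives back the same vector
identical(if_else(matched, TRUE, FALSE), matched)
#> [1] TRUE
```

So mutate(matched = map_lgl(...)) on its own should give the same column.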
