Substring matching with `str_split_1()` within `mutate()` throws an error

mutate() expects an expression that takes one or more columns and returns a vector with one value per row. So it is best suited to vectorized functions. If you have a function that takes one value and returns one value, you can easily vectorize it with map_*().

With str_split_1(), you run into a first problem: it takes a single string as input (i.e. a character vector of length 1). So when you call str_split_1(drug_class, ";"), you are passing the entire drug_class column, and it refuses.
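To see that refusal in isolation, here is a minimal sketch. It assumes stringr 1.5.0 or later (where str_split_1() is available), and the two-element vector is made up for illustration:

```r
library(stringr)

# Hypothetical two-element vector standing in for a column
two_strings <- c("carbapenem;penam", "cephalosporin")

# str_split_1() accepts exactly one string, so a longer vector is an error;
# tryCatch() turns that error into a marker value we can inspect
result <- tryCatch(
  str_split_1(two_strings, ";"),
  error = function(e) "error"
)
result
#> [1] "error"

# With a single string it returns a plain character vector, not a list
str_split_1(two_strings[1], ";")
#> [1] "carbapenem" "penam"
```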

Indeed, the correct vectorized function is str_split(). As you correctly noticed, it takes a vector (e.g. a column) and returns a list, where each element is a character vector holding the pieces of the corresponding string. The best way to see what it does is to try it outside of mutate():

tab
#> # A tibble: 5 × 2
#>   card_short_name drug_class                               
#>   <chr>           <chr>                                    
#> 1 CblA-1          cephalosporin                            
#> 2 SHV-52          carbapenem;cephalosporin;penam           
#> 3 dfrF            diaminopyrimidine antibiotic             
#> 4 CTX-M-130       cephalosporin                            
#> 5 NDM-6           carbapenem;cephalosporin;cephamycin;penam

str_split(tab$drug_class, ";")
#> [[1]]
#> [1] "cephalosporin"
#> 
#> [[2]]
#> [1] "carbapenem"    "cephalosporin" "penam"        
#> 
#> [[3]]
#> [1] "diaminopyrimidine antibiotic"
#> 
#> [[4]]
#> [1] "cephalosporin"
#> 
#> [[5]]
#> [1] "carbapenem"    "cephalosporin" "cephamycin"    "penam"  

And when you unlist() this list, you get a single vector:

unlist(str_split(tab$drug_class, ";"))
#>  [1] "cephalosporin"                "carbapenem"                   "cephalosporin"               
#>  [4] "penam"                        "diaminopyrimidine antibiotic" "cephalosporin"               
#>  [7] "carbapenem"                   "cephalosporin"                "cephamycin"                  
#> [10] "penam"   

But now there is a problem: by flattening everything together, you lost the position in the initial data frame! You now have a vector of length 10, from a data frame of 5 rows. The first element corresponds to the first row, elements 2-4 correspond to the second row, and so on, so there is no easy correspondence.
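The bookkeeping that unlist() throws away can be made visible with base R's lengths(). This is only a sketch to illustrate the mismatch, not a step in the final solution; the drug_class vector is copied from the tibble above, and the name pieces is introduced here for illustration:

```r
library(stringr)

# Same drug_class values as in the tibble above
drug_class <- c(
  "cephalosporin",
  "carbapenem;cephalosporin;penam",
  "diaminopyrimidine antibiotic",
  "cephalosporin",
  "carbapenem;cephalosporin;cephamycin;penam"
)

pieces <- str_split(drug_class, ";")

# Number of pieces produced by each original row
lengths(pieces)
#> [1] 1 3 1 1 4

# Row that each element of the flattened vector came from
rep(seq_along(pieces), lengths(pieces))
#> [1] 1 2 2 2 3 4 5 5 5 5
```

The list still knows which row each piece belongs to; the flattened vector does not.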

Even worse, when you feed that into if_else(any(... %in% ...)), you're asking "is any of these elements in classes_to_match?", and the answer is a single TRUE:

if_else(
  any(unlist(str_split(tab$drug_class, ";")) %in% classes_to_match),
  TRUE,
  FALSE
)
#> [1] TRUE

Because you gave mutate() a single value, it helpfully recycled it to fill the whole column (often useful, but not what you want here).
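That recycling is easy to reproduce on a toy example. The tibble small and the column flag below are hypothetical names used only for this sketch:

```r
library(dplyr)

# Hypothetical three-row tibble
small <- tibble(x = 1:3)

# any() collapses the whole column to one value,
# and mutate() recycles that single value to every row
recycled <- small |> mutate(flag = any(x > 2))
recycled$flag
#> [1] TRUE TRUE TRUE
```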

Here, you want to ask, for each row of the data frame, whether any of its drugs is in classes_to_match. So we need to build our own vectorized function, for example with map_lgl():

map_lgl(tab$drug_class,
        \(one_row){
          any( str_split_1(one_row, ";") %in% classes_to_match )
        })
#> [1] FALSE  TRUE  TRUE FALSE  TRUE

And we can put that whole thing in the mutate:

tab |>
  # refer to the column directly: inside mutate(), drug_class is already in scope
  mutate(matched = map_lgl(drug_class,
                           \(one_row){
                             any( str_split_1(one_row, ";") %in% classes_to_match )
                           }))
#> # A tibble: 5 × 3
#>   card_short_name drug_class                                matched
#>   <chr>           <chr>                                     <lgl>  
#> 1 CblA-1          cephalosporin                             FALSE  
#> 2 SHV-52          carbapenem;cephalosporin;penam            TRUE   
#> 3 dfrF            diaminopyrimidine antibiotic              TRUE   
#> 4 CTX-M-130       cephalosporin                             FALSE  
#> 5 NDM-6           carbapenem;cephalosporin;cephamycin;penam TRUE   

Note that you can get the same result by splitting first, then mapping over the resulting list:

split_drugs <- str_split(tab$drug_class, ";")

map_lgl(split_drugs,
        \(one_set_of_drugs) any(one_set_of_drugs %in% classes_to_match))
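For completeness, here is a self-contained sketch of that split-then-map variant inside mutate(). The tibble is rebuilt from the printout above. The value of classes_to_match is an assumption: the original vector is not shown, so this is just one choice that is consistent with the results printed earlier.

```r
library(dplyr)
library(stringr)
library(purrr)

# Rebuilt from the printed tibble above
tab <- tibble(
  card_short_name = c("CblA-1", "SHV-52", "dfrF", "CTX-M-130", "NDM-6"),
  drug_class = c(
    "cephalosporin",
    "carbapenem;cephalosporin;penam",
    "diaminopyrimidine antibiotic",
    "cephalosporin",
    "carbapenem;cephalosporin;cephamycin;penam"
  )
)

# ASSUMPTION: hypothetical values, the real classes_to_match is not shown
classes_to_match <- c("carbapenem", "diaminopyrimidine antibiotic")

result <- tab |>
  mutate(
    matched = map_lgl(
      str_split(drug_class, ";"),  # one character vector per row
      \(one_set_of_drugs) any(one_set_of_drugs %in% classes_to_match)
    )
  )

result$matched
#> [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Because str_split() keeps one list element per row, map_lgl() returns exactly one logical per row, which is the shape mutate() needs.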