This code is meant to extract two pieces of information from one column of a data frame and use them as inputs to make substitutions in a string stored in a different column.
library(tidyverse)

#data
data <- tibble(
  sequence = "KBLFFKTYT",
  mut = "Dummy (ignore); Test_2 (K1); Test-3 (K6)"
)
#function1
matcheR <- function(add){
  str_match_all(string = add,
                pattern = "(?<add>[a-zA-Z0-9_-]+\\s?[a-zA-Z0-9_-]+?) \\((?<site>[A-Z])(?<index>\\d+)\\)") |>
    data.frame()
}
#function2
replaceR <- function(sequence, replace){
  if (nrow(replace) == 0) {
    return(sequence)
  }
  replacement <- as.character(replace$add)
  position <- as.integer(replace$index)
  chars_to_replace <- str_sub(sequence, position, position)
  replaced_vector <- str_replace_all(sequence, set_names(replacement, chars_to_replace))
}
x <- data |>
  mutate(sequence2 = map(mut, matcheR)) |>
  mutate(sequence3 = map2_chr(.x = sequence, .y = sequence2,
                              ~ replaceR(sequence = .x, replace = .y)))
Output:
> x
# A tibble: 1 × 4
# sequence modifications sequence2 sequence3
# <chr> <chr> <list> <chr>
#1 KBLFFKTYT ASL-1 (ignore); Test-2 (K1); Test2 (K6) <df [2 × 4]> Test-2BLFFTest-2TYT
But the desired value of the sequence3 column is Test-2BLFFTest-3TYT.
Why are the indices not read correctly?
Thanks for the help.
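
A minimal sketch of what seems to be going on, using the values from the `data` tibble above. In `replaceR()` the positions are only used to look up the characters at those positions (here "K" and "K"), and `str_replace_all()` then substitutes by pattern, so the first pair rewrites every K and nothing is left for the second pair. The helper `replace_at()` below is hypothetical (it is not part of the original code) and shows one possible position-based alternative; it assumes each position refers to a single character of the original sequence.

```r
library(stringr)

sequence    <- "KBLFFKTYT"
replacement <- c("Test_2", "Test-3")
position    <- c(1L, 6L)

# Roughly what replaceR() ends up calling: both names are "K", so the first
# pattern replaces every K and the second pattern no longer finds a match.
by_pattern <- str_replace_all(sequence, c(K = "Test_2", K = "Test-3"))
by_pattern
# "Test_2BLFFTest_2TYT"

# Hypothetical helper: replace by position instead of by pattern.
# Working from the highest position down keeps the lower indices valid
# even though each replacement changes the length of the string.
replace_at <- function(sequence, position, replacement) {
  for (i in order(position, decreasing = TRUE)) {
    str_sub(sequence, position[i], position[i]) <- replacement[i]
  }
  sequence
}

by_position <- replace_at(sequence, position, replacement)
by_position
# "Test_2BLFFTest-3TYT"
```

If this approach fits, the body of `replaceR()` could call such a helper with `as.integer(replace$index)` and `as.character(replace$add)` in place of the `str_replace_all()` line.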