str_match_all, str_sub, str_replace_all

Hassanhijazi · November 26, 2023, 12:45am

This code is meant to extract two information from one column in a dataframe, and use those as inputs to make substitutions in another string in a different column.


#data
data <- tibble(
  sequence = "KBLFFKTYT",
  mut = "Dummy (ignore); Test_2 (K1); Test-3 (K6)"
  
)

#function1
matcheR <- function(add){
str_match_all(string = add,
                           
                            pattern = "(?<add>[a-zA-Z0-9_-]+\\s?[a-zA-Z0-9_-]+?) \\((?<site>[A-Z])(?<index>\\d+)\\)") |> 
    data.frame() 
  }


#function2
replaceR <- function(sequence, replace){
  
  if (nrow(replace) == 0) {
    return(sequence)  
  }
  replacement <- as.character(replace$add)
  position <- as.integer(replace$index)
  chars_to_replace <- str_sub(sequence, position, position)
  replaced_vector <-  str_replace_all(sequence, set_names(replacement, chars_to_replace))
  
  
}

x <- data |> 
  mutate(sequence2 = map(mut, matcheR)) |> 
  mutate(sequence3 = map2_chr(.x= sequence, .y= sequence2,
                              ~ replaceR(sequence = .x, replace = .y)))

output:

> x
# A tibble: 1 × 4
#  sequence  modifications                           sequence2    sequence3          
#  <chr>     <chr>                                   <list>       <chr>              
#1 KBLFFKTYT ASL-1 (ignore); Test-2 (K1); Test2 (K6) <df [2 × 4]> Test-2BLFFTest-2TYT

But the desired output of sequence3 column is Test-2BLFFTest-3TYT.
Why the indices are not read correctly?

Thanks for the help.

technocrat · November 26, 2023, 9:40am

I'm sorry that I can't help you debug the tidyverse style code because my short-term memory is fading and I have to cut my code into bite sized pieces. I also find that doing that makes it easier to think about the cases that the code might have to handle.

# inputs
d <- c("KBLFFKTYT", "Dummy (ignore); Test_2 (K1); Test-3 (K6)")
# target output
y1 <- data.frame(V1 = c("Test_2 (K1)", "Test-3 (K6)"), add = c("Test_2", 
"Test-3"), site = c("K", "K"), index = c("1", "6"))
y2 <- "Test-2BLFFTest-3TYT"

.simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
de_seq    <- function(x,y) unlist(strsplit(x[1],tolower(y)))[2:3]
get_asi   <- function(x) {
  parts = sapply(make_y1(x),function(x) strsplit(x," ")) |> unlist()
  attributes(parts) = NULL
  add    = parts[c(1,3)]
  site   = sapply(parts[c(2,4)],function(x) gsub("[^A-Za-z]","",x))
  attributes(site)  = NULL
  index  = sapply(parts[c(2,4)],function(x) gsub("[^0-9]","",x))
  attributes(index) = NULL
  return(matrix(c(add,site,index), ncol = 3))
  }

get_tests <- function(x) {
  tests = trimws(unlist(strsplit(x[2],";")))[2:3]
  return(tests)
}
make_y1     <- function(x){
  x        = get_tests(x)
  v1       = make_hyphen(x)
  left     = unlist(strsplit(v1[1]," "))
  left[1]  = .simpleCap(left[1])
  left[2]  = toupper(left[2])
  left     = paste(left[1],left[2])
  right    = unlist(strsplit(v1[2]," "))
  right[1] = .simpleCap(right[1])
  right[2] = toupper(right[2])
  right    = paste(right[1],right[2])
  
  return(c(left,right))
}
make_hyphen <- function(x) sub("_","-",x)
prelim      <- function(x) {
  intake = sapply(x,tolower)
  attributes(intake) = NULL
  intake[2] = trimws(intake[2])
  return(intake)
}
split_tests <- function(x) {
  x                  = get_tests(x)
  left               = unlist(strsplit(x[1]," "))
  right              = unlist(strsplit(x[2]," "))
  left_test          = .simpleCap(left[1])
  right_test         = .simpleCap(right[1])
  result             = sapply(c(left_test,right_test),make_hyphen)
  attributes(result) = NULL
  return(result)
}

intake      <- prelim(d)
pieces      <- toupper(de_seq(intake,"k"))
result      <- split_tests(intake)
Sequence2 <- rbind(y1[-c(1:2),],
              cbind(make_y1(intake),get_asi(intake)))
colnames(Sequence2) = c("V1","add","site","index")
Sequence3   <- paste0(result[1],pieces[1],result[2],pieces[2])

# for further processing, as required
Sequence2
#>            V1    add site index
#> 1 Test-2 (K1) Test-2    K     1
#> 2 Test-3 (K6) Test-3    K     6
Sequence3
#> [1] "Test-2BLFFTest-3TYT"

^{Created on 2023-11-26 with reprex v2.0.2}

Hassanhijazi · November 26, 2023, 3:15pm

Great. I think your approach to split then concatenate is the way to go. It is a good moment to learn base R from your code as well. Thanks for sharing.

I was insisting on using only str_sub but obviously this does not work in my case. a for loop would do the job.

Here is the updated function and it is working for me:

replaceR <- function(sequence, replace) {
  
    if (nrow(replace) == 0) {
      return(sequence)
  }
  
  replaced_vector <- unlist(str_split(sequence, ""))
  
  
  for (i in seq_len(nrow(replace))) {
    position <- as.integer(replace$index[i])
    
    # Replace the character at the specified index
    replaced_vector[position] <- replace$add[i]
  }
  
  
  return(str_c(replaced_vector, collapse = ""))
  
}

Thank you for the inspiration.

AlexisW · November 27, 2023, 4:31pm

I think your current version with the split on each character is better, but if you wanted to replace within the string (might be more efficient for very long strings), something like that could work:

replaceR <- function(sequence, replace){
  
  if (nrow(replace) == 0) {
    return(sequence)  
  }
  replacement <- as.character(replace$add)
  position <- as.integer(replace$index)
  for(i in nrow(replace):1){
    upstream <- substr(sequence,
                       start = 1L,
                       stop = position[[i]])
    downstream <- substr(sequence,
                         start = position[[i]] + 2L,
                         stop = nchar(sequence))
    
    # update sequence
    sequence <- paste0(upstream, replacement[[i]], downstream)
  }
  sequence
}

note that I had to loop from high to low index, as the position number within the sequence changes once you start editing it.

I think ultimately str_sub() and friends are doomed, as you need to replace at specific positions, these only work on patterns (in your previous code, you're replacing any K in the input sequence).

Hassanhijazi · November 29, 2023, 1:25pm

Amazing. Thank you for enlightening me on how to use substr properly. It also works faster than str_sub. And the "loop from high to low index" is a very smart idea.

I just replaced 2L with 1L because what is more convenient to me is to concatenate the replacement position with the 'add' and not to completely replace it. This works well for my purpose.

Thanks you very much!

system · December 6, 2023, 1:26pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.