Extract strings between multiple patterns and allow for mismatches

I have thousands of DNA sequences that look like this :).

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

I need to extract every sequence between the
CTACG and CAGTC. However, many cases in these sequences come with an error
(deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?

ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", 
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#> 
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#> 
#> [[3]]
#> [1] NA

reprex()
#> Error in reprex(): could not find function "reprex"

Created on 2021-12-18 by the reprex package (v2.0.1)

Like this I would be able to extract the sequence also in the last case.

UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

I'm unclear on what string the extracted elements that do not match the start-end pattern is. This hack will get you a vector with all three, though.

library(stringr)
ref <- c(
  "CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
  "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

# define three groups
parts <- "(CTACG)(.*)(CAGTC)"

# extract middle group (first element of matrix is complete string)
res <- str_match(ref[1],parts)[3]

take_middle <- function(x) stringr::str_match(ref[x],parts)[3]

holder <- vector()
for(i in seq_along(ref)) {
  if (is.na(take_middle(i))) holder[i] = ref[i]
  else holder[i]=take_middle(i)
}
holder
#> [1] "GTTATGTACGATTAAAGAAGATCGT"           "CGTTGATATTTTGCATGCTTACTCC"          
#> [3] "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC"
2 Likes

If there was no indel, it would be pretty easy, you could get away with some simple substring() to extract at knows positions. But with indels, I don't see any simple way, you have to use some kind of alignment.

In Bioconductor, there is the package BioStrings with useful functions:

library(Biostrings)

ref <- DNAStringSet(c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
                      "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
                      "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC"))


trimLRPatterns("CCTACG", "CAGTC",
               subject = ref,
               max.Lmismatch = 1,
               max.Rmismatch = 1,
               with.Lindels = TRUE,
               with.Rindels = TRUE)
#> DNAStringSet object of length 3:
#>     width seq
#> [1]    25 GTTATGTACGATTAAAGAAGATCGT
#> [2]    25 CGTTGATATTTTGCATGCTTACTCC
#> [3]    24 GTTGATATTTTGCATGCTTACTCC

Created on 2021-12-19 by the reprex package (v2.0.1)

Here I had to change the left pattern to make it match (because trimLRpattern() expects the patterns to be on the left and right ends), this can be solved by first matching the left/right patterns (trimming anything outside), then trimming the patterns themselves. This code does it:

library(Biostrings)

ref <- DNAStringSet(c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
                      "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
                      "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC"))


trim_outside_patterns <- function(sequ, Lpattern, Rpattern){
  matched_view <- matchLRPatterns(Lpattern, Rpattern,
                                  subject = sequ,
                                  max.gaplength = 30,
                                  max.Lmismatch = 1,
                                  max.Rmismatch = 1,
                                  with.Lindels = TRUE,
                                  with.Rindels = TRUE)
  # select longest match
  
  matched_view[[which.max(width(matched_view))]]
}
trimmed <- endoapply(ref, trim_outside_patterns, "CTACG", "CAGTC")


trimLRPatterns("CTACG", "CAGTC",
               subject = trimmed,
               max.Lmismatch = 1,
               max.Rmismatch = 1,
               with.Lindels = TRUE,
               with.Rindels = TRUE)
#> DNAStringSet object of length 3:
#>     width seq
#> [1]    25 GTTATGTACGATTAAAGAAGATCGT
#> [2]    25 CGTTGATATTTTGCATGCTTACTCC
#> [3]    24 GTTGATATTTTGCATGCTTACTCC

Created on 2021-12-19 by the reprex package (v2.0.1)

There are more possibilities, if you want even more control you can check the algorithm argument of matchPattern() to do the left and right side separately, matchPdict-inexact() to use a dictionary of patterns, or even go with compareStrings() and what is described in ?lowlevel-matching.

1 Like

WOW! Thank you Alexis, thanks for the answer and the guidance...
I highly appreciate. Happy holidays

1 Like

Thank you so much for your time! I It looks so neat! highly highly appreciate!!!

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.