If there was no indel, it would be pretty easy, you could get away with some simple substring()
to extract at knows positions. But with indels, I don't see any simple way, you have to use some kind of alignment.
In Bioconductor, there is the package BioStrings with useful functions:
library(Biostrings)
ref <- DNAStringSet(c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
"CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC"))
trimLRPatterns("CCTACG", "CAGTC",
subject = ref,
max.Lmismatch = 1,
max.Rmismatch = 1,
with.Lindels = TRUE,
with.Rindels = TRUE)
#> DNAStringSet object of length 3:
#> width seq
#> [1] 25 GTTATGTACGATTAAAGAAGATCGT
#> [2] 25 CGTTGATATTTTGCATGCTTACTCC
#> [3] 24 GTTGATATTTTGCATGCTTACTCC
Created on 2021-12-19 by the reprex package (v2.0.1)
Here I had to change the left pattern to make it match (because trimLRpattern()
expects the patterns to be on the left and right ends), this can be solved by first matching the left/right patterns (trimming anything outside), then trimming the patterns themselves. This code does it:
library(Biostrings)
ref <- DNAStringSet(c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
"CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC"))
trim_outside_patterns <- function(sequ, Lpattern, Rpattern){
matched_view <- matchLRPatterns(Lpattern, Rpattern,
subject = sequ,
max.gaplength = 30,
max.Lmismatch = 1,
max.Rmismatch = 1,
with.Lindels = TRUE,
with.Rindels = TRUE)
# select longest match
matched_view[[which.max(width(matched_view))]]
}
trimmed <- endoapply(ref, trim_outside_patterns, "CTACG", "CAGTC")
trimLRPatterns("CTACG", "CAGTC",
subject = trimmed,
max.Lmismatch = 1,
max.Rmismatch = 1,
with.Lindels = TRUE,
with.Rindels = TRUE)
#> DNAStringSet object of length 3:
#> width seq
#> [1] 25 GTTATGTACGATTAAAGAAGATCGT
#> [2] 25 CGTTGATATTTTGCATGCTTACTCC
#> [3] 24 GTTGATATTTTGCATGCTTACTCC
Created on 2021-12-19 by the reprex package (v2.0.1)
There are more possibilities, if you want even more control you can check the algorithm
argument of matchPattern()
to do the left and right side separately, matchPdict-inexact()
to use a dictionary of patterns, or even go with compareStrings()
and what is described in ?lowlevel-matching
.