Hello everyone!
For "training" purposes I am trying to make a tool that predicts Hbonds between two DNA strands. So strand1 (AGAGCCTGT) is always the same and the input is strand2 and the function then gives you the number of possible Hbonds between them. So, when GC comes together we have 3Hbonds, for AT 2 Hbonds, GT gives 1. This is what I have:
Bonds <- matrix(
data = c(0,0,0,0,0,0,2,0,2,
0,3,0,3,0,0,0,3,0
0,0,0,0,3,3,0,0,0
2,0,2,0,0,0,0,0,0),
ncol = 9, byrow = TRUE, dimnames = list(LETTERS[c(1,3,7,20)],paste0('pos',1:9))
)
BondPred <- function(vec, Bonds){
mat <- strsplit(vec, split = "")[[1]] |> as.matrix() |> `rownames<-`(paste0('pos',1:9))
res <- vector(length = 1)
for (i in seq.default(1,ncol(Bonds))){
res <- res + Bonds[[which(mat[[i]] == rownames(Bonds)),i]]
}
return(res)
}
This gives me for example:
BondPred(c("TCTCGGACA"), Bonds)
>23
BondPred(c("TATAGGACA"), Bonds)
>17
So, I think that works. But my problem is now that some basepairs are depending on others. So for example: the sequence "TATAGGACA" gets a score of 17 (pos1 hast a AT, pos2 nothing, pos3 an TA, ...) but in this context, if an AT basepair is not stabilised at neighbouring positions, it would not form. Here the basepairs at pos1 and 2 are not stabilised and thus would actually not form. So we would have a score of 13. Its not important if this is really true in nature. I am just trying new things to get more skills in writing functions.
Does someone have an idea how I could do that? My idea was maybe to divide the sequence into tripletts (so "TATAGGACA" would be divided into TA, TAT, ATA, TAG, AGG, GGA, GAC, ACA, CA). If I would assign scores for each pos based on the triplett I could include the info about neighbouring nucleotides. However, I have no idea how to do that? Can someone help me? Or maybe also someone has a better idea how to do it?
Thanks a lot for every help