predict Hydrogen bonds tool

Hello everyone!

For "training" purposes I am trying to make a tool that predicts Hbonds between two DNA strands. So strand1 (AGAGCCTGT) is always the same and the input is strand2 and the function then gives you the number of possible Hbonds between them. So, when GC comes together we have 3Hbonds, for AT 2 Hbonds, GT gives 1. This is what I have:

Bonds <- matrix(
  data = c(0,0,0,0,0,0,2,0,2,
           0,3,0,3,0,0,0,3,0
           0,0,0,0,3,3,0,0,0
           2,0,2,0,0,0,0,0,0),
  ncol = 9, byrow = TRUE, dimnames = list(LETTERS[c(1,3,7,20)],paste0('pos',1:9))
)

BondPred <- function(vec, Bonds){
  mat <- strsplit(vec, split = "")[[1]] |> as.matrix() |> `rownames<-`(paste0('pos',1:9))
  res <- vector(length = 1)
  for (i in seq.default(1,ncol(Bonds))){
    res <- res + Bonds[[which(mat[[i]] == rownames(Bonds)),i]]
  }
  return(res)
}

This gives me for example:

BondPred(c("TCTCGGACA"), Bonds)
>23

BondPred(c("TATAGGACA"), Bonds)
>17

So, I think that works. But my problem is now that some basepairs are depending on others. So for example: the sequence "TATAGGACA" gets a score of 17 (pos1 hast a AT, pos2 nothing, pos3 an TA, ...) but in this context, if an AT basepair is not stabilised at neighbouring positions, it would not form. Here the basepairs at pos1 and 2 are not stabilised and thus would actually not form. So we would have a score of 13. Its not important if this is really true in nature. I am just trying new things to get more skills in writing functions.

Does someone have an idea how I could do that? My idea was maybe to divide the sequence into tripletts (so "TATAGGACA" would be divided into TA, TAT, ATA, TAG, AGG, GGA, GAC, ACA, CA). If I would assign scores for each pos based on the triplett I could include the info about neighbouring nucleotides. However, I have no idea how to do that? Can someone help me? Or maybe also someone has a better idea how to do it?

Thanks a lot for every help :slight_smile:

You may find others with knowledge in the area of genetics fully equipped to assist you, however if you hope for assistance from the wider community with respect to a pure R challenge; then it would be fruitful to explain the nature of the calculation in terms that laymen in genetics can understand. I perceive that the way you have presented your challenge largely excludes R experts that lack knowledge in the area of hydrogen bonds from assisting you.

1 Like

You are actually right. The biology behind it is actually not important. I will open a new topic and descripe only the idea of the function without the biology. Thank you

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.