Assigning values to letters by considering neighbours

Lina1 · October 27, 2022, 11:55am

Hello everyone!

I want to build a tool that gives me an overall score for a character string. My input is a string, and at every position each letter is assigned a certain score. In the end the function adds up all the values and returns an overall score for the input. This is what I did:

MaN <- matrix(
  data = c(0,0,0,0,0,0,2,0,2,
           0,3,0,3,0,0,0,3,0,
           0,0,0,0,3,3,0,0,0,
           2,0,2,0,0,0,0,0,0),
  ncol = 9, byrow = TRUE, dimnames = list(LETTERS[c(1,2,3,4)],paste0('pos',1:9))
)

Letter_func <- function(vec, MaN){
  mat <- strsplit(vec, split = "")[[1]] |> as.matrix() |> `rownames<-`(paste0('pos',1:9))
  res <- 0  # start the running total at numeric zero
  for (i in seq.default(1,ncol(MaN))){
    res <- res + MaN[[which(mat[[i]] == rownames(MaN)),i]]
  }
  return(res)
}

That would give me for example:

Letter_func(c("DBDBCCABA"), MaN)
>23

Letter_func(c("DADACCABA"), MaN)
>17
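
For comparison, the same per-position lookup can be written without a loop by indexing the matrix with a two-column (row, column) matrix. This is only a sketch: `position_scores()` is a hypothetical helper name, and it assumes every letter of the input appears in the row names of the score matrix.

```r
# Position-score matrix copied from the post above
MaN <- matrix(
  data = c(0,0,0,0,0,0,2,0,2,
           0,3,0,3,0,0,0,3,0,
           0,0,0,0,3,3,0,0,0,
           2,0,2,0,0,0,0,0,0),
  ncol = 9, byrow = TRUE,
  dimnames = list(LETTERS[1:4], paste0("pos", 1:9))
)

# position_scores() is a hypothetical helper, not part of the original post
position_scores <- function(string, score_mat) {
  chars <- strsplit(string, split = "")[[1]]
  # assumption: the string is no longer than the matrix is wide
  # and only contains letters that are row names of score_mat
  stopifnot(length(chars) <= ncol(score_mat),
            all(chars %in% rownames(score_mat)))
  # one (row, column) pair per position picks exactly one cell each
  idx <- cbind(match(chars, rownames(score_mat)), seq_along(chars))
  score_mat[idx]
}

position_scores("DBDBCCABA", MaN)
#> [1] 2 3 2 3 3 3 2 3 2
sum(position_scores("DBDBCCABA", MaN))
#> [1] 23
sum(position_scores("DADACCABA", MaN))
#> [1] 17
```

Keeping the per-position scores as a vector (instead of only the sum) makes it easier to look at neighbouring scores afterwards.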

That works so far. However, I would like to take this to the next level and assign a score to each position that depends on the neighbouring letters. I hope this makes sense; here is an example. If the first 3 positions are ABA, I would get a score of 3 for these 3 positions (because of the second position). However, I want the function to give a different score when the neighbouring letters scored 0. So in the ABA case I want a score of 1, because next to pos 2 there are only letters with scores of 0. If the string starts with ABD, however, I want a score of 5, because now the B has a neighbour with a score of 2.
My idea was to divide the string into triplets that include the neighbours, so that DADACCABA would be divided into DA, DAD, ADA, DAC, ACC, CCA, CAB, ABA, BA. Then for every position I could assign a score to each possible combination of these 4 letters. However, I don't know how to do that and I am very lost.
Can someone help me? Or maybe someone has a better idea for how to take neighbours into account?

I would be super thankful for any help and really want to learn!
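
The triplet split described in this post can be sketched in base R with `substring()`, which is vectorised over its start and end positions. `neighbour_windows()` is a hypothetical helper name; clipping the window at the first and last position (so the ends give two-letter chunks) is taken from the DA ... BA example in the post.

```r
# neighbour_windows() is a hypothetical helper, not part of the original post
neighbour_windows <- function(string) {
  n <- nchar(string)
  pos <- seq_len(n)
  starts <- pmax(pos - 1L, 1L)  # left neighbour, clipped at the first position
  ends   <- pmin(pos + 1L, n)   # right neighbour, clipped at the last position
  substring(string, starts, ends)
}

neighbour_windows("DADACCABA")
#> [1] "DA"  "DAD" "ADA" "DAC" "ACC" "CCA" "CAB" "ABA" "BA"
```

Each element is the window centred on one position, so element `i` could in principle be looked up in a per-position table of triplet scores.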

FactOREO · October 27, 2022, 12:56pm

Hey,

I really tried to understand your logic, but it is not fully clear to me. Take your two examples with the first 3 letters:

ABA without neighbours gives 3, because "A" is always zero at pos 1 and 3
ABA with neighbours should give you 1 - I guess it is because there are two zeros next to your value of 3
BUT
why should ABD then give a value of 5? This is the output you would get with your current function anyway, even though according to the position-value matrix the sum reads 0 + 3 + 2, so a zero neighbour is involved.

Should only following neighbours be taken into account and if they occur, the sum be reduced by 2? Or what other kind of logic is behind your example?

Maybe you can clarify it a bit, just to get a feeling of the logic you want to achieve.

Kind regards

Lina1 · October 27, 2022, 1:59pm

Thanks for taking the time trying to understand the idea.

Let's stick with the examples:
You got absolutely right what I was saying about the ABA example. Usually B at position 2 has a score of 3, but if both neighbours are 0, it gets only a 1. This is why the result should be 1.
However, if one of the neighbours is not 0, it would still get a 3. This is why the B in ABD should still get a 3, just like the function does now.
Reducing by 2 (from 3 to 1) would only apply to letters assigned a 3 when BOTH neighbours are 0.
But letters that get only a score of 2 would also be reduced by 2 (resulting in 0), and in their case they always need 2 "strong" neighbours. So, a letter that usually gets a 2 gets a 0 if even ONE of the neighbours has a 0.

It is difficult for me because the rule for letters with a 2 is different from the rule for letters with a 3. This is why I thought of dividing it into triplets, but I failed greatly.

I know it sounds super weird, which is actually one more reason why I want to do it. I think I could learn a lot with it.
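
The two rules described in this post can be sketched in base R on a plain vector of per-position scores. This is an illustration under stated assumptions, not a definitive implementation: `adjust_scores()` and its `edge_fill` argument are hypothetical names, neighbours are judged by their original (unadjusted) scores, and a missing neighbour at either end of the string is treated as non-zero.

```r
# adjust_scores() is a hypothetical helper; `edge_fill` is an assumed parameter
# giving the score used for the missing neighbour at either end of the string
adjust_scores <- function(scores, edge_fill = 1) {
  n <- length(scores)
  if (n == 0) return(scores)
  left  <- c(edge_fill, scores[-n])  # original score of the left neighbour
  right <- c(scores[-1], edge_fill)  # original score of the right neighbour
  zero_neighbours <- (left == 0) + (right == 0)

  adjusted <- scores
  # a 3 drops to 1 only when BOTH neighbours scored 0
  adjusted[scores == 3 & zero_neighbours == 2] <- 1
  # a 2 drops to 0 as soon as ONE neighbour scored 0
  adjusted[scores == 2 & zero_neighbours >= 1] <- 0
  adjusted
}

adjust_scores(c(0, 3, 0))       # per-position scores of ABA
#> [1] 0 1 0
sum(adjust_scores(c(0, 3, 2)))  # per-position scores of ABD
#> [1] 5
```

With `edge_fill = 0` the end positions would instead be treated as having a zero neighbour, which lowers a 2 at either end to 0; which behaviour is wanted is not specified in the thread.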

FactOREO · October 28, 2022, 3:52pm

Hello,

I think I've come up with a somewhat useful (and to some extent generalized) solution. It still relies on meaningful input values, and I did not include validity checks on those, but for the input from your request this will work:

MaN <- matrix(
  data = c(0,0,0,0,0,0,2,0,2,
           0,3,0,3,0,0,0,3,0,
           0,0,0,0,3,3,0,0,0,
           2,0,2,0,0,0,0,0,0),
  ncol = 9, byrow = TRUE, dimnames = list(LETTERS[c(1,2,3,4)],paste0('pos',1:9))
)

library('collapse')
#> collapse 1.8.8, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
library('data.table')

letters_to_score <- function(string, posMat = MaN, incl.neighbours = FALSE){
  # make sure row and column names exist; fall back to defaults if not provided
  if(is.null(rownames(posMat))) rownames(posMat) <- LETTERS[seq.default(1,nrow(posMat))]
  if(is.null(colnames(posMat))) colnames(posMat) <- paste0('pos', seq.default(1,ncol(posMat)))
  
  # convert to data.table object
  posMat <- posMat |> qDT(row.names.col = 'letter') |> melt.data.table(id.vars = 'letter', variable.name = 'position', value.name = 'score')
  
  # split given string into chunks + make sure to match with uppercase letters (just to be on the safe side)
  string_dt <- stringr::str_split(string, pattern = "") |>
    unlist() |>
    qDT() |>
    # add a column referring to the positions
    fmutate(position = paste0('pos',seq.default(1,nchar(string)))) |>
    setColnames(c('letter','position')) |>
    ftransform(letter = toupper(letter))
  
  ### Section if we want to include neighbours
  if (incl.neighbours){
    # join the alphabet with the string
    posMat[string_dt, on=c('letter','position'),nomatch=NULL] |>
      # add a column with shifted values for position matching
      fmutate(
        # the default is fill with NA, but this will cause issues later on
        # easy fix: assign a non-zero value as fill, since only 0s will cause action later on
        score_backward = shift(score, n = 1, type = 'lag', fill = 1L),
        score_forward  = shift(score, n = 1, type = 'lead', fill = 1L),
        # calculate the "true" score
        score2 = fcase(
          # score is 2, but there is a neighbour with score 0
          score == 2L & (score_backward == 0L | score_forward == 0L), 0L,
          # score is two and there is no 0 score neighbour
          score == 2L & (score_backward != 0L & score_forward != 0L), 2L,
          # score is 3 and there are two 0 score neighbours
          score == 3L & (score_backward == 0L & score_forward == 0L), 1L,
          # score is 3 and at most one neighbour has score 0 (this also covers the case of no 0 score neighbour)
          score == 3L & (score_backward != 0L | score_forward != 0L), 3L,
          # everything else is 0
          default = 0L)
      ) |>
      fsummarise(
        result = fsum(score2)
      )
  } else {
    # join the alphabet with the string
    posMat[string_dt, on=c('letter','position'),nomatch=NULL] |>
      fsummarise(
        result = fsum(score)
      )
  }
}

### check the function
letters_to_score(string = "ABA", posMat = MaN, incl.neighbours = TRUE)
#>    result
#> 1:      1
letters_to_score(string = "ABA", posMat = MaN, incl.neighbours = FALSE)
#>    result
#> 1:      3

Created on 2022-10-28 with reprex v2.0.2

Feel free to ask questions if something is not clear

Kind regards
