Detecting complete and partial patterns in a vector of numbers


I am looking for ways to find patterns in a set of numbers. Each number can only take a value from 1 to 5. Unfortunately, I don't know all the ways in which patterns will present themselves, but I want to at least pick up partial patterns and quantify them.

Below I have added some sequences. Any idea how to approach this?

#should find the repetition of 1,2,3

#should find the repetition of 3,2,1

#should find the 3,2 and 5,3 repetition

#should find the larger pattern 1,4,5,4,1 and/or 5,4,1 repeating

#should not find any patterns 


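library(purrr)     # map() and set_names()
library(magrittr)  # %>%

# Return the unique n-grams of length `len` that occur more than once in
# `numvec`, or NA if none repeat.
get_ngram <- function(numvec, len) {
  result <- NA

  # number of sliding windows of length `len`
  inside_count <- length(numvec) - len + 1

  if (inside_count > 1) {
    first_pass <- list()
    iloop <- seq_len(inside_count)
    for (i in iloop) {
      first_pass[[i]] <- numvec[i:(i + len - 1)]
    }
    # windows that appear at least twice, each listed once
    second_pass <- unique(first_pass[duplicated(first_pass)])
    # the tail of this function is missing from the post; returning the
    # repeated windows (or NA when there are none) is an assumed completion
    if (length(second_pass) > 0) result <- second_pass
  }
  result
}

# Run get_ngram() for every length from 2 up to length(numvec) - 1.
get_ngrams <- function(numvec) {
  l <- 2:(length(numvec) - 1)
  map(l, ~ get_ngram(numvec, .)) %>% set_names(paste0("length_", l))
}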



#should find the repetition of 3,2,1

#should find the 3,2 and 5,3 repetition

#should find the larger pattern 1,4,5,4,1 and/or 5,4,1 repeating
#actually only 5,4,1 repeats; there is no repeated 1,4,5,4,1 pattern to find

#should not find any patterns 
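
Since the question also asks to quantify the patterns, here is a minimal base-R sketch of the same sliding-window idea that additionally counts how often each repeated n-gram occurs. The helper name `count_ngrams` and the example vector are made up for illustration; they are not from the thread.

```r
# Hypothetical helper (not from the thread): count how often each n-gram of
# length `len` occurs and keep only those seen more than once.
count_ngrams <- function(numvec, len) {
  n_windows <- length(numvec) - len + 1
  if (n_windows < 1) return(integer(0))

  # Encode every window as a string key such as "1-2-3"
  keys <- vapply(
    seq_len(n_windows),
    function(i) paste(numvec[i:(i + len - 1)], collapse = "-"),
    character(1)
  )

  # Turn the frequency table into a plain named integer vector
  tab <- table(keys)
  counts <- setNames(as.integer(tab), names(tab))

  # Keep repeated n-grams only, most frequent first
  sort(counts[counts > 1], decreasing = TRUE)
}

# Made-up example vector
x <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
count_ngrams(x, 3)
```

Overlapping windows are counted here, just as in `get_ngram()`, so a rotation of the repeating block (for example 2,3,1) also shows up as a repeat.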

Ahhh you're amazing! It is like you have answers to everything :rofl:

I understand your point regarding the c(1,4,5,4,1) example. Is there some way to get R to work out which number should likely come next in a set, so that if we had, say, c(1,4,5,4,x), it would substitute 1 for x? I want to be able to find less obvious patterns like that too. N-grams in general make a lot of sense for this. I think I am going to use your solution and also flip the set around to read it from right to left (given that the numbers are not parts of a word here).
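
One way to read the c(1,4,5,4,x) example is as a mirror-symmetric pattern, that is, one that reads the same from right to left. Under that assumption (which is only one interpretation of "less obvious"), a rough sketch could look for windows that equal their own reverse and fill a missing value from its mirror position. Both helper names are hypothetical.

```r
# Hypothetical helper: every distinct window of length `len` that reads the
# same forwards and backwards.
find_mirrored <- function(numvec, len) {
  n_windows <- length(numvec) - len + 1
  if (n_windows < 1) return(list())
  windows <- lapply(seq_len(n_windows), function(i) numvec[i:(i + len - 1)])
  is_mirror <- vapply(windows, function(w) identical(w, rev(w)), logical(1))
  unique(windows[is_mirror])
}

# Hypothetical helper: fill NA entries from the mirrored position, assuming
# the whole vector is meant to be symmetric.
fill_by_symmetry <- function(x) {
  missing <- is.na(x)
  x[missing] <- rev(x)[missing]
  x
}

# Made-up example vectors
find_mirrored(c(2, 1, 4, 5, 4, 1, 3), 5)  # the window 1,4,5,4,1 is symmetric
fill_by_symmetry(c(1, 4, 5, 4, NA))       # the NA is replaced by 1
```

This only covers symmetry; it says nothing about other kinds of "next number" prediction.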

I'm afraid I don't really follow what you are asking.
It seems like the idea is to go beyond matching on repeated patterns to some definition of an almost-detectable pattern? You could brute-force solutions for that, but I would only think it's worth trying if the vectors you analyse are not much longer than this, because it would scale very poorly.
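
To make the brute-force idea concrete, here is a minimal sketch under one possible definition of "almost": two windows of the same length count as a near-repeat if they differ in at most `max_diff` positions. The helper `near_repeats` and that definition are assumptions for illustration, not something proposed in the thread.

```r
# Hypothetical helper: compare every pair of windows of length `len` and keep
# the pairs that differ in at most `max_diff` positions.
near_repeats <- function(numvec, len, max_diff = 1) {
  empty <- data.frame(start_a = integer(0), start_b = integer(0),
                      mismatches = integer(0))
  n_windows <- length(numvec) - len + 1
  if (n_windows < 2) return(empty)

  windows <- lapply(seq_len(n_windows), function(i) numvec[i:(i + len - 1)])

  # Every pair of start positions; this is the quadratic part
  pairs <- t(combn(n_windows, 2))
  mismatches <- apply(pairs, 1, function(p) {
    sum(windows[[p[1]]] != windows[[p[2]]])
  })

  keep <- mismatches <= max_diff
  data.frame(start_a = pairs[keep, 1], start_b = pairs[keep, 2],
             mismatches = mismatches[keep])
}

# Made-up example: 1,2,3,4 and 1,2,5,4 differ in one position
near_repeats(c(1, 2, 3, 4, 1, 2, 5, 4), len = 4)
```

The number of window pairs grows quadratically with the vector length, and repeating this for every window length adds another factor, which is why this is only reasonable for short vectors.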

Yes, that would be it: basically a detectable pattern. In some cases a vector will be up to 20 values long. I was hoping there was some sort of mathematical solver, or a way to run some clever set of diffs, to derive that set. I suppose fitting an lm or similar wouldn't work, since you can't know the shape of the line beforehand or readily find a way to solve for it?
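
As a rough illustration of what a diff-based check could look like, the sketch below uses base R's `diff()` and `rle()` to find stretches where consecutive values change by a constant step (for example 1,2,3,4 or 5,4,3). The helper `step_runs` is hypothetical, and this catches only constant-step runs, not patterns in general.

```r
# Hypothetical helper: positions where the vector moves by the same step for
# at least `min_steps` consecutive steps.
step_runs <- function(numvec, min_steps = 2) {
  runs <- rle(diff(numvec))
  # Index of the first element of each run in the original vector
  start <- cumsum(runs$lengths) - runs$lengths + 1
  keep <- runs$lengths >= min_steps
  data.frame(
    start = start[keep],
    end   = start[keep] + runs$lengths[keep],  # a run of k steps spans k + 1 values
    step  = runs$values[keep]
  )
}

# Made-up example: 1,2,3,4 rises by 1 and 5,4,3 falls by 1
step_runs(c(1, 2, 3, 4, 2, 5, 4, 3))
```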

Currently, I am thinking of stringing each full set together as a "hash" of sorts and comparing that directly to the others. With the number of combinations etc. I should also see a fair spread.
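
A minimal sketch of that "hash" idea, assuming the sets are held in a named list: each vector is collapsed into a string key, and identical keys are grouped. The names `sets`, `keys`, and `groups` and the example data are made up. Because every value is a single digit from 1 to 5, no separator is needed to keep the keys unambiguous.

```r
# Made-up example data: four sets, two of which are identical
sets <- list(
  s1 = c(1, 2, 3, 1, 2, 3),
  s2 = c(3, 2, 1, 3, 2, 1),
  s3 = c(1, 2, 3, 1, 2, 3),
  s4 = c(5, 4, 1, 2, 2, 3)
)

# Collapse each set into one string, e.g. "123123"
keys <- vapply(sets, paste, character(1), collapse = "")

# How often each full set occurs
table(keys)

# Which sets share the same key
groups <- split(names(sets), keys)
groups
```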
