My dataset looks like the reprex below, but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distance.
I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements, i.e. 2.5*10^11 entries, which is roughly 2 TB as a dense double matrix (about half that for the lower triangle alone).
I have tried the following packages on our HPC so far, but none of them can handle a matrix of that size (a minimal sketch of the stringdist call I tried is shown after the reprex below):
levenR,
Biostrings::stringDist,
stringdist::stringdistmatrix,
tidystringdist
library(tidyverse)

# Toy reprex data: all 4^5 = 1024 possible 5-mers over the DNA alphabet
x <- c("T", "A", "C", "G")
data <- expand.grid(rep(list(x), 5)) %>%   # all combinations of 5 positions
  unite("sequences", 1:5, sep = "")        # collapse the 5 columns into one string
head(data)
#> sequences
#> 1 TTTTT
#> 2 ATTTT
#> 3 CTTTT
#> 4 GTTTT
#> 5 TATTT
#> 6 AATTT
Created on 2022-02-22 by the reprex package (v2.0.1)
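For reference, the stringdist attempt was essentially the call below (shown here on the small reprex data above; on the real 5*10^5 sequences it fails because the dense distance matrix does not fit in memory):

library(stringdist)

# method = "lv" is plain Levenshtein; with a single input vector this returns
# the lower triangle of the pairwise distance matrix as a "dist" object
d <- stringdistmatrix(data$sequences, method = "lv")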
Is there any trick I can use to compute the Levenshtein distances at this scale?
Can I parallelise the process, and if so, how? Would it even make sense?
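To make the parallelisation question concrete, the direction I have in mind is roughly the sketch below: compute the distance matrix in blocks of rows, each block in a separate worker, and write each block to disk rather than holding the full matrix in memory. The block size, core count and output files are arbitrary placeholders, and whether this approach is sensible at all is part of what I am asking:

library(stringdist)
library(parallel)

seqs <- data$sequences                      # stand-in for the real 5*10^5 sequences
block_size <- 256                           # placeholder block size
blocks <- split(seq_along(seqs), ceiling(seq_along(seqs) / block_size))

# Each worker computes one block of rows of the full distance matrix
# (block_size x length(seqs)) and saves it, so the whole matrix never
# has to exist in memory at once.
invisible(mclapply(names(blocks), function(i) {
  idx <- blocks[[i]]
  d <- stringdistmatrix(seqs[idx], seqs, method = "lv")
  saveRDS(d, file = sprintf("lv_block_%s.rds", i))
}, mc.cores = 4))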
I appreciate your time. Any guidance or help is highly appreciated.