My dataset looks like the reprex below, but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distance.
I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements, i.e. 2.5*10^11 entries, which is roughly 2 TB as a dense double matrix (about half that for the lower triangle alone).
I have tried the following packages on our HPC so far, but none of them can handle a matrix of that size (a minimal sketch of the stringdist call I tried is shown after the reprex below):
levenR,
Biostrings::stringDist,
stringdist::stringdistmatrix,
tidystringdist
library(tidyverse)

# Toy reprex data: all 4^5 = 1024 possible 5-mers over the DNA alphabet
x <- c("T", "A", "C", "G")
data <- expand.grid(rep(list(x), 5)) %>%   # all combinations of 5 positions
  unite("sequences", 1:5, sep = "")        # collapse the 5 columns into one string
head(data)
#> sequences
#> 1 TTTTT
#> 2 ATTTT
#> 3 CTTTT
#> 4 GTTTT
#> 5 TATTT
#> 6 AATTT
Created on 2022-02-22 by the reprex package (v2.0.1)
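For reference, the stringdist attempt was essentially the call below (shown here on the small reprex data above; on the real 5*10^5 sequences it fails because the dense distance matrix does not fit in memory):

library(stringdist)

# method = "lv" is plain Levenshtein; with a single input vector this returns
# the lower triangle of the pairwise distance matrix as a "dist" object
d <- stringdistmatrix(data$sequences, method = "lv")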
Is there any trick I can use to compute the Levenshtein distances at this scale?
Can I parallelise the process, and if so, how? Would it even make sense?
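To make the parallelisation question concrete, the direction I have in mind is roughly the sketch below: compute the distance matrix in blocks of rows, each block in a separate worker, and write each block to disk rather than holding the full matrix in memory. The block size, core count and output files are arbitrary placeholders, and whether this approach is sensible at all is part of what I am asking:

library(stringdist)
library(parallel)

seqs <- data$sequences                      # stand-in for the real 5*10^5 sequences
block_size <- 256                           # placeholder block size
blocks <- split(seq_along(seqs), ceiling(seq_along(seqs) / block_size))

# Each worker computes one block of rows of the full distance matrix
# (block_size x length(seqs)) and saves it, so the whole matrix never
# has to exist in memory at once.
invisible(mclapply(names(blocks), function(i) {
  idx <- blocks[[i]]
  d <- stringdistmatrix(seqs[idx], seqs, method = "lv")
  saveRDS(d, file = sprintf("lv_block_%s.rds", i))
}, mc.cores = 4))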
I appreciate your time. Any guidance or help is highly appreciated.