How to parallel 4 nested loops in R on Windows

Jeremy98-alt · April 25, 2020, 7:25pm

I used this code to compare all genes to each other. The code function, but the main problem is that the execution is slow. Is there anyway to make this code faster? There are four nested loop but i don't understand how to make this code in parallel.

Id_GeneNameTwoGenes <- Id_GeneName3[1:10,]

cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)

system.time({
 n   <- nrow(Id_GeneNameTwoGenes)
 clusterExport(cl,"n")
 nms <- rownames(Id_GeneNameTwoGenes)
 clusterExport(cl,"nms")
 V1  <- rep(nms[1:(n-1)],seq(from=n-1, to = 1, by = -1))
 clusterExport(cl,"V1")
 V2  <- unlist(parLapply(cl,1:(n-1), function(i)(nms[(i+1):n])))
 clusterExport(cl,"V2")

 weight <- unlist(lapply(1:(n-1), function(i) (
   sapply((i+1):n, function(j) {
       rowx <- Id_GeneNameTwoGenes[i,colnames(Id_GeneNameTwoGenes[i,which(Id_GeneNameTwoGenes[i,] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])]
       rowy <- Id_GeneNameTwoGenes[j,colnames(Id_GeneNameTwoGenes[j,which(Id_GeneNameTwoGenes[j,] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])]

       weight2 <- unlist(lapply(1:ncol(rowx), function(k) (
          sapply(1:ncol(rowy), function(w) {
          sh <- distance.matrix[colnames(rowx[k]),colnames(rowy[w])]
          if ( sh != "Inf")
            sh
          else 
            NA

       })
   )))

      mean(weight2, na.rm = TRUE)
   })
)))

similarity.matrix <- data.frame(source=Id_GeneNameTwoGenes[V1,1],dest=Id_GeneNameTwoGenes[V2,1],weight=weight)


q <- quantile(unlist(similarity.matrix), probs = c(.25, .5, .75))
filter.matrix.25 <- similarity.matrix[which(similarity.matrix$weight >= q[1]),]
filter.matrix.50 <- similarity.matrix[which(similarity.matrix$weight >= q[2]),]
filter.matrix.75 <- similarity.matrix[which(similarity.matrix$weight >= q[3]),]
})

registerDoSEQ()

Jeremy98-alt · April 26, 2020, 9:59am

I tried with the libraries like: snow, parallel, doParallel etc, but nothing.
I ran this code but in one day the programm was still running.
V1 & V2 contain 208 milion number of rows that represent genes. the loop weight compare two genes and weight2 compare for two genes (2 of 20.408) their presence ontology and catch the distance between ontologies, if there is else continue with other genes.

DavoWW · April 26, 2020, 12:28pm

Hi @Jeremy98-alt,
Welcome to the RStudio Community Forum.

I don't have any experience with this particular type of problem but here's some things I would check.

When your function is running can you see that multiple cores are being utilised (e.g. TaskManager if running under Windows)?
Are you using 64-bit R with a decent amount of RAM (minimum 16GB)?
Can you run the function with a subset of the data, and profile the code to see where the speed bottlenecks are? See ?Rprof.
Maybe post your question on a bioinformatics-related group where others may have done similar things. If comparing genes in this way is a common problem, maybe there are packages on Bioconductor that are already suitable?
This blog (among others) may give you some ideas for speed improvements:
https://datascienceplus.com/strategies-to-speedup-r-code/

HTH

nirgrahamuk · April 26, 2020, 2:00pm

I haven't analysed your code to verify , taking you at your word that

you should understand that any algorithm with such poor efficiency will die a death and parallelising wont save it.
Recommend you have a quick read around Big-O notation...
But here is a simple demo of what 4 nested loops can do to your machine.



test_speed_On4 <- function(len){
reslist<-0
for (i in 1:len)
  for(j in 1:len)
    for(k in 1:len)
      for (l in 1:len)
      {
        reslist<-c(reslist,i+j+k+l)
      }
reslist
}

library(microbenchmark)
 
 microbenchmark(n10 = n10 <-test_speed_On4(10),
               n11 = n11 <- test_speed_On4(11),
               n12 = n12<- test_speed_On4(12),
               n15 = n15<- test_speed_On4(15),
               times=3L,
               unit = 's') # seconds
 # Unit: seconds
 # expr       min         lq      mean    median        uq       max neval
 # n10 0.0865216 0.08779405 0.1294490 0.0890665 0.1509127 0.2127589     3 
 # n11 0.1904384 0.19120460 0.2304073 0.1919708 0.2503918 0.3088127     3 
 # n12 0.3723204 0.37404500 0.3757565 0.3757696 0.3774746 0.3791796     3 
 # n24 2.3749047 2.41319985 2.4416075 2.4514950 2.4749589 2.4984227     3 
 
 library(ggplot2)

 ggplot(data= data.frame(x=c(0,3,6,12,15),y=c(0,.13,.23,.38, 2.4))) + xlim(0,20)+ stat_smooth(aes(x,y),formula=(y~exp(x)),method='lm',fullrange = TRUE)

it might take 400 seconds to loop around 20.

Jeremy98-alt · April 26, 2020, 2:54pm

Hi @DavoWW
thank you so much to help me in this project!

Me too, I haven't much experience with R Studio, I'm using it for a project. I answer to your questions:

Basically, I don't see if the program executes with 12 cores, but I see that to create the V2 Variable stay 0.01 seconds!
Yes, I have 16gb of RAM, but I don't extend the memory.limit in R, I think that isn't necessary, I saw that I used only 60% of memory during execution.
I can try, but this solution doesn't permit me to solve this problem, the main problem is the weight and weight2 variables.
Where Can I post this discussion? Can you give me the link?
I saw 2 days ago, but I already use apply function, but is difficult to increase the speed of this code... I should see on bioconductor later, but I already see on internet and i can't find any solution

Jeremy98-alt · April 26, 2020, 2:59pm

Hello @nirgrahamuk,
I know about the complexity but i wanted to compare 20.000 genes and establish the grade of similarity, so first I want to compare all the genes to each other, for each gene i choose their ontologies for estabilish the degree of similarity.

But, its much slow despite I'm using apply functions.

Jeremy98-alt · April 26, 2020, 6:00pm

I systemed the code, I deleted the parallelization because to create n, nms, V1 & V2 pass 1 minute and 30 seconds, so good, but i don't know how to change the weight and weight2. V1 & V2 contain all comparison that I want to do:

Id_GeneNameTwoGenes <- Id_GeneName3[1:10,]

system.time({
  n   <- nrow(Id_GeneNameTwoGenes)
 nms <- rownames(Id_GeneNameTwoGenes)
 V1  <- rep(nms[1:(n-1)],seq(from=n-1, to = 1, by = -1))
 V2  <- unlist(lapply(1:(n-1), function(i)(nms[(i+1):n])))

weight <- unlist(lapply(1:(n-1), function(i) (
     sapply((i+1):n, function(j) {
     rowx <- Id_GeneNameTwoGenes[i,colnames(Id_GeneNameTwoGenes[i,which(Id_GeneNameTwoGenes[i,] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])]
     rowy <- Id_GeneNameTwoGenes[j,colnames(Id_GeneNameTwoGenes[j,which(Id_GeneNameTwoGenes[j,] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])]
  
     weight2 <- unlist(lapply(1:ncol(rowx), function(k) (
        sapply(1:ncol(rowy), function(w) {
            sh <- distance.matrix[colnames(rowx[k]),colnames(rowy[w])]
            if ( sh != "Inf" )
              sh
           else 
              NA
  
    })
  )))
  
  mean(weight2, na.rm = TRUE)
})
)))

similarity.matrix <- data.frame(source=Id_GeneNameTwoGenes[V1,1],dest=Id_GeneNameTwoGenes[V2,1],weight=weight)

  q <- quantile(unlist(similarity.matrix), probs = c(.25, .5, .75))
  filter.matrix.25 <- similarity.matrix[which(similarity.matrix$weight >= q[1]),]
  filter.matrix.50 <- similarity.matrix[which(similarity.matrix$weight >= q[2]),]
  filter.matrix.75 <- similarity.matrix[which(similarity.matrix$weight >= q[3]),]
})

nirgrahamuk · April 26, 2020, 6:10pm

If your data is in Id_GeneName3 you can use dput(Id_GeneName3) to transform it into a shareable form. If it's too long you can use dput(head(Id_GeneName3,n=50)) it share only the first 50 entries

Jeremy98-alt · April 27, 2020, 11:32am

Hi @nirgrahamuk,
Id_GeneName3 contains all genes that I use for comparing all genes to each other, why i should use dput? I want to compare all 20.408 genes to each other... but the computation is huge! I want to search a function or something else to reduce this computation!!!

nirgrahamuk · April 27, 2020, 11:37am

if your requirement is fundamentally complex, its impossible to overcome. You're the one with the supposed expertise in your domain. If you say that everything has to be compared with everything else, then this implies naively a On2 complexity. This by definition will not 'scale' well.
each time your double the set size , you quadruple the required computations.
You mentioned you set up something with 4 nested loops, exponentially worse...

the idea of dput is to share examples of your data with people that can examine your issue with you.

Jeremy98-alt · April 27, 2020, 1:01pm

Hi @nirgrahamuk,

in fact, i would to find a solution to make parallel the execution of these loops, where if it's possible obviously.

I would to know, how to set the loops in parallel if it's possible. I'm not an expert in R ahah

nirgrahamuk · April 27, 2020, 1:10pm

The most a parellising improvemnt could give you would be a linear speed up.
rather than running on 1 core, running on 12 cores could make code execute 12 times faster ( under perfect conditions that are typically not possible to acheive)

We should really profile your code on small example data, to estimate the current full runtime.
if the runtime is expected to be 1 day on 1 core, then you could hope that you might approach some fraction of completing in a couple of hours on 12 (even though that would be somewhat idealistic).
If the runtime is expected to be 1 year on 1 core, then running on 12 might mean you could have some justification to expect on the order of a month or two if you sucessfully parallelise.

I wouldnt waste energy parallelising without trying to make the most efficient algorithm, and this would include heuristics for avoiding comparisons wherever possible. and i wouldnt begin parallelisation before I could reasonable estimate that my code would complete in reasonable finite time, if i was succesfull at doing it.

nirgrahamuk · April 27, 2020, 1:14pm

here is an example of someone profiling code to achieve faster runtime, they had similar challenge of combinatorial explosion, but given the speed improvements we made to inner loop of the calculation the problem became tractable.

How to efficiently calculate pairwise overlaps of many sets? General

I have a list of many sets, and I want to calculate the overlap of the elements for each pairwise comparison. In my actual application, the sets contain names of genes grouped by some shared characteristic (e.g. from an annotation database such as gene ontology or reactome). But to distill the problem, the input data is a named list of character vectors. I have written a brute force approach to calculate each pairwise comparison. It works fine when the number of sets is small (<1000 sets). The problem is that these biological ontologies can have tens of thousands of sets, and the combinatorial explosion kills my brute force approach. For example, 10,000 sets requires {{10,000}\choose{2}}=49…

Jeremy98-alt · April 27, 2020, 1:31pm

@nirgrahamuk, I read the discussion, interesting.

here there's a little visualization of my data. Considere that rowx and rowy in dei weight function select only presence ontologies and next i compare all genes from my dataframe per ontologies. So, the selection of ontologies in rowx and rowy are minium not contain 22784 ontologies.

Thanks for these answers!

Jeremy98-alt · April 27, 2020, 1:48pm

At the same time in V1 and V2, I have the right combination, so it's impossible have overlaps

nirgrahamuk · April 27, 2020, 2:02pm

without having hands on any example of data its all a bit too abstract.
images only show a very limited view, and I cannot apply code to them.

Jeremy98-alt · April 27, 2020, 2:37pm

i can't show all , R can't display it @nirgrahamuk

.....

3471019 = 2292575L, 3471020 = 2292576L, 3471021 = 2292577L,
3471022 = 2292578L, 3471023 = 2292579L, 3471024 = 2292580L,
3471025 = 2292581L, 3471026 = 2292582L, 3471027 = 2292583L,
3471028 = 2292584L, 3471029 = 2292585L, 3471030 = 2292586L,
3471031 = 2292587L, 3471032 = 2292588L, 3471033 = 2292589L,
3471034 = 2292590L, 3471035 = 2292591L, 3471036 = 2292592L,
3471037 = 2292593L, 3471038 = 2292594L, 3471039 = 2292595L,
3471040 = 2292596L, 3471041 = 2292597L, 3471042 = 2292598L,
3471043 = 2292599L, 3471044 = 2292600L, 3471045 = 2292601L,
3471046 = 2292602L, 3471047 = 2292603L, 3471048 = 2292604L,
3471049 = 2292605L, 3471050 = 2292606L, 3471051 = 2292607L,
3471052 = 2292608L, 3471053 = 2292609L, 3471054 = 2292610L,
3471055 = 2292611L, 3471056 = 2292612L, 3471057 = 2292613L,
3471058 = 2292614L, 3471059 = 2292615L, 3471060 = 2292616L,
3471061 = 2292617L, 3471062 = 2292618L, 3471063 = 2292619L,
3471064 = 2292620L, 3471065 = 2292621L, 3471066 = 2292622L,
3471067 = 2292623L, 3471068 = 2292624L, 3471069 = 2292625L,
3471070 = 2292626L, 3471071 = 2292627L, 3471072 = 2292628L,
3471073 = 2292629L, 3471074 = 2292630L, 3471075 = 2292631L,
3471076 = 2292632L, 3471077 = 2292633L, 3471078 = 2292634L,
3471079 = 2292635L, 3471080 = 2292636L, 3471081 = 2292637L,
3471082 = 2292638L, 3471083 = 2292639L, 3471084 = 2292640L,
3471085 = 2292641L, 3471086 = 2292642L, 3471087 = 2292643L,
3471088 = 2292644L, 3471089 = 2292645L, 3471090 = 2292646L,
3471091 = 2292647L, 3471092 = 2292648L, 3471093 = 2292649L,
3471094 = 2292650L, 3471095 = 2292651L, 3471096 = 2292652L,
3471097 = 2292653L, 3471098 = 2292654L, 3471099 = 2292655L,
3471100 = 2292656L, 3471101 = 2292657L, 3471102 = 2292658L,
3471103 = 2292659L, 3471104 = 2292660L, 3471105 = 2292661L,
3471106 = 2292662L, 3471107 = 2292663L), class = "omit"), row.names = c(1L,
77L, 347L, 384L, 425L, 619L, 817L, 924L, 1233L, 1620L, 2133L,
2660L, 2981L, 3152L, 3297L, 3419L, 3636L, 5194L, 5436L, 5741L,
5856L, 6120L, 6388L, 6763L, 7162L, 7452L, 7710L, 7887L, 7985L,
8313L, 8690L, 8834L, 9039L, 9334L, 9723L, 10119L, 10372L, 10578L,
10749L, 11087L, 11145L, 11228L, 11423L, 11708L, 11731L, 11950L,
12187L, 12698L, 12949L, 13247L), class = "data.frame")

nirgrahamuk · April 27, 2020, 3:09pm

I'm afraid your brackets dont match up here, so that code isnt runnable.
try using head() with dput() to select a portion.

dput(head(Id_GeneName3,n=50))

Jeremy98-alt · April 27, 2020, 5:39pm

I done that @nirgrahamuk, I had that huge code and however with one row...

I systemed the code and I increase the speed, but i should preallocate rowx and rowy:

system.time({
  n   <- nrow(Id_GeneNameTwoGenes)
  nms <- rownames(Id_GeneNameTwoGenes)
  V1  <- rep(nms[1:(n-1)],seq(from=n-1, to = 1, by = -1))
  V2  <- unlist(lapply(1:(n-1), function(i)(nms[(i+1):n])))
  
  similarity.matrix <- data.frame(source=V1,dest=V2)
  
  weight <- apply(similarity.matrix, 1, function(row) {
      #preallocate list of vect
      rowx <- colnames(Id_GeneNameTwoGenes[row["source"],which(Id_GeneNameTwoGenes[row["source"],] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])
      rowy <- colnames(Id_GeneNameTwoGenes[row["dest"],which(Id_GeneNameTwoGenes[row["dest"],] == 1 & colnames(Id_GeneNameTwoGenes) != "ENTREZID")])
      
      weight2 <- sapply(tryA, function(k) (
        sapply(tryB, function(w) {
          sh <- distance.matrix[k,w]
          if ( sh != "Inf" )
            sh
          else 
            NA
          
        })
      ))
      
      mean(weight2, na.rm = TRUE)
    })
  
  similarity.matrix$weight <- weight
  similarity.matrix$source <- Id_GeneNameTwoGenes[V1,1]
  similarity.matrix$dest <- Id_GeneNameTwoGenes[V2,1]

  q <- quantile(unlist(similarity.matrix), probs = c(.25, .5, .75))
  filter.matrix.25 <- similarity.matrix[which(similarity.matrix$weight >= q[1]),]
  filter.matrix.50 <- similarity.matrix[which(similarity.matrix$weight >= q[2]),]
  filter.matrix.75 <- similarity.matrix[which(similarity.matrix$weight >= q[3]),]
})

nirgrahamuk · April 27, 2020, 7:16pm

Ok sure you had one row with however many thousands of columns. I'm sure you could figure a way to reduce it if it was a priority.

Bets of luck.