Hi everyone,

I am quite new to R and have mainly been using it for the bio3d package. I have a very large trajectory file that I have not been able to analyse on my own laptop, but I have access to an HPC cluster with plenty of computational power. I am trying to run a script that makes use of large amounts of RAM across several cores.

My script currently reads as:

library(ggplot2)

library(grid)

library(plyr)

library(dplyr)

library(gridExtra)

library(extrafont)

library(bio3d)

setwd("/home/ucbecla/Scratch")

#get trajectory and pdb

trj <- read.dcd("E_Test.dcd")

pdb <- read.pdb("E_Backbone.pdb")

#coordinates: superpose every frame onto the pdb using the CA atoms

ca.inds <- atom.select(pdb, elety = "CA")

xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj, fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz)

rm(trj)

#pca_1

pc <- pca.xyz(xyz[, ca.inds$xyz], mass = pdb)

#pca cluster by groups

hc <- hclust(dist(pc$z[, 1:2]))

grps <- cutree(hc, k = 6)

dend <- as.dendrogram(hc)

rm(hc)

write.table(grps, "6_clusters.txt", sep="\t")

#Get frame number closest to centre of clusters

get_mid <- function(z, clust, groups = grps){

#rows (frame numbers) of z that belong to this cluster

idx <- which(groups == clust)

sub <- z[idx, 1:2, drop = FALSE]

#offset of each member from the cluster centre in PC1/PC2 space

rel <- sweep(sub, 2, colMeans(sub))

d2 <- rel[,1]^2 + rel[,2]^2

#map the closest member back to its frame number in the full trajectory

rep_frame <- idx[which.min(d2)]

return(rep_frame)

}

#representative frame for each of the 6 clusters

mid_frames <- sapply(1:6, function(k) get_mid(pc$z, k))

print(mid_frames)

rm(grps)

png("Dendrogram.png", width = 567, height = 473, res = 600)

#fviz_dend comes from the factoextra package, which is not loaded above

print(factoextra::fviz_dend(cut(dend, h = 250)$upper, k = 6, k_colors = c("green", "blue", "magenta", "red", "black", "purple"), type = "rectangle", ylab = "", show_labels = FALSE))

dev.off()
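
To make clear what get_mid is meant to do, here is a minimal, self-contained sketch on made-up data: for each cluster it should return the row (frame) whose first two scores lie closest to the cluster mean. The names nearest_to_centroid, scores and groups are only for illustration and are not part of bio3d.

```r
# Toy stand-in for pc$z[, 1:2]: seven "frames" in two interleaved clusters (made-up data)
scores <- rbind(
  c(0, 0),      # frame 1, cluster 1
  c(10, 10),    # frame 2, cluster 2
  c(2, 0),      # frame 3, cluster 1
  c(12, 10),    # frame 4, cluster 2
  c(1, 0.1),    # frame 5, cluster 1
  c(11, 10.5),  # frame 6, cluster 2
  c(1, 3)       # frame 7, cluster 1
)
groups <- c(1, 2, 1, 2, 1, 2, 1)

# Return the row index (in the full matrix) of the cluster member
# closest to that cluster's mean position.
nearest_to_centroid <- function(scores, groups, k) {
  idx <- which(groups == k)                 # rows belonging to cluster k
  sub <- scores[idx, , drop = FALSE]
  centre <- colMeans(sub)
  d2 <- rowSums(sweep(sub, 2, centre)^2)    # squared Euclidean distance to the centre
  idx[which.min(d2)]                        # map back to the original row number
}

print(sapply(1:2, function(k) nearest_to_centroid(scores, groups, k)))
```

The point of going through idx is that the answer is a frame number in the whole trajectory rather than a position inside the cluster subset.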

However, I think it is currently using only one of the 36 available cores, and because the trajectory has 720,018 frames it uses up all of the memory available to that core (roughly 41.5 GB of RAM). The script usually gets as far as the fit.xyz call and then fails with the error "cannot allocate memory".
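
For a sense of scale, this is my rough estimate of the memory involved, assuming everything is stored as 8-byte doubles. The atom count (n_atoms) is a placeholder and not the real size of my system, and the helper names xyz_gib and dist_gib are made up for illustration. If the arithmetic is right, the distances that dist() stores over all 720,018 frames would come to roughly 2 TB on their own, so I suspect the clustering step would also be a problem even if fit.xyz succeeded.

```r
# Size in GiB of a frames x (3 * atoms) coordinate matrix of doubles
xyz_gib <- function(n_frames, n_atoms) {
  n_frames * 3 * n_atoms * 8 / 1024^3
}

# Size in GiB of the n * (n - 1) / 2 pairwise distances that dist() stores
dist_gib <- function(n_points) {
  n_points * (n_points - 1) / 2 * 8 / 1024^3
}

n_frames <- 720018   # number of frames in my trajectory
n_atoms  <- 500      # placeholder only, not my real atom count

cat(sprintf("one copy of the coordinates: %.1f GiB\n", xyz_gib(n_frames, n_atoms)))
cat(sprintf("dist() over all frames: %.1f GiB\n", dist_gib(n_frames)))
```

Note that the trajectory and the fitted coordinates both exist in memory until rm(trj) is called, so the first figure is needed at least twice during fitting.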

Is there a way to run it across several cores so that it does not run out of memory?

Many thanks,

Christophe