How can I increase RStudio memory to process a large number of samples?

Hi,
I process over a thousand genomic samples in R and I need to increase R memory to complete the workflow successfully. RStudio is installed on Ubuntu, and my machine has 6 cores. Any suggestions on how to increase RStudio memory?
Also, does the paid version of RStudio come with more memory than the free one?

Thanks,
Eman

Hi Eman,
I'm not sure I fully understand your question, so a few points:

  1. Having 6 cores is nice, but cores give you processing power, not memory.
  2. It is not clear from the question whether you lack the physical memory to hold the data before the calculations start, or whether RAM runs out during the calculations.
  3. RStudio, like any other program, will use whatever memory the operating system can give it and will page to disk (swap) when needed. You can check memory use from inside R; see the sketch after this list.
  4. What coding approaches have you tried to trade extra CPU time for lower memory use?
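
A minimal sketch of checking memory use from inside an R session (the object big_table is just a placeholder for one of your own data objects):

```r
# How much memory R itself is using; also triggers garbage collection.
gc()

# Size of a specific object (placeholder data, replace with your own).
big_table <- matrix(rnorm(1e6), ncol = 100)
print(object.size(big_table), units = "MB")

# How much RAM and swap the operating system reports (Linux only).
system("free -h")
```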

Possible solutions not related to RStudio:
The first thing to try is always to free up memory: close unnecessary applications (to free RAM) and clean up the hard drive (to leave room for swap). You can also free memory inside the R session itself, as in the sketch below.
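
A minimal sketch of freeing memory from within R (the object name is a placeholder):

```r
# Placeholder for a large intermediate object you no longer need.
intermediate <- matrix(0, nrow = 1e4, ncol = 1e3)

# Remove it, then ask R to return the freed memory to the operating system.
rm(intermediate)
gc()
```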

If that does not help, more drastic measures are possible:

Install an additional SSD, or upgrade your existing one.
Install more RAM, or faster RAM with higher capacity; this might require a new motherboard.

Or alternatively:
Use an external service such as Spark or Google Colaboratory to upload the data and run the code on a server. These services have free but limited tiers, which could be a solution (Colaboratory has a monthly quota).
I have never tried them, but Posit does offer paid solutions such as Posit Connect and Posit Cloud.

Hi,

Thank you very much for your thorough response. I'm currently utilizing the DADA2 pipeline to process 16S PacBio sequences. The denoise step poses a significant challenge when employing the pooling option, as the code halts and the process is aborted. With the pooling option, the software attempts to load a large number of sequences, approximately 750GB in size, all at once. It appears that the memory capacity of R is unable to handle this load. Hence, I've resorted to using the pseudo option, although the results differ significantly.
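
For reference, the relevant part of my script looks roughly like this (object names are simplified; derep holds the dereplicated reads and err is the error model from learnErrors()):

```r
library(dada2)

# Full pooling: all samples are denoised together, so all sequences must be
# held in memory at once -- this is the step that aborts on my machine.
dd_pool <- dada(derep, err = err, pool = TRUE, multithread = TRUE)

# Pseudo-pooling: samples are processed individually in two passes, which
# needs far less memory but gives noticeably different results.
dd_pseudo <- dada(derep, err = err, pool = "pseudo", multithread = TRUE)
```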
The memory of my device is:
             total    used    free    shared  buff/cache  available
Mem:         62Gi     13Gi    7.9Gi   431Mi   41Gi        48Gi
Swap:        2.0Gi    936Mi   1.1Gi

I haven't employed any coding methods to prioritize CPU usage over memory usage. A few months back, I attempted to execute this R script on a more robust server, only to encounter failure at the same step (during the pooling code). This leads me to believe that the issue is indeed related to the memory capacity of the R software itself, irrespective of the specifications of the server where R is installed. While I'm not entirely certain, this is my understanding based on previous attempts.
So, I'm unsure which option would effectively resolve this issue: upgrading my computer, investing in a private cloud service, or utilizing an external service as you suggested. However, ultimately, I'll still be running the R script in RStudio with its limitations.

Thank you!