I'm working with RStudio Server hosted on a big Linux box (120 GB+ of RAM, multiple CPUs), so it seems like I should be able to handle fairly large amounts of data.
I'll share how I'm working with my datasets (~10GB) and where I'm having trouble.
I work with data.table and dplyr, doing basic operations like mutate(), group_by(), and summarize(). Yet my code is often very slow and sometimes takes down the whole 120 GB machine. A representative pipeline is sketched below.
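For concreteness, it looks roughly like this (the file and column names here are made up, but the shape is representative):

```r
library(dplyr)

# read.csv() loads the entire file into R's memory as a data.frame
df <- read.csv("big_file.csv")   # ~10GB on disk

result <- df %>%
  mutate(revenue = price * quantity) %>%      # derived column
  group_by(customer_id) %>%                   # grouping
  summarize(total_revenue = sum(revenue))     # aggregation
```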
What other ways are there to work with big datasets in RStudio Server? Do I need to spin up a database?
You can get improvements (up to a point) by switching from dplyr to data.table, and you can even keep dplyr syntax by using dtplyr. But if you are going to work with massive datasets on a regular basis, you should consider big-data-specific solutions like Spark (via the sparklyr package) or Apache Arrow (there is an R package for that too). Sketches of each follow below.
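A minimal sketch of the plain data.table route, reusing the hypothetical file and columns from your question:

```r
library(data.table)

# fread() is much faster than read.csv() and returns a data.table
dt <- fread("big_file.csv")

# Grouped aggregation in data.table's native syntax; this avoids the
# intermediate copies an equivalent dplyr pipeline can create
result <- dt[, .(total_revenue = sum(price * quantity)), by = customer_id]
```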
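If you'd rather keep the dplyr verbs, dtplyr translates them into data.table calls lazily; a sketch under the same assumptions:

```r
library(data.table)
library(dtplyr)
library(dplyr)

dt <- lazy_dt(fread("big_file.csv"))   # wrap a data.table for lazy evaluation

result <- dt %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  as_tibble()   # nothing is computed until this collection step
```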
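For the Spark route, sparklyr gives you a dplyr backend on top of Spark. Here is a sketch in local mode (the connection settings and path are placeholders for your setup):

```r
library(sparklyr)
library(dplyr)

# Local mode for illustration; on a real cluster you'd point this
# at your cluster's master URL instead
sc <- spark_connect(master = "local")

# Spark reads the file itself, so R never holds all 10GB at once
big_tbl <- spark_read_csv(sc, name = "big_data", path = "big_file.csv")

result <- big_tbl %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  collect()   # only the small aggregated result comes back into R

spark_disconnect(sc)
```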
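And the arrow package mentioned above can query larger-than-memory files directly with dplyr verbs, no Spark required; a sketch, again with placeholder paths and columns:

```r
library(arrow)
library(dplyr)

# open_dataset() scans the files lazily instead of loading them into RAM;
# it works best with Parquet, but CSV is supported too
ds <- open_dataset("big_data_dir/", format = "csv")

result <- ds %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  collect()   # Arrow executes the query here and returns a tibble
```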