I'm working with RStudio Server hosted on a big Linux box (120 GB+ of RAM, multiple CPUs), so it seems like I should be able to handle fairly large amounts of data.
I'll share how I'm working with my datasets (~10GB) and where I'm having trouble.
I work with data.table and dplyr, doing basic operations like mutate(), group_by(), and summarize(). Yet my code is often very slow and sometimes takes down the whole 120 GB machine. A representative pipeline is sketched below.
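For concreteness, it looks roughly like this (the file and column names here are made up, but the shape is representative):

```r
library(dplyr)

# read.csv() loads the entire file into R's memory as a data.frame
df <- read.csv("big_file.csv")   # ~10GB on disk

result <- df %>%
  mutate(revenue = price * quantity) %>%      # derived column
  group_by(customer_id) %>%                   # grouping
  summarize(total_revenue = sum(revenue))     # aggregation
```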
What other ways are there to work with big datasets in RStudio Server? Do I need to spin up a database?
You can get improvements (up to a point) by switching from dplyr to data.table, and you can even keep dplyr syntax by using dtplyr. But if you are going to work with massive datasets on a regular basis, you should consider big-data-specific solutions like Spark (via the sparklyr package) or Apache Arrow (there is an R package for that too). Sketches of each follow below.
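A minimal sketch of the plain data.table route, reusing the hypothetical file and columns from your question:

```r
library(data.table)

# fread() is much faster than read.csv() and returns a data.table
dt <- fread("big_file.csv")

# Grouped aggregation in data.table's native syntax; this avoids the
# intermediate copies an equivalent dplyr pipeline can create
result <- dt[, .(total_revenue = sum(price * quantity)), by = customer_id]
```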
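If you'd rather keep the dplyr verbs, dtplyr translates them into data.table calls lazily; a sketch under the same assumptions:

```r
library(data.table)
library(dtplyr)
library(dplyr)

dt <- lazy_dt(fread("big_file.csv"))   # wrap a data.table for lazy evaluation

result <- dt %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  as_tibble()   # nothing is computed until this collection step
```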
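For the Spark route, sparklyr gives you a dplyr backend on top of Spark. Here is a sketch in local mode (the connection settings and path are placeholders for your setup):

```r
library(sparklyr)
library(dplyr)

# Local mode for illustration; on a real cluster you'd point this
# at your cluster's master URL instead
sc <- spark_connect(master = "local")

# Spark reads the file itself, so R never holds all 10GB at once
big_tbl <- spark_read_csv(sc, name = "big_data", path = "big_file.csv")

result <- big_tbl %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  collect()   # only the small aggregated result comes back into R

spark_disconnect(sc)
```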
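And the arrow package mentioned above can query larger-than-memory files directly with dplyr verbs, no Spark required; a sketch, again with placeholder paths and columns:

```r
library(arrow)
library(dplyr)

# open_dataset() scans the files lazily instead of loading them into RAM;
# it works best with Parquet, but CSV is supported too
ds <- open_dataset("big_data_dir/", format = "csv")

result <- ds %>%
  mutate(revenue = price * quantity) %>%
  group_by(customer_id) %>%
  summarize(total_revenue = sum(revenue)) %>%
  collect()   # Arrow executes the query here and returns a tibble
```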