I am relatively new to RStudio Server and would really appreciate some help with a problem I am having. I have RStudio Server set up on an AWS EC2 instance (t2.large, 8 GiB memory). I am working with a large online user-behaviour dataset (object.size() = 2.3 GB). I am able to read the file into RStudio Server, but simple filtering tasks on that object then produce the following error: "The previous R session was abnormally terminated due to an unexpected crash. You may have lost workspace data as a result of this crash".
I assume this is because I have run out of usable memory. My question is whether increasing the memory on my EC2 instance will fix the problem, or whether RStudio Server has its own upper memory limit.
RStudio doesn't have a memory limit of its own, so increasing the memory of your EC2 instance would help. You may also want to look into more memory-efficient packages (ones that edit data in place and do not create copies) or on-disk approaches, such as a SQL backend or a database engine designed for big data.
Wow, thank you @andresrcs for the fast reply. Currently I am cleaning the data up, so I am making changes to the object and saving the result under the same name. Does this still create a copy of the data? Also, fully agreed that a proper SQL backend would be more efficient. That infrastructure is on the roadmap, but further down the line haha.
Depending on the packages you are using, yes, maybe. If memory allocation is your biggest concern right now, look into the data.table or disk.frame packages.
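To make the copy question a bit more concrete, here is a minimal sketch of how data.table updates a column by reference. The table and column names (`dt`, `user_id`, `clicks`) are made up for illustration and have nothing to do with the actual dataset.

```r
library(data.table)

# Hypothetical toy data standing in for the user-behaviour dataset
dt <- data.table(
  user_id = 1:6,
  clicks  = c(3L, NA, 7L, 0L, NA, 2L)
)

# Memory address of the table before the update
addr_before <- address(dt)

# := modifies the existing column by reference, so the table is not copied
dt[is.na(clicks), clicks := 0L]

# Memory address after the update; compare it with the one recorded above
addr_after <- address(dt)
same_object <- identical(addr_before, addr_after)

# Row filtering still allocates a new, smaller object for the result
active <- dt[clicks > 0L]
```

Note that it is the column update with `:=` that avoids the copy. Filtering rows still has to allocate memory for the subset, so it helps to drop intermediate objects you no longer need.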
You could simply install a SQL server on the same EC2 instance and use an R wrapper (like dbplyr) to query it from R.
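As a rough sketch of what that looks like from the R side, the example below uses an in-memory SQLite database (via the RSQLite package) purely as a stand-in for whatever SQL server you end up installing. That choice is an assumption for illustration, as are the `events` table and its columns. It also assumes the DBI, dplyr and dbplyr packages are installed. With a real server, mainly the `dbConnect()` call would change.

```r
library(DBI)
library(dplyr)

# Stand-in backend: in-memory SQLite (assumption; swap in your own driver)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table standing in for the user-behaviour data
dbWriteTable(con, "events", data.frame(
  user_id = c(1L, 1L, 2L, 3L),
  page    = c("home", "cart", "home", "cart")
))

# dbplyr translates the dplyr verbs to SQL; the work runs in the database
# and only the (small) result is pulled into R by collect()
cart_views <- tbl(con, "events") %>%
  filter(page == "cart") %>%
  count(user_id) %>%
  collect()

dbDisconnect(con)
```

The point of this design is that the full table never has to fit in R's memory, only the result of the query does.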
I've been reading up on data.table. It seems to be very different from the now commonly accepted tidyverse approach. Will using data.table instead of the tidyverse give a real improvement in memory consumption?
I don't like data.table syntax either, but for large datasets there is a noticeable improvement in both memory efficiency and speed. You can get some of the benefits without giving up dplyr syntax by using dtplyr, which provides a data.table backend for dplyr.
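For reference, a minimal sketch of the dtplyr workflow, assuming the data.table, dtplyr and dplyr packages are installed. The `events` table and its columns are hypothetical.

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Hypothetical toy data
events <- data.table(
  user_id  = c(1L, 1L, 2L, 3L, 3L, 3L),
  duration = c(10, 25, 5, 40, 15, 30)
)

# lazy_dt() wraps the table; the dplyr verbs are translated to data.table
# code and nothing is computed until as_tibble() is called
result <- events %>%
  lazy_dt() %>%
  filter(duration >= 15) %>%
  group_by(user_id) %>%
  summarise(total = sum(duration)) %>%
  as_tibble()
```

Calling `show_query()` on the pipeline before `as_tibble()` prints the data.table code that dtplyr generates, which can also be a gentle way to learn the syntax.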
As a user of both methods I find it regrettable how data.table gets undermined (even unintentionally) by tidyverse advocates. The syntax is different but simple and concise on the whole (the exceptions being for more complex actions), but the memory efficiency is the main benefit.
dtplyr is an option to combine dplyr syntax with a data.table back-end even though I don't use it myself.