Hello, I am a data analyst using mainly R and SAS.
While I generally prefer to use R and tidyverse tools for my data science and programming tasks, I miss SAS datasets whenever my R data frames exhaust the available memory.
I could use a variety of R packages to handle large data (bigmemory, ff, the dplyr interface to databases, etc.), but a unified on-disk binary data format, as in SAS, has several advantages: ease of data management, no additional syntax to learn, and the ability to create many intermediate datasets without worrying about RAM (which makes debugging easy).
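To illustrate the kind of workflow I mean, here is a minimal sketch of the dplyr-to-database route, assuming the DBI, RSQLite, dplyr, and dbplyr packages are installed; the file name "analysis.sqlite" and the use of nycflights13::flights as stand-in data are just illustrative choices, not part of my actual setup:

```r
library(DBI)
library(dplyr)

# Keep the data in an on-disk SQLite file instead of an in-memory data frame
con <- dbConnect(RSQLite::SQLite(), "analysis.sqlite")
dbWriteTable(con, "flights", nycflights13::flights, overwrite = TRUE)

# tbl() returns a lazy reference; dplyr verbs are translated to SQL by dbplyr
# and executed inside the database, not in R's memory
flights_db <- tbl(con, "flights")

delay_by_carrier <- flights_db %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()  # only the small summary table is pulled into RAM

dbDisconnect(con)
```

This works, but each backend (SQLite, bigmemory, ff, ...) comes with its own setup and quirks, which is exactly the friction I would like to avoid.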
So, I want to ask the R community:
- What are your ideas, techniques, workflows, and best practices for handling out-of-memory data in R? (I assume the data is several gigabytes or larger, but not so big that it requires large-scale infrastructure for distributed computing.)
- Is there any new technology or project addressing this problem that is worth watching?