I'm about to run an analysis on a pretty large dataset (~134 million observations of 8 variables) and am expecting some memory difficulties with data wrangling (I'll have 64 GB of RAM available).
Ideally I'd like to use a list column for a lot of the data which would drop the number of rows down to about 1.1 million.
My question is whether this gives any memory advantage - or do the list columns take up more memory?
So, will just one column be a list column, or will every (or almost every) column be a list column? The first case would be more likely to save memory. I would try it out with a subset of your data, and check the results using object.size or pryr::object_size.
As a toy example, using a data frame with one numeric column A and one numeric list column B with 100 entries per A row, unnesting increases the data size by about 50%.
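A comparison like the toy example above could be sketched roughly as follows. This is not the original code from the post, just a minimal base R reconstruction; the names `n_groups`, `per_group`, `nested` and `unnested` are made up for illustration, and the exact ratio you see will depend on column types and on how many entries each list element holds.

```r
set.seed(1)
n_groups  <- 1000   # hypothetical number of rows in the nested table
per_group <- 100    # entries in each list-column element

A <- as.numeric(seq_len(n_groups))
B <- replicate(n_groups, rnorm(per_group), simplify = FALSE)

# Nested form: one row per A value, B is a list column
nested <- data.frame(A = A)
nested$B <- B

# Unnested form: A is repeated once per entry of B
# (tidyr::unnest(nested, cols = B) is the tidyverse way to do this)
unnested <- data.frame(A = rep(A, lengths(B)), B = unlist(B))

nested_size   <- as.numeric(object.size(nested))
unnested_size <- as.numeric(object.size(unnested))

# Ratio of unnested to nested size
unnested_size / nested_size
```

Running the same comparison on a subset of the real data, with the real column types, should give a better estimate than any toy example.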
If you are pushed to the limit on memory you could try using data.table. Its speed is mentioned most, but I think its memory efficiency is its strongest point.
I have managed over 630 million records (3 columns, I think) with 32 GB RAM using it without any crashes.
You may be interested in this current discussion here
data.table was already mentioned by @martin.R, and I can confirm that its memory efficiency is its strongest point. data.table has a special syntax and mechanism for working on data by reference, which limits the copies made in memory. That said, its syntax is pretty different from dplyr's.
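To give a feel for what "by reference" means, here is a small sketch with made-up data (the column names `id`, `value` and `value_sq` are hypothetical). The `:=` operator adds or changes a column in place instead of building a modified copy of the whole table; `address()` is used here only to illustrate that the table object is expected to stay where it is.

```r
library(data.table)

# Hypothetical toy data: 3 groups with 4 observations each
dt <- data.table(id = rep(1:3, each = 4), value = as.numeric(1:12))

addr_before <- address(dt)

# := adds the column in place, no copy of dt is made
dt[, value_sq := value^2]

addr_after <- address(dt)

# Compare the memory address before and after the update
identical(addr_before, addr_after)

# A grouped summary on the long table, without any list column
summary_dt <- dt[, .(mean_value = mean(value)), by = id]
summary_dt
```

With a grouped summary like the last line, you may not need list columns at all, since the long table is summarised directly.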
For memory efficiency in the tidyverse, I think you can try using a database for your data. dplyr works very well with database connections; see the RStudio website about databases. Using an SQLite table with dplyr verbs can help you deal with big datasets.
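As a rough sketch of the database route, assuming the DBI, RSQLite, dplyr and dbplyr packages are installed: the table name `obs` and its columns are made up, and an in-memory database is used only to keep the example self-contained. For data that does not fit in RAM you would point `dbConnect()` at a file on disk instead.

```r
library(DBI)
library(dplyr)

# In-memory SQLite database, for illustration only
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table standing in for the real data
dbWriteTable(
  con, "obs",
  data.frame(id = rep(1:3, each = 4), value = as.numeric(1:12))
)

# A lazy reference to the table; no rows are pulled into R yet
obs_tbl <- tbl(con, "obs")

# dplyr verbs are translated to SQL and run inside the database;
# collect() brings only the small summary back into R
result <- obs_tbl %>%
  group_by(id) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  arrange(id) %>%
  collect()

dbDisconnect(con)
result
```

The point of this design is that only the summarised result has to fit in memory, not the full table.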
Looking back at the description of the data I'll be working with, it'll be 7 columns in total. I'd like to make this into 1 column and 1 list-column (at a push 2 and 1), so potentially a big saving in memory given your example - but yes, I'll need to test it.
I need a (fairly) simple summary from this data and am keen to use purrr functions on the list-column to obtain it, so hopefully this will work (and keep everything in-memory and in-tidyverse).
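For what it's worth, a summary over a list-column with purrr might look something like the sketch below. The data and the names `id`, `values`, `n` and `mean_value` are invented for illustration; the real summary will obviously differ.

```r
library(tibble)
library(dplyr)
library(purrr)

# Hypothetical nested data: one row per id, observations in a list-column
nested <- tibble(
  id     = 1:3,
  values = list(c(1, 2, 3), c(10, 20), c(5))
)

# map_int() and map_dbl() return plain atomic vectors,
# so the summary columns are ordinary (non-list) columns
summary_tbl <- nested %>%
  mutate(
    n          = map_int(values, length),
    mean_value = map_dbl(values, mean)
  )

summary_tbl
```

The typed variants (`map_int()`, `map_dbl()`) also fail loudly if an element returns something unexpected, which helps with less structured list columns.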
If not, as @martin.R and @cderv point out - there are alternatives that will work.
Options so far are good. One thing that hasn't been brought up yet is testing out your code on just a sample of the total data. Usually that works pretty well for me, though with less structured list columns something unexpected could potentially happen after you expand to the full dataset. Much easier to make sure your code does what it is supposed to on small data first then add MOAR RAM.
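One way to take such a sample, sketched in base R with made-up data (`full`, `id` and `value` are hypothetical names): sampling whole ids rather than individual rows keeps each group complete, so the sample can still be nested into list columns the same way as the full data.

```r
set.seed(42)

# Hypothetical long data: 1000 ids with 10 observations each
full <- data.frame(
  id    = rep(1:1000, each = 10),
  value = rnorm(10000)
)

# Sample 50 ids and keep every row belonging to them
sampled_ids <- sample(unique(full$id), size = 50)
small <- full[full$id %in% sampled_ids, ]

# Compare the footprint of the sample with the full data
object.size(small)
object.size(full)
```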
Hi @davidhen, another option I like is using sparklyr locally on your machine. You can create multiple tables and then relate them via dplyr joins, so that you won't need list columns. There's a bit more detail in this reply: Limitations of R