Handling big data: which tools, and when?


I'm super interested to know how people decide what tools to use to tackle a specific situation. The situation I'd like to explore here is the following:

I have a large database with tables that do not fit into memory. I want to explore the data, clean it (filter bad rows), apply models, visualise, and report on it.

  • Do you use dplyr to return subsets of data to use your standard (fit in memory) tidyverse tools?
  • Do you use Spark? (Is it actually possible to transfer a large database table directly to Spark with sparklyr?)
  • Do you just buy more memory?

What is your plan of attack, and why?
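To make the first bullet concrete: the idea is to push filtering and aggregation down to the database and pull back only a small result, which is what dplyr's database backends do under the hood. Here is a minimal sketch of that pattern using Python's sqlite3 (the table and column names are invented for illustration):

```python
import sqlite3

# An in-memory database standing in for the big one; the schema is made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 100.0), ("b", 104.0)],
)

# The aggregation runs inside the database engine; only the small
# per-sensor summary ever crosses into our process's memory.
rows = con.execute(
    "SELECT sensor, AVG(value) FROM measurements "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 102.0)]
```

The same shape in R would be a `tbl()` on a database connection plus `group_by()`/`summarise()`, with `collect()` called only on the reduced result.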



The (in-development) chunked read functions in readr look pretty interesting for this type of thing. pandas in Python has an equivalent: you can read part of a big file, do what you need, save the result, and move on to the next part, and so on. I don't know how far along the `read_*_chunked` functions are, but they'll be a welcome addition, that's for sure.
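The pandas version of that workflow looks like this; a small in-memory buffer stands in for a big file on disk, and the column names are made up for the example:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk (columns are invented).
csv_data = io.StringIO("id,value\n1,10\n2,-5\n3,7\n4,-1\n5,3\n")

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
filtered_parts = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Per-chunk work: keep only the "good" rows (value >= 0).
    filtered_parts.append(chunk[chunk["value"] >= 0])

# Combine the per-chunk results into one (now much smaller) frame.
result = pd.concat(filtered_parts, ignore_index=True)
print(result["id"].tolist())  # [1, 3, 5]
```

readr's `read_csv_chunked()` follows the same shape, taking a callback that is invoked on each chunk instead of an iterator.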
