Another cool thing about sparklyr is that you can use it without a cluster or external server. The "local" mode (https://spark.rstudio.com/articles/deployment-overview.html#deployment) creates a Spark context on your laptop (Windows, Mac, or Linux). I've experimented on my laptop with "mapping" large files through Spark, so the dplyr commands I run are actually executed on disk rather than in memory.
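For example, here's a minimal sketch of that workflow (the file path and table name are just placeholders):

```r
library(sparklyr)
library(dplyr)

# Start a local Spark context on your own machine -- no cluster needed
sc <- spark_connect(master = "local")

# Map a large CSV without pulling it into memory (memory = FALSE);
# "flights" and the path are hypothetical
flights_tbl <- spark_read_csv(sc, name = "flights",
                              path = "flights_2023.csv",
                              memory = FALSE)

# These dplyr verbs are translated to Spark SQL and run against the
# on-disk data; only the summarised result comes back to R
flights_tbl %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()
```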
I then import only the data I want into the Spark cache. Another nice thing, because of how Spark works, is that I can map multiple files with the same layout as if they were a single table, so I can query across files without bringing anything into memory. Here are a couple of links that may help: https://spark.rstudio.com/articles/guides-caching.html and https://github.com/rstudio/webinars/blob/master/42-Introduction%20to%20sparklyr/sparklyr-webinar1.Rmd
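To illustrate the multi-file idea: the `path` argument accepts glob patterns, so several same-layout files can be mapped as one logical table, and you can then cache just the subset you need. The file pattern and names below are hypothetical:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# A wildcard path maps every matching file as one logical table,
# still on disk (memory = FALSE)
all_flights <- spark_read_csv(sc, name = "all_flights",
                              path = "data/flights_*.csv",
                              memory = FALSE)

# Query across all the files while the data stays on disk
delayed <- all_flights %>% filter(dep_delay > 60)

# Bring only the subset you care about into Spark's memory cache
delayed_cached <- delayed %>% compute("delayed_flights")
```

You could also cache a whole mapped table later with `tbl_cache(sc, "all_flights")`, as described in the caching guide linked above.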