I'm doing modelling (curve fitting, outlier detection) on a biggish data set that is sitting in Snowflake. Currently I am using DBI::dbGetQuery to obtain subsets of the data, which I then save in Feather format with arrow::write_feather() and analyse using data.table and ggplot.
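For reference, the pipeline looks roughly like this (the DSN, table, and column names are simplified placeholders):

```r
library(DBI)
library(arrow)
library(data.table)

# Connect to Snowflake via ODBC (DSN name is a placeholder)
con <- dbConnect(odbc::odbc(), dsn = "snowflake_dsn")

# Pull one subset of the data into R
subset_df <- dbGetQuery(con, "
  SELECT sensor_id, ts, value
  FROM measurements
  WHERE ts >= '2023-01-01' AND ts < '2023-02-01'
")

# Cache it locally in Feather format for later sessions
write_feather(subset_df, "subset_2023_01.feather")

# Reload and summarise with data.table
dt <- as.data.table(read_feather("subset_2023_01.feather"))
summary_dt <- dt[, .(n = .N, mean_value = mean(value)), by = sensor_id]
```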
This works for a small subset of the data (about 1%) but is already slow to process, even with data.table: reading Feather files and plotting with ggplot are acceptably fast, but summarising and doing calculations in data.table is slow. Now I need to work with a bigger subset of the data and I wonder whether R is the right tool for the job. Both memory and CPU seem to be constraints.
Has anyone out there tried doing this and ended up moving to another analysis platform that can handle big tables (0.5B rows) and the calculations much faster? Am I actually limited by my PC's memory and CPU, and do I need to move to cloud processing somehow (I have no experience with that)?
I've worked with datasets as large as 500K rows by 500 columns, but I had to subset them on my reasonably well-provisioned (though not super tricked-out) laptops. RAM was a limiting factor at one point, even after relieving the ulimit constraint on how much RAM the OS will allocate (the default is usually 8GB on the Unices).
What I ended up doing was using a shell script to split the data into 25K-row chunks for processing with fread() and row-wise feature extraction. An advantage was being able to detect and repair the occasional defective line at the shell level: if a script ran on the first complete batch, it would run on the subsequent ones, provided there were no issues with some of the odder Unicode points or missing lines. It ended up taking about 2-3 hours to loop over the file list, extract my features, and append them to a receiver object, but that would have been far too long for a set of files 1000 times larger.
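A rough sketch of that loop (the chunk directory and the feature-extraction step are placeholders for the real ones):

```r
library(data.table)

# Chunk files were produced beforehand by the shell-level splitting/cleaning step
chunk_files <- list.files("chunks", full.names = TRUE)

# Placeholder for the real row-wise feature extraction
extract_features <- function(dt) {
  dt[, .(n = .N, mean_value = mean(value, na.rm = TRUE)), by = id]
}

results <- vector("list", length(chunk_files))
for (i in seq_along(chunk_files)) {
  chunk <- fread(chunk_files[i])           # fast read of one 25K-row chunk
  results[[i]] <- extract_features(chunk)  # keep only the features, not the raw rows
}

# Combine the per-chunk features into one receiver object
features <- rbindlist(results)
```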
I started off trying to use {bigmemory}, but didn't want to set up Spark/Arrow for a one-off. Here is some general guidance on strategies. If I were doing this again, I'd do as much as possible with shell streaming, including a lot of what I previously did in R scripts, so as to minimize the number of variables I had to pull into an R session at any one time.
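As a small illustration of the shell-streaming idea (the file name and column positions are made up), fread() can read from a shell pipeline directly, so filtering and column selection happen before anything is loaded into the R session:

```r
library(data.table)

# Decompress, keep three columns, and drop comment lines at the shell level;
# only the reduced stream ever reaches R (placeholder file and columns)
dt <- fread(cmd = "gzip -dc measurements.csv.gz | cut -d, -f1,3,7 | grep -v '^#'")
```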
I've started playing with dbplyr to explore, subset, and sample my big tables in Snowflake. This was reasonably successful and reduced the data to the point where I could easily do the modelling in R on my PC.
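Roughly what that looks like (the schema, table, and column names are placeholders): dbplyr keeps the table as a lazy reference and translates dplyr verbs to SQL, so the filtering and aggregation run in Snowflake and only the reduced result comes back with collect().

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "snowflake_dsn")  # placeholder DSN

# Lazy reference to the big table -- nothing is pulled into R yet
big_tbl <- tbl(con, in_schema("ANALYTICS", "MEASUREMENTS"))

# Filters and aggregations are translated to SQL and run in Snowflake
daily <- big_tbl |>
  filter(SENSOR_ID == "A17") |>
  group_by(DAY) |>
  summarise(n = n(), mean_value = mean(VALUE, na.rm = TRUE))

show_query(daily)              # inspect the generated SQL
daily_local <- collect(daily)  # only the reduced result comes back to R
```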
Different kinds of projects might not allow modelling on only a sample of the data. I suggested to our team that we explore Spark for this kind of thing, although perhaps going to Python would be better in that case, since it opens up more modelling libraries and things like PyTorch and Snowpark (which I don't really understand yet).
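From R, the Spark route would presumably look something like the sketch below via sparklyr (local mode with a toy data set just to show the interface; a real setup would point at a cluster and read from Parquet or Snowflake rather than copying a data frame in):

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation (sparklyr::spark_install())
sc <- spark_connect(master = "local")

# Toy data for illustration; in practice you'd read the big table into Spark instead
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL, so the heavy lifting stays in Spark
delays <- flights_tbl |>
  group_by(carrier) |>
  summarise(n = n(), mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

spark_disconnect(sc)
```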
I think we're still in the early days of ordinary organizations using data that isn't quite big data but is still too big to get your arms fully around in a desktop environment. Depending on the use case, I see open-source tools being better for mixing and matching during prototyping in an IDE, perhaps using large samples of the data. I find R top-rate for old-school EDA and all the well-established statistical techniques. I've been away from the Python world for a while, but I hear what you say about the depth of some of the newer stuff out there. Eventually, when deployment time comes around, if there's going to be any scale to deal with, the workflow will need re-implementation in Rust or Julia, both of which can integrate the legacy C/C++ and Fortran libraries and are speedy. What used to be a project-managed process of coordinating a large coding team is rapidly turning into code review and testing of LLM translations. Exciting times in the playground.