This is a bit of a general question. I realize that there are many different variables here. I am looking for some general guidance ...
I am trying to do some analysis on a large number of csv files. In aggregate, these files take up about 70GB of disk space. The individual csv files range from ~100MB to ~7GB. The data in these files is very similar to gapminder data, just a lot larger (see screenshot below).
I need to ask my company's IT team for more computing resources - my laptop computer has only 8GB of RAM.
My questions:
How much computing power should I ask for?
(i.e. do I need 70 GB of RAM to read in 70 GB of data? If not, how much do I need?)
What types of resources should I ask for (e.g. AWS cloud)?
Here are the types of analysis/processing that I am looking to do:
Summarize time series entries into smaller units (e.g. monthly data into quarters)
Thank you @cderv , I had not come across that thread (I guess I wasn't searching using the right keywords). Am reading through it now. This has a lot of helpful ideas for exploration.
Try data.table with foreach. By the way, the data sample you showed seems to be pretty low-frequency data; I am curious how these csv files add up to 70 GB.
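For what it's worth, a minimal sketch of that combination might look like this (the folder path, column names, and number of workers are placeholders, not something from this thread):

```r
library(data.table)
library(foreach)
library(doParallel)

# Hypothetical folder of csv files -- adjust the path and pattern to your data
files <- list.files("data/csv", pattern = "\\.csv$", full.names = TRUE)

# A few workers; each one reads and summarises one file at a time
cl <- makeCluster(4)
registerDoParallel(cl)

per_file <- foreach(f = files, .packages = "data.table") %dopar% {
  dt <- fread(f)
  # Placeholder columns: total a numeric value by item and month
  dt[, .(total = sum(value, na.rm = TRUE)), by = .(item, month)]
}

stopCluster(cl)

# Stack the per-file summaries into one data.table
result <- rbindlist(per_file)
```

Keep in mind that every worker holds its own file in memory, so parallel reads raise the peak RAM requirement.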
Thank you @Peter_Griffin I will need to spend more time learning about data.table.
FYI, the screenshot that I had shared earlier was just an example with publicly available data (the gapminder training database). The real data looks something like this anonymized version below.
For each "item description" / "model number" in these files, there are a few years of either monthly or weekly data. Each csv contains info on 200-40000 items. The data adds up!
Unless the subsets are more tractable, you may want/need to use Spark for the regressions (e.g. via sparklyr), as the matrices involved may exceed available memory (even with more). Alternatively, OLS can be parallelized, allowing bigger jobs to be split up into manageable batches. Or do the regression in SQL, which could even be run directly on the CSVs with Apache Drill.
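If it helps, one common way to batch OLS like that (a sketch only, with placeholder paths and column names; not necessarily what the linked approaches do) is to accumulate the normal-equation pieces X'X and X'y one file at a time, so the full design matrix never has to sit in memory:

```r
library(data.table)

files <- list.files("data/csv", pattern = "\\.csv$", full.names = TRUE)

# Accumulators for a regression of y on x1 and x2 (placeholder columns)
XtX <- matrix(0, nrow = 3, ncol = 3)
Xty <- matrix(0, nrow = 3, ncol = 1)

for (f in files) {
  dt  <- fread(f, select = c("y", "x1", "x2"))
  X   <- cbind(1, dt$x1, dt$x2)    # intercept plus predictors
  XtX <- XtX + crossprod(X)        # add this batch's X'X
  Xty <- Xty + crossprod(X, dt$y)  # add this batch's X'y
}

# (X'X)^-1 X'y -- the same coefficients lm() would give on the stacked data
beta <- solve(XtX, Xty)
```

Because the normal equations are additive across row blocks, the batches could just as well be farmed out to parallel workers and summed at the end.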
You'll likely find that to an extent you can throw a lot of computing power at the problem to keep doing things the way you are now on small subsets, or throw a lot of brain power at the problem to work more efficiently. The tradeoffs are not always linear and vary by person; the best option depends on what you can get comfortable with easily enough to get it to work.
When you load a .csv file (e.g. via read.csv()), it gets parsed into an R object and stored in system memory (RAM). As a rule of thumb, the parsed data takes up roughly as much memory as the csv file does on disk (depending, for example, on the data types of the columns). So yes, to read a 70 GB csv file you need roughly 70 GB of RAM.
When you process a file, R often makes copies of objects, so to actually do something useful you should have at least 2x the RAM of your dataset size (more realistically 3x), plus overhead for the operating system (~2 GB). I've worked with a 4 GB file on 8 GB of RAM and it's a pain; if you are not super careful you can crash R pretty easily.
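As a rough sanity check of that rule of thumb, you can compare a file's size on disk with the size of the parsed object (the path below is a placeholder):

```r
# Substitute one of your smaller csv files here
f <- "data/csv/sample.csv"

file.size(f) / 1024^2                  # size on disk, in MB
df <- read.csv(f)
as.numeric(object.size(df)) / 1024^2   # size in memory, in MB
```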
Solutions:
Besides being the fastest way to process tabular data in R, data.table can help you avoid making unnecessary copies of datasets, so if you are super careful you might be able to process datasets that are greater than half the size of your available RAM (though I do not recommend that). data.table is awesome, and in my opinion there is no reason not to use data.table for any task in R (slight exaggeration), but I don't think data.table alone is the solution to your problem.
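For instance, modifying by reference with `:=` avoids the extra copies a base-R or dplyr pipeline would typically make (a sketch with a placeholder path and column names):

```r
library(data.table)

dt <- fread("data/csv/sample.csv")   # placeholder path

# Convert the date column and add a quarter label by reference -- no full copy
dt[, date := as.IDate(date)]
dt[, qtr  := paste0(year(date), "-Q", quarter(date))]

# Roll monthly values up to quarters, per item
quarterly <- dt[, .(total = sum(value, na.rm = TRUE)), by = .(item, qtr)]
```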
Aggregations, joins, and simple calculations can all be done in databases, where the dataset size does not matter all that much. If you do not have access to a database, SQLite is pretty easy to set up from R to manage an on-disk file database.
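Getting that going from R takes only a few lines with DBI and RSQLite (a sketch; the database file, table, and column names are placeholders):

```r
library(DBI)
library(RSQLite)
library(data.table)

# On-disk database file; created on first use
con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")

# Load the csv files into one table, one file at a time
for (f in list.files("data/csv", pattern = "\\.csv$", full.names = TRUE)) {
  dbWriteTable(con, "sales", fread(f), append = dbExistsTable(con, "sales"))
}

# Aggregations now run inside the database, not in R's memory
monthly <- dbGetQuery(con, "
  SELECT item, month, SUM(value) AS total
  FROM sales
  GROUP BY item, month
")

dbDisconnect(con)
```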
If size is an issue, parallelisation is usually not the solution, as it tends to increase your RAM requirements even further by working on several copies of the same data.
I don't have much to say about Spark since I have never used it; if you already have a Spark cluster available, it's probably a good alternative to a database.
Thank you @hoelk !! This is very helpful indeed. Definitely going to look into SQLite. I think a combination of SQLite and data.table should get me to a workable solution here.
create charts with ggplot. I anticipate creating a library of about 1,000 charts.
Everything there except the regression and the plots consists of operations that are well suited for a database. If it were me, I would use R to build an ETL process to push all the data into a database. I'd start with Postgres on my local machine because that's easy. If that were not performant, then I would move the data to Redshift on Amazon AWS.
Then I would do all my joins and simple percentages using R and dplyr against the database. That means you write dplyr code in R but the logic gets executed in the database. Read more about that here.
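Roughly, that pattern looks like this (table and column names are placeholders; swapping the connection object is all it takes to point the same code at Postgres or Redshift):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Reusing an SQLite file here just to keep the sketch self-contained
con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")

sales <- tbl(con, "sales")   # a lazy reference to the table; nothing is pulled yet

query <- sales %>%
  group_by(item, month) %>%
  summarise(total = sum(value, na.rm = TRUE))

show_query(query)            # prints the SQL that the database will run
monthly <- collect(query)    # only the summarised result comes back into R
```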
Then I would go about figuring out how big the largest "chunk" of the data is that I want to work on. I'd do that by pulling only the records and fields I need for one single regression from the DB into R and measuring how big the resulting data frame is using pryr::object_size on each chunk. Then I'd replace that chunk and pull the next one from the DB. It may be that no single chunk is bigger than a few hundred MB, in which case you could comfortably process all your data on a moderate laptop.
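For example (again with a placeholder table, columns, and item id), pulling one chunk and measuring it might look like:

```r
library(DBI)
library(pryr)

con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")

# Only the rows and columns needed for one regression
chunk <- dbGetQuery(con, "
  SELECT month, value
  FROM sales
  WHERE item = 'ITEM-0001'
")

object_size(chunk)   # how much RAM this one chunk actually occupies

dbDisconnect(con)
```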
One word of caution: you seem to have already decided on using list columns. That's premature given that you haven't yet worked out a workflow for your data. Don't start with a method and then rework your workflow so that the method is possible; start with a workflow, then use the methods that work for your situation.