How much RAM do I need to process ~70GB of csv files?

Hello,

This is a bit of a general question. I realize that there are many different variables here. I am looking for some general guidance ...

I am trying to do some analysis on a large number of csv files. In aggregate, these files take up about 70GB of disk space. The individual csv files range from ~100MB to ~7GB. The data in these files is very similar to gapminder data, just a lot larger (see screenshot below).

I need to ask my company's IT team for more computing resources - my laptop computer has only 8GB of RAM.

My questions:

  1. How much computing resources should I ask for?
    (I.e. do I need 70GB of RAM to read in 70GB of data? If not, how much do I need?)
  2. What types of resources should I ask for (e.g. AWS cloud)?

Here are the types of analysis/processing that I am looking to do:

  • Summarize time series entries into smaller units (e.g. monthly data into quarters)
  • calculate percentages, ratios, year-over-year growth rates
  • various types of joins on the data coming from multiple csv files
  • after joining, nest by variables to create one or more list-columns
  • apply linear regression modeling on the list-columns, inspired by the "Many models" chapter of R for Data Science (see the sketch after this list)
  • create charts with ggplot2; I anticipate creating a library of about 1,000 charts
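For reference, the list-column + regression step I have in mind follows the "many models" pattern from R for Data Science; a rough sketch (joined_data and the column names item, sales, and year are just placeholders):

```r
# Rough sketch of the nest + model step (placeholder object/column names)
library(dplyr)
library(tidyr)
library(purrr)

models <- joined_data %>%      # result of the joins above (placeholder)
  group_by(item) %>%
  nest() %>%                   # one row per item, with its data in a list-column
  mutate(fit = map(data, ~ lm(sales ~ year, data = .x)))
```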

I hope that question is not too vague. Any thoughts/comments would be really helpful!

Thanks!

[screenshot: sample of the gapminder data]


Just to let you know that you might find this topic of interest, if you have not already seen it.


Thank you @cderv, I had not come across that thread (I guess I wasn't searching using the right keywords). Am reading through it now. This has a lot of helpful ideas for exploration.


Try data.table with foreach. By the way, the data sample you showed seems to be pretty low frequency data, I am curious how these csv files add up to 70 GB.
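Something along these lines, just as a sketch (paths and column names are placeholders; with 8GB of RAM you would aggregate per file rather than binding all the raw data):

```r
# Read many CSVs with data.table::fread, in parallel via foreach/doParallel.
# Keep the worker count small: each worker holds its own copy of a file in RAM.
library(data.table)
library(foreach)
library(doParallel)

files <- list.files("path/to/csvs", pattern = "\\.csv$", full.names = TRUE)

cl <- makeCluster(2)
registerDoParallel(cl)

summaries <- foreach(f = files, .packages = "data.table") %dopar% {
  dt <- fread(f)                               # much faster than read.csv
  dt[, .(revenue = sum(revenue)), by = item]   # aggregate per file (placeholder columns)
}
stopCluster(cl)

result <- rbindlist(summaries)
```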

Thank you @Peter_Griffin. I will need to spend more time learning about data.table.

FYI, the screenshot that I had shared earlier was just an example with publicly available data (the gapminder dataset). The real data looks something like the anonymized version below.

For each "item description" / "model number" in these files, there are a few years of either monthly or weekly data. Each csv contains info on 200-40000 items. The data adds up!

Unless the subsets are more tractable, you may want/need to use Spark for the regressions (e.g. via sparklyr), as the matrices involved may exceed available memory (even with more). Alternatively, OLS can be parallelized:

https://dzone.com/articles/linear-regression-with-map-reduce

allowing bigger jobs to be split up into manageable batches. Or do the regression in SQL, which could even be run directly on the CSVs with Apache Drill.

You'll likely find that to an extent you can throw a lot of computing power at the problem to keep doing things the way you are now on small subsets, or throw a lot of brain power at the problem to work more efficiently. The tradeoffs are not always linear and vary by person; the best option depends on what you can get comfortable with easily enough to get it to work.
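If you go the sparklyr route, a rough sketch (local Spark shown here; the path, table name, and model formula are placeholders):

```r
# Spark holds the data and fits the model; R just drives it
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

sales <- spark_read_csv(sc, name = "sales", path = "path/to/csvs/*.csv")

fit <- sales %>%
  filter(!is.na(revenue)) %>%
  ml_linear_regression(revenue ~ units)

summary(fit)
spark_disconnect(sc)
```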


Thank you @alistaire. This is very helpful indeed. Looks like I have to explore a few different options and figure out what works best!

To answer your questions directly:

Memory Requirements:

When you load a .csv file (e.g. via read.csv()), it gets parsed into an R object and stored in system memory (RAM). The parsed data typically takes up roughly as much memory as the csv file takes on disk (depending, for example, on the data types of the columns). So yes, to read 70GB of csv data into R you need roughly 70GB of RAM.

When you process a file, R often makes copies of objects, so to actually do something useful you should have at least 2x the RAM of your dataset size (more realistically 3x), plus overhead for the operating system (~2GB). I've worked with a 4GB file on 8GB of RAM and it's a pain; if you are not super careful you can crash R pretty easily.
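An easy sanity check for your own files is to load one and compare its in-memory size to its size on disk (the path is a placeholder):

```r
f  <- "path/to/one_file.csv"
df <- read.csv(f)

file.size(f)                            # bytes on disk
print(object.size(df), units = "auto")  # size of the parsed object in RAM
```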

Solutions:

  • besides being the fastest way to process tabular data in R, data.table can help you avoid making unnecessary copies of datasets, so if you are super careful you might be able to process datasets that are greater than half the size of your available RAM (though I do not recommend that). data.table is awesome, and in my opinion there is no reason not to use data.table for any task in R (slight exaggeration), but I don't think data.table alone is the solution to your problem.
  • aggregation, joins, and simple calculations can all be done in databases, where the dataset size does not matter all that much. If you do not have access to a database, SQLite is pretty easy to set up from R to manage an on-disk file database (see the sketch after this list).
  • If size is an issue, parallelisation is usually not the solution, as it usually increases your RAM requirements even further by working on several copies of the same data.
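A minimal sketch of the SQLite route, assuming the files share a schema (paths, table, and column names are placeholders):

```r
# Load the CSVs into an on-disk SQLite database once, then aggregate/join there
library(DBI)
library(RSQLite)
library(data.table)

con <- dbConnect(SQLite(), "sales.sqlite")

for (f in list.files("path/to/csvs", pattern = "\\.csv$", full.names = TRUE)) {
  dt <- fread(f)                           # one file at a time keeps RAM use low
  if (!dbExistsTable(con, "sales")) {
    dbWriteTable(con, "sales", dt)
  } else {
    dbWriteTable(con, "sales", dt, append = TRUE)
  }
}

# The aggregation runs inside SQLite, not in R (assumes an ISO-formatted date column)
yearly <- dbGetQuery(con, "
  SELECT item, strftime('%Y', date) AS year, SUM(revenue) AS revenue
  FROM sales
  GROUP BY item, year
")

dbDisconnect(con)
```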

I don't have much to say about Spark since I have never used it; if you have a Spark cluster already available, it's probably a good alternative to a database.


Thank you @hoelk !! This is very helpful indeed. Definitely going to look into SQLite. I think a combination of SQLite and data.table should get me to a workable solution here.


Let me draw out a point which others have implied:

You only need enough RAM to fit the largest single chunk of data you want to operate on in R.

You list a number of tasks:

  • Summarize time series entries into smaller units (e.g. monthly data into quarters)
  • calculate percentages, ratios, year-over-year growth rates
  • various types of joins on the data coming from multiple csv files
  • after joining, nest by variables to create one or more list-columns
  • apply linear regression modeling on the list-columns, inspired by the "Many models" chapter of R for Data Science
  • create charts with ggplot2; I anticipate creating a library of about 1,000 charts

Everything there except the regression and the plots are operations that are well suited for a database. If it were me, I would use R to build an ETL process to push all the data into a database. I'd start with Postgres on my local machine because that's easy. If that were not performant then I would move the data to Redshift on Amazon AWS.

Then I would do all my joins and simple percentages using R and dplyr against the database. That means you write dplyr code in R but the logic gets executed in the database. Read more about that here.

Then I would go about figuring out how big the largest "chunk" of the data is that I want to work on. I'd do that by pulling only the records and fields from the DB that I need for one single regression into R and measuring how big the resulting data frame is by using pryr::object_size on each chunk. Then I'd replace that chunk and pull the next chunk from the DB. It may be that no single chunk is bigger than a few hundred MB. In which case you could comfortably process all your data on a moderate laptop.
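A rough sketch of that workflow, assuming a local Postgres database and made-up table/column names:

```r
# dplyr verbs against the database: the SQL runs in Postgres, not in R
library(DBI)
library(RPostgres)
library(dplyr)
library(dbplyr)    # provides the SQL translation behind tbl()

con <- dbConnect(Postgres(), dbname = "sales_db", host = "localhost")

sales <- tbl(con, "sales")     # lazy reference; no data is pulled yet

by_item_year <- sales %>%
  group_by(item, year) %>%
  summarise(revenue = sum(revenue, na.rm = TRUE))   # still lazy until collect()

# Pull a single "chunk" (one item) into R and measure it
chunk <- sales %>%
  filter(item == "A123") %>%
  collect()

pryr::object_size(chunk)

dbDisconnect(con)
```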

One word of caution: you seem to have already decided on using list-columns. That's premature, given that you have not yet worked out a workflow for your data. Don't start with a method and then rework your workflow so that the method is possible. Start with a workflow, then use the methods that work for your situation.


Thank you @jdlong. You have given me some very good ideas for next steps. Will attempt Postgres as well.

As for plotting, there is also the dbplot package (https://db.rstudio.com/dbplot/), so that can also be done in a DB (to some extent, of course).
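A minimal sketch, assuming a DBI connection con like the one above (revenue is a placeholder column); dbplot pushes the binning into the database and returns a ggplot:

```r
library(dplyr)
library(dbplot)

tbl(con, "sales") %>%
  dbplot_histogram(revenue, bins = 30)
```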

