Hardware requirements for big data study

spalan · December 18, 2020, 8:57pm

As part of a research project, I will have to analyze a dataset of 8 columns and 650 million rows. I can only work on the data for about two and a half months, so I need to make sure I'll be able to run my analyses when I get the data (in about a month's time). The analyses will involve calculating effect size estimates, z/t-statistics and corresponding p-values for different hypotheses. Sorry I cannot be more specific - part of the project is not to know the exact hypotheses in advance.

My question now is what kind of hardware would be good to work on this data. I have worked with large datasets before, but none this large. I have a modern workstation with i7 processor, sufficient disk space and currently 16GB of RAM, running Windows 10 and the latest version of RStudio. I assume RAM will be the bottleneck. Does anyone have a suggestion regarding how much I should have to work with this kind of data?

Best,
Stefan.

nirgrahamuk · December 18, 2020, 10:29pm

Assuming 8 columns of 64-bit numerics and 650 million rows, that amounts to just under 42GB.
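The arithmetic behind that figure, as a quick sketch. It is written in Python only because the calculation is language-agnostic; the variable names are illustrative.

```python
# Back-of-the-envelope size of a fully numeric table held in RAM.
ROWS = 650_000_000
COLS = 8
BYTES_PER_VALUE = 8  # one 64-bit double per cell (the assumption stated above)

total_bytes = ROWS * COLS * BYTES_PER_VALUE
size_gb = total_bytes / 1e9      # decimal gigabytes
size_gib = total_bytes / 2**30   # binary gibibytes, which is how RAM is usually sold

print(f"{size_gb:.1f} GB ({size_gib:.1f} GiB)")
```

This is only the raw data; it ignores object overhead and any temporary copies made during analysis.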

spalan · December 18, 2020, 11:11pm

Sorry, I can actually be more specific about the column classes. There are:

1 date/time
4 integers
3 booleans

I just created some sample data (using only 650,000 rows), and object_size() tells me that a data.table of these dimensions, scaled up by a factor of 1,000, would require 23GB. But that would mean that any operation that duplicates the data in memory would not work, right?
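A rough cross-check of the 23GB figure, sketched in Python for the arithmetic only. It assumes the date/time column is stored as POSIXct (an 8-byte double) and relies on R storing both integers and logicals in 4 bytes per element; the dictionary names are illustrative.

```python
# Approximate in-memory bytes per element for the R column classes in question.
# Assumptions: the date/time is a POSIXct (double), and R logicals take
# 4 bytes each, the same as integers.
BYTES = {"datetime": 8, "integer": 4, "logical": 4}
COLUMNS = {"datetime": 1, "integer": 4, "logical": 3}
ROWS = 650_000_000

bytes_per_row = sum(BYTES[kind] * count for kind, count in COLUMNS.items())
table_gb = ROWS * bytes_per_row / 1e9

# One full copy of the table (e.g. a modification that is not done in place)
# doubles the peak memory needed.
peak_with_copy_gb = 2 * table_gb

print(bytes_per_row, round(table_gb, 1), round(peak_with_copy_gb, 1))
```

Under these assumptions each row costs 36 bytes, so the table alone already exceeds 16GB of RAM, and a single duplicate roughly doubles the requirement.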

nirgrahamuk · December 18, 2020, 11:25pm

Right. I would look at whether I could summarise the data.
Also consider doing a lot of bootstrap sampling, with any given sample being only a small fraction of the total.

Either that, or pay $$$ for cloud compute
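To illustrate what "summarise the data" can mean in practice: many test statistics need only a count, mean and variance per group, and those can be accumulated in a single pass without holding the rows in memory. Below is a minimal Python sketch using Welford's online algorithm and Welch's t-statistic. The simulated stream, the group flag and all names are hypothetical stand-ins for the real data.

```python
import math
import random


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1)  # sample variance


def welch_t(a, b):
    """Welch's t-statistic computed from two sets of running summaries."""
    se = math.sqrt(a.variance / a.n + b.variance / b.n)
    return (a.mean - b.mean) / se


# Hypothetical stream of (flag, value) rows standing in for the real data.
random.seed(1)
groups = {True: RunningStats(), False: RunningStats()}
for _ in range(100_000):
    flag = random.random() < 0.5
    value = random.gauss(10.5 if flag else 10.0, 2.0)
    groups[flag].add(value)

t_stat = welch_t(groups[True], groups[False])
print(round(groups[True].mean, 2), round(groups[False].mean, 2), round(t_stat, 1))
```

Only two small summary objects survive the loop, so memory use does not grow with the number of rows. The same idea carries over to R by reading the file in chunks and updating the summaries per chunk.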

spalan · December 18, 2020, 11:54pm

Right. I will probably not use all columns at all times, so that will further cut down on memory requirements. I'll check cloud computing offerings and consider my options.

Thanks for the advice!

andresrcs · December 19, 2020, 12:53am

Instead of loading all your data into memory, you could also consider on-disk approaches: databases (you can manipulate the data with dplyr-like syntax via dbplyr), packages like disk.frame, or big-data-specific tools like Spark + Apache Arrow (which interface with R through sparklyr).
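As a sketch of the database idea: the aggregation runs inside the database engine, and only one summary row per group is returned to the session. The example uses Python's built-in sqlite3 module purely as a stand-in for whatever database you would query from R; the table name, column names and generated rows are hypothetical.

```python
import sqlite3

# In-memory database for the demo; passing a file path would keep data on disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (treated INTEGER, outcome INTEGER)")

# Hypothetical rows; real data would be bulk-loaded once from the source files.
rows = [(i % 2, 10 + (i % 7) + 2 * (i % 2)) for i in range(10_000)]
con.executemany("INSERT INTO obs VALUES (?, ?)", rows)

# The database does the aggregation; only one summary row per group comes back.
query = """
    SELECT treated,
           COUNT(*)               AS n,
           AVG(outcome)           AS mean,
           SUM(outcome * outcome) AS sum_sq
    FROM obs
    GROUP BY treated
    ORDER BY treated
"""
summary = con.execute(query).fetchall()
for treated, n, mean, sum_sq in summary:
    print(treated, n, round(mean, 3), sum_sq)
con.close()
```

With dbplyr, a grouped summarise on a remote table is translated into a query of roughly this shape, so the full table never has to fit in RAM.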

spalan · December 21, 2020, 7:48pm

Thanks @andresrcs. I have already arranged to get time on an RStudio Pro Server with 128GB RAM, but running a simple OLS regression on sample data of the same size as the data I will be working on still didn't work due to insufficient memory. I'll have to look into how to run the analyses I need with minimal memory overhead.
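One way to keep the memory overhead of OLS low is to avoid building the full model matrix and instead accumulate the sufficient statistics chunk by chunk. Below is a minimal Python sketch for a single predictor plus intercept. The data generator is hypothetical (it simulates y = 1 + 2x + noise), and the closed-form solution shown is the textbook simple-regression formula, not necessarily what any particular R function does internally.

```python
import random


def chunked_rows(n_rows, chunk_size, seed=0):
    """Hypothetical data source: yields lists of (x, y) with y = 1 + 2x + noise."""
    rng = random.Random(seed)
    for start in range(0, n_rows, chunk_size):
        size = min(chunk_size, n_rows - start)
        chunk = []
        for _ in range(size):
            x = rng.uniform(0, 10)
            chunk.append((x, 1.0 + 2.0 * x + rng.gauss(0, 1)))
        yield chunk


# Sufficient statistics for y = a + b*x; these five numbers are all we keep.
n = sx = sy = sxx = sxy = 0.0
for chunk in chunked_rows(n_rows=200_000, chunk_size=10_000):
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y

# Closed-form least-squares solution from the accumulated sums.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(round(intercept, 2), round(slope, 2))
```

Only one chunk is in memory at a time. With several predictors the same idea means accumulating X'X and X'y per chunk. Raw sums like these can lose precision on very large or badly scaled data, so centring the variables first is a sensible precaution. In R, the biglm package fits linear models incrementally along these lines.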
