Handling huge data frames

Hello,
I am considering different approaches for processing data frames with a huge number of rows. By huge I mean something like 150 million rows with ~20 columns. A few questions come to mind; I'll start with the two major ones:

  1. Using factors:
    Suppose that some of my columns have significantly fewer than 150 million distinct values. By significantly fewer I'd say at least a factor of ten less, but sometimes a factor of thousands.
    It would make sense to me that making these columns factors should significantly speed up operations that apply to the full column, as the operation can be applied only to the levels of the factor without touching all 150 M rows.
    To my surprise, this is not always the case: trying various transformations, like parsing a timestamp string, or even an operation on a POSIXct date, is not always faster with factors than with the column being a regular character (or POSIXct) type.
    What am I missing? Is it not always worthwhile to use factors when the number of distinct values is significantly smaller than the number of values stored in a column?
    (I understand that looking up the value in a particular row might be less efficient because of the level of indirection, but I am surprised that operating on a full column is not more efficient with factors.)

  2. Splitting the data frame to a few smaller data frames:
    Suppose that I can divide the data frame based on some categorization, e.g. year and month of a sample, or any other grouping criterion. It is then trivial to split the data frame into many smaller ones based on that criterion and, say, store them all as a list of smaller data frames.
    I assume that this would probably help when I'd like to run an operation only on a few of the groups.
    My question is w.r.t. full-data-frame operations: would it be more efficient to run an l*ply operation to apply a function to all rows, group by group, rather than keep the data frame as one huge piece and apply the operation to it directly?
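To illustrate what I mean with a toy example (the column and group names are made up):

```r
set.seed(1)
n  <- 1e5
df <- data.frame(value = rnorm(n),
                 grp   = sample(month.abb, n, replace = TRUE))

## one vectorized pass over the full column:
r_whole <- df$value * 2

## the same operation, group by group (extra split/recombine overhead):
pieces  <- split(df$value, df$grp)
r_split <- unlist(lapply(pieces, function(x) x * 2), use.names = FALSE)
```

For a trivial operation like this, the split version clearly adds split()/unlist() overhead; my question is whether that picture changes for heavier operations on really large data.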

That's it for now.
Thanks!

Have you considered parking the data in a database? If you only work with part of a column at a time, rather than needing to futz with all 150M rows at once, this could be faster.


Thanks for your response, @prubin. I am not sure I understand what you mean by a database, and whether it is a particular database supported in R. In general, though, for my use case there is no single criterion I'd like to divide the data across. Think, for example, that I have the quotes of 1000 equities at minute resolution over a year. In some cases I'll want to see the behavior of a few (say 10) equities over the whole year -- that's 1% of the data. In other cases I may want to summarize the behavior of all equities in a particular month -- that's ~8% of the whole data.
So, while I can easily choose one criterion to divide the data by (e.g. quote month), I will usually need to go month by month and subset or summarize the data I want in each. If my data frame weren't that big, I could have easily done it with ddply, having the flexibility to choose a different criterion each time. If I need to decide upfront on a criterion by which to split the data frame, then I may need to use ldply, with a subset or summarize applied to each node, in order to achieve the same result.

Thank you

I think I might have found the reason why factors do not always help to achieve better performance. It seems that factor levels are always strings (character type). So, if for example I have a column of type Date with thousands of rows per value, factorizing it would mean transforming the dates to character, then transforming them back to Date whenever I need to apply a date-related operation over the full column -- and then transforming back to strings to store as the levels of the factor. That can explain why factoring non-string data may sometimes result in worse performance than working with unfactored data.

My understanding is that factors are stored internally as integers (index of the factor level). The strings are just external representations.
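A quick illustration:

```r
f <- factor(c("b", "a", "b"))

typeof(f)       # "integer": the data is stored as level indices
as.integer(f)   # 2 1 2
levels(f)       # "a" "b"  -- the levels themselves are character
```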

Regarding using a database, what I meant was to install a database server (such as MySQL, MariaDB or PostgreSQL) on your machine and store the data in that (either as one big table or as separate tables, depending on what makes the most sense). You would then query the database to extract bits and pieces for analysis. R has libraries for interacting with various types of databases. So, for example, to look at quotes for 10 securities in a particular month, you might run 10 SQL queries (one per security) of the form "SELECT Quote FROM Table WHERE Date >= d1 AND Date <= d2 AND Security = s". I'm assuming you have a table named Table with columns named Quote, Date and Security. I'm using "s" for the security of interest and [d1, d2] for the time interval of interest. It's also possible to grab a table of quotes and security names for 10 selected securities in one query.
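For illustration, here's a hedged sketch using DBI with an in-memory SQLite database standing in for a real server; the table and column names follow my made-up example above, and it assumes the DBI and RSQLite packages are installed:

```r
if (requireNamespace("RSQLite", quietly = TRUE)) {
  library(DBI)

  con <- dbConnect(RSQLite::SQLite(), ":memory:")
  dbWriteTable(con, "quotes",
               data.frame(Security = c("AAA", "BBB"),
                          Date     = c("2024-01-02", "2024-02-02"),
                          Quote    = c(101.5, 42.0)))

  ## parameterized version of the query sketched above:
  res <- dbGetQuery(con,
    "SELECT Quote FROM quotes
     WHERE Date >= ? AND Date <= ? AND Security = ?",
    params = list("2024-01-01", "2024-01-31", "AAA"))

  dbDisconnect(con)
}
```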

One reason to consider an external database is that the folks who write database software optimize it for handling large databases.

Factors: You're right, it is the levels that are of character type, and the data frame only holds the index into the levels. My point, though, was that if your column holds something other than strings -- e.g. dates, numbers, timestamps -- then you may be better off keeping the actual data in the data frame rather than in factor levels, as the values lose their type once stored as levels. So, even though you operate on the levels, the operation may become much more expensive, as it will require re-creating the original object (say POSIXct) from the string, applying the operation to that object, and then casting it back to a string. There is probably some threshold -- where the casting overhead is worth operating only on the levels -- but my point was that there is some overhead to using factors for vectors of non-string types.
As for databases: thanks. I'm not very familiar with those, and in particular with the ability to query them based on different criteria (or a conjunction of criteria) -- basically the equivalent of ddply on a large table that lets me slice it via a combination of different column values. Based on your example, it seems that databases are capable of doing that (and more). However, I'd assume it comes with the price that you do not have all the data in cache/memory, and need to access the disk/server when searching through the table. My table is huge, but still fits in memory. Once I pull out of it the data I need, I can save the table to disk and work with the chunk I care about. Every time I need to chunk based on a different criterion, I'll need to read the whole table from disk into memory, though.
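To illustrate the round trip I mean, in base R:

```r
## a Date column, factorized: the levels come out as character,
## so any "levels-only" date operation pays for parse + re-format.
d <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-01"))
f <- factor(d)

class(levels(f))   # "character" -- the Date type is gone

## shifting every date by one day "on the levels only":
## parse the levels back to Date, operate, cast back to character.
levels(f) <- as.character(as.Date(levels(f)) + 1)
```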

Some code examples might give others a better idea of what you are dealing with or what kind of processing you are doing. Note that even if you are just turning (non-date/time) strings into factors, you might not gain much, as R keeps unique strings in a global string pool and character vectors are basically vectors of pointers. For example, rep("foo", 100) results in a vector of 100 identical pointers and a single "foo" instance in memory; this is not too different from how unordered factors are handled, but with less overhead.


Performance-wise, you'd probably gain more by focusing on your use and choice of string and date/time handling routines. If you currently happen to use methods from base R (or Tidyverse packages), it might be worth looking around for more performant alternatives: for example, stringi or stringfish for strings. For date/time, consider evaluating clock, anytime or fasttime.
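For example, a hedged sketch comparing base parsing with fasttime (assumes fasttime is installed and that the timestamps are UTC in a fixed format):

```r
ts <- rep("2024-03-01 09:30:00", 1e4)

## base R parser:
t_base <- system.time(p_base <- as.POSIXct(ts, tz = "UTC"))

## fasttime's C parser, if available:
if (requireNamespace("fasttime", quietly = TRUE)) {
  t_fast <- system.time(p_fast <- fasttime::fastPOSIXct(ts, tz = "UTC"))
}
```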

You might also want to prefer more specialized methods (e.g. if you care only about the first match, stop early; and don't use regex methods where fixed strings would do; if stringi::stri_replace_first_coll() does what you need, prefer that over stringi::stri_replace_all_regex(); when parsing date/time values, use multi-format parsers only when really needed).
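For instance (a sketch assuming stringi is installed; the strings are invented):

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- rep("price_usd=100;price_eur=90", 1000)

  ## literal first-match replacement, no regex engine involved:
  a <- stringi::stri_replace_first_fixed(x, "price_usd", "px_usd")

  ## same result here via the heavier all-matches regex variant:
  b <- stringi::stri_replace_all_regex(x, "^price_usd", "px_usd")
}
```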

For a collection of performance-oriented packages, you could check fastverse.


Split-apply-combine could make a huge difference if you can, and plan to, parallelize the process. "Could", as it really depends on what you are dealing with and what the actual task is. For example, if your parallel tasks are really short but workers need to send back a relatively large amount of data instead of small aggregates, total time might actually increase due to the introduced overhead. Your system can also play a role here: R process forking is not supported on Windows, and depending on your choice of parallel framework, this can affect worker startup and data exchange.

You also have a few plyr references here and there; the package status is retired, and you'd likely get all the functionality with dplyr + purrr (if you lean towards the Tidyverse). With purrr it would also be easy to switch to current parallel processing options, be it through the Futureverse packages or with purrr + mirai (purrr::in_parallel()). Or just use mirai directly, without purrr.


Your post is also tagged with dplyr, which can be somewhat vague in this context. dplyr (the package) is not always considered the best choice for performance-oriented tasks and/or bigger datasets; that is more in the domain of data.table and perhaps collapse. On the other hand, dplyr(-like) syntax, because it is verbose and easy to read and write for humans, can be used with quite a few different packages and backends: for example, dtplyr or tidytable to use dplyr syntax with data.table; or dbplyr to use your dplyr pipelines and many common methods with SQL databases, both remote and local, without actually writing any SQL; or arrow, to manage and process larger (and larger-than-memory) datasets. collapse syntax can also be quite close.
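For example, a small dtplyr sketch (assumes dplyr and dtplyr are installed; the data is made up):

```r
if (requireNamespace("dtplyr", quietly = TRUE)) {
  library(dplyr, warn.conflicts = FALSE)

  ## dplyr verbs are translated to data.table code behind the scenes:
  ldt <- dtplyr::lazy_dt(data.frame(grp = c("a", "a", "b"),
                                    val = c(1, 2, 3)))
  res_dt <- ldt |>
    group_by(grp) |>
    summarise(total = sum(val)) |>
    as_tibble()       # materialize the data.table result back into R
}
```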

The issue with all these abstractions is that you can get quite far with just dplyr syntax without bothering about the internals or backends those packages actually use, but at some point you will likely encounter a problem that requires you to know at least some data.table or SQL. So there's this paradox: people with data.table / SQL skills can work around most limitations and can come up with dplyr pipelines that run on high-performance backends and are also easy to read by those only familiar with dplyr; but with only a dplyr background it is easy to get stuck and/or accidentally trigger avoidable data copies and end up with sub-optimal performance.

My personal preference for larger (and larger-than-memory) datasets is DuckDB, an analytical in-process SQL database, and a fast one. It might sound intimidating, but in-process means that it is essentially effortless to install and use; for many common use cases there's no persistent DuckDB database at all -- just use it to open existing CSV / TSV / JSON / Parquet / ... file(s), do your data thing, and write results directly to file(s) or pull them into R as a data.frame. It's already widely used for many types of data tasks, meaning it's easy to find help (at least for more generic DuckDB / SQL questions, maybe a bit less so for anything dplyr-related that involves translation).

Perhaps start with duckplyr and its vignettes and examples; its aim is to become a drop-in replacement for dplyr. The earlier warning about dplyr syntax as an abstraction layer still applies, and one should try to process as much as possible in DuckDB and avoid early materialization in R. Knowing SQL helps to fully use DuckDB's potential, but one can get quite far with just duckplyr; it provides access to most Functions – DuckDB in one way or another.
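For a taste of the in-process workflow, here's a hedged sketch that lets DuckDB aggregate a CSV file without loading it into R first (assumes the duckdb package is installed; the toy file stands in for a real quotes dump):

```r
if (requireNamespace("duckdb", quietly = TRUE)) {
  library(DBI)

  ## toy CSV standing in for a real quotes file:
  csv <- tempfile(fileext = ".csv")
  write.csv(data.frame(security = c("AAA", "AAA", "BBB"),
                       quote    = c(100, 102, 50)),
            csv, row.names = FALSE)

  ## DuckDB scans and aggregates the file; only the result reaches R:
  con <- dbConnect(duckdb::duckdb())
  res_db <- dbGetQuery(con, sprintf(
    "SELECT security, avg(quote) AS avg_quote
     FROM read_csv_auto('%s') GROUP BY security ORDER BY security", csv))
  dbDisconnect(con, shutdown = TRUE)
}
```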


And of course there's compute time vs human time. If your current process is just slow but still works, is it worth investing hours/days/weeks/... to improve it, just to save a few minutes/hours over the next year or so? There are of course other factors (cumulative compute cost, cost of the delay in some foreseeable future, risks from a new/experimental/unstable stack, risks from an obsolete stack, etc.).


Also, a few related chapters from R4DS that might help:


I think @prubin gave you a great hint. Try to load your data into a database like PostgreSQL, MySQL or similar. Then you have the great advantage of using lazy tables with dplyr (dbplyr). The selection of rows and columns will take place in the database, which will most likely be well optimized by the database engine. You work only with the relevant rows this way, not needing to process 150M rows in R.

Thanks so much for your detailed answer, @margusl. This is very useful.
It is very interesting about parallelism. I am actually very familiar with parallelism and synchronization on shared object (have a PhD in that field), but was always told that R is designed for single threaded execution, and does not have any way to protect shared data from concurrent accesses, etc. At some point I got familiar with "jobs" in RStudio, but that was pretty close to running to R sessions as the only way that objects were "shared" is that they were copied on request.
This is why I assumed that all the deploy, ldply, etc, do not support parallelism, even though it seems like a classic use case for it -- especially as the function code for each chunk is running in a separate environment. That said, no one can prevent the user to update objects at the parent environment (via <<- assignment, for example), so it is not "bullet proof" safe to run any code in d_ply in parallel. I was not aware of the option to add parallelism to these iteration functions, and will look into it. I wrote thousands of code lines in R assuming that could have easily benefit from parallelism but assumed it is not supported. (I do use multiple R sessions in parallel though starting from a common state, similar to the concept of jobs)
Memory usage: I found that breaking my data into smaller pieces helps and it is even necessary with respect to memory usage. Sometimes you cannot run a subset operation over the whole data frame (running out of memory error), and have to do it chunk by chunk. (On that aspect, parallel execution may actually work against me, as it would require total memory usage that is larger than if operating chunk by chunk.)

One thing I wondered about when reading your response: you mentioned both data.table and data frames (or maybe it was only me using data frames) -- are the two the same, or is there a difference and/or an advantage of one over the other? Do the same d*ply iterators work on both?
I have very little practical experience working with databases, and SQL in particular. I know it is trivial once you get used to it, but I just never had to.

As for time investment: this work in R is not my main occupation; it is some side research I'm doing in the stock market. I took time many years ago to climb the "steep hill" of learning R and how it works (read the famous Advanced R book), but since then I've mostly learned on the go, and could never afford committing upfront to putting much effort into building infrastructure for my research. Every once in a while I build some generic utility, but as a side project it is not something I can do very efficiently.

Finally, as for some examples of what I'm doing: I don't think that providing actual R code in this case would help much, but here is a description of the type of data and examples of how I'm using it:

  • Data: historical quotes of options on given underlying equities in the stock market
  • An option is specified by multiple parameters, including the name of the underlying, option expiration, strike price, type of option (call or put), etc.
    This is why I may have many rows even for a single timestamp, as I could have prices for options with different strikes or expirations -- and my timing resolution is minutes, which is why even one year of quotes inflates to a huge amount of data.
  • Most of my use cases involve evaluating values of various option strategies over a period of time. An options strategy includes at least 2 different options, maybe sharing a strike, expiration, etc., and its value is the sum of the values of those options.
    What that means is that I usually need to dig in for a certain options combination, at a given start time, and track its price over some period of time. More importantly, I'd like to explore how small changes to a strategy affect its value and PnL. So, overall I track a very small fraction of the data at a time, but there is no one particular structure common to all usages, and thus no obvious criterion to divide the data by (dividing the quotes by month, for example, will not work great when you're looking at a strategy that begins in one month and expires in the next).
    So, unfortunately, I don't think there is an obvious way to structure/split my data for all the strategies I explore, especially as I sometimes want to compare the same strategy on different time periods (meaning different quotes).

Hope this sheds some light on what I am dealing with. As said, I don't think a code example in this case would be useful, as there are too many parameters and a full example of building a strategy would require quite a bit of code.

Thank you again for all the pointers and advice. Will definitely look into that, especially if I end up working with that amount of data for a while.

data.table is an extension of data.frame. It internally uses multi-threaded C code and can be much faster than base R or dplyr, but it can't handle more data than your RAM allows.
So, if you have enough RAM to hold your dataset, I would go with data.table, and use an external database if not.
data.table uses a particular syntax you'll have to get used to, but AI could help you there -- even the basic (i.e. free) ChatGPT is quite knowledgeable about data.table.
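A minimal sketch of that syntax (assuming data.table is installed; the columns are made up). The general form is DT[i, j, by] -- filter, compute and group in one call -- and := adds columns by reference, without copying the table:

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- data.table(security = c("AAA", "AAA", "BBB"),
                   quote    = c(100, 102, 50))

  ## filter (i), aggregate (j), group (by):
  avg <- DT[quote > 60, .(avg_quote = mean(quote)), by = security]

  ## add a column in place -- no copy of the full table:
  DT[, half := quote / 2]
}
```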


Great, thank you @zivan . Will check it out.


This is super important, and it is a hard constraint: if you are near the memory limits, creating new variables and the like can massively slow you down. The syntax is strange at first, but well supported and documented. Databases are the go-to otherwise; I learned Arrow to sort it out and it worked nicely.


R is single-threaded, and explicit parallelism is implemented by running multiple isolated R worker processes (not necessarily local) in parallel. While there's some overhead from data (de)serialization and transport, this kind of parallelism happens to go nicely with R's functional paradigm, where one is expected to avoid side effects in functionals (lapply() & co, purrr::*map()), meaning that in many cases it is easy to switch from regular serial functionals to parallel variants (base parallel::par*() and parallel::mc*() for base *apply() and Map(); furrr provides parallel variants of purrr functionals, but purrr also supports parallel execution through mirai; CRAN Task View: High-Performance and Parallel Computing with R | Explicit parallelism).
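For example, a minimal switch from a serial lapply() to cluster workers with base parallel (cluster workers also run on Windows, where forking is unavailable):

```r
library(parallel)

cl <- makeCluster(2)                         # two isolated R worker processes
res_par <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)

unlist(res_par)   # 1 4 9 16
```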

The package status of plyr is retired. In principle, parallelizing plyr::ddply() shouldn't be too much work, but when it comes to dplyr / Tidyverse, I'd rather go with dplyr::group_split() or tidyr::nest() and the relevant purrr or base functional, which can then easily be modified for parallel execution.
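For example, a ddply-style split-apply-combine sketched with dplyr + purrr (assumes both are installed; the data frame is made up) -- the map() step is the piece you'd later swap for a parallel variant:

```r
if (requireNamespace("dplyr", quietly = TRUE) &&
    requireNamespace("purrr", quietly = TRUE)) {
  library(dplyr, warn.conflicts = FALSE)
  library(purrr)

  df <- data.frame(grp = c("a", "a", "b"), val = c(1, 2, 3))

  res_grp <- df |>
    group_split(grp) |>                                   # split
    map(\(d) summarise(d, grp = first(grp),
                          total = sum(val))) |>           # apply
    list_rbind()                                          # combine
}
```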

Note that with proper use of data.table, arrow and/or DuckDB you may benefit from some implicit parallelism for free, i.e. no need to bother about it yourself, perhaps just control the maximum number of available threads. Well, real life can be a bit more nuanced, as apparently similar methods / approaches with identical results can perform quite differently, e.g. data.table fast methods vs R generics. It also takes special care when mixing implicit and explicit parallelism.


Also note that often there is no need to load the entire multi-GB dataset into memory. Pre-processing, storage choices and partitioning of your input data can make quite a difference here, regardless of the route you take for compute-heavy processing during analysis. For example, you can first save your input data as a Parquet dataset, partitioned by year and month (or whatever variables are commonly used in your filters / groupings and produce sensibly sized partitions; R4DS: Arrow provides some pointers). Start analysis steps with arrow::open_dataset(), apply filters and materialize the result as an R data.frame, and only the metadata and the relevant subset are pulled into R. From there you can continue with base R, dplyr or data.table. Or skip the data.frame and continue with arrow. And/or switch from arrow to DuckDB, which happens without a data copy. Or read the Parquet dataset directly with DuckDB.
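A hedged sketch of that Parquet workflow (assumes arrow and dplyr are installed; the tiny data frame stands in for the real dataset):

```r
if (requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  library(arrow, warn.conflicts = FALSE)
  library(dplyr, warn.conflicts = FALSE)

  dat <- data.frame(year  = c(2024, 2024),
                    month = c(1, 2),
                    quote = c(100, 50))
  dir <- tempfile()
  write_dataset(dat, dir, partitioning = c("year", "month"))

  res_arrow <- open_dataset(dir) |>
    filter(month == 1) |>   # pushed down: only that partition is read
    collect()               # materialize just the subset into R
}
```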

As you can use dplyr syntax with different backends, you can get started with data.table, arrow and DuckDB using the dplyr syntax you already know, even if it's just to get an idea of how those backends perform. From your dataset and task description, it looks like you might be (or perhaps should be) using rolling window functions and/or packages for time-series analysis and modelling. If so, an additional dplyr layer might get in the way sooner rather than later, and it can limit which, if any, out-of-R-process data-handling options are still feasible and to what extent.