Column-major ordering in R and implications for tidyverse

igor · May 10, 2018, 11:57pm

I just learned about column-major ordering in R, but I haven't seen it discussed in the context of dplyr. Since the order of values within a column determines their position in memory, does that mean that the arrangement of rows in a tibble can affect performance? For example, would group_by operations run faster if the rows were already sorted by the grouping variable? Can you actually get any noticeable improvement in real-world usage?
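For readers unfamiliar with the term, here is a minimal base R sketch of what column-major storage means, and of how sorting rows changes the order of values inside each column. The names `m`, `df`, and `df_sorted` are made up for illustration; nothing here is specific to dplyr.

```r
# A matrix is a single vector with a dim attribute, filled column by column.
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

as.vector(m)      # 1 2 3 4 5 6: the columns laid end to end
m[1, 2] == m[3]   # TRUE: row 1, column 2 is the third element in storage

# A data frame (and a tibble) is a list of column vectors, so reordering
# the rows reorders the values inside every column.
df <- data.frame(g = c("b", "a", "b"), x = c(10, 20, 30))
is.list(df)       # TRUE

df_sorted <- df[order(df$g), ]
df_sorted$x       # 20 10 30: rows with the same g are now adjacent
```

After sorting, values that belong to the same group sit next to each other within each column, which is the situation the question is asking about.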

jennybryan · May 11, 2018, 1:09am

A real answer to this would require empirical benchmarking. Even then, you couldn't count on the answer for long, because this isn't a property of dplyr that is actively managed or guaranteed.
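A rough sketch of what such a benchmark could look like, assuming dplyr is installed. The data size, the number of groups, the column names `g` and `x`, and the helper `summarise_by_group()` are all arbitrary choices for illustration, not anything dplyr prescribes. The timings depend on the machine and the dplyr version, so no expected numbers are shown.

```r
library(dplyr)

set.seed(42)   # arbitrary seed, only so the data is reproducible
n <- 1e6       # illustrative size; change it to resemble your own data

df <- tibble(
  g = sample.int(1000, n, replace = TRUE),  # hypothetical grouping key
  x = runif(n)
)
df_sorted <- arrange(df, g)                 # same rows, pre-sorted by g

# Hypothetical helper: the grouped operation being timed.
summarise_by_group <- function(d) {
  d %>%
    group_by(g) %>%
    summarise(mean_x = mean(x))
}

t_unsorted <- system.time(res_unsorted <- summarise_by_group(df))
t_sorted   <- system.time(res_sorted   <- summarise_by_group(df_sorted))

# Elapsed wall-clock seconds for a single run of each variant.
print(c(unsorted = t_unsorted[["elapsed"]], sorted = t_sorted[["elapsed"]]))

# Both variants should produce the same summary, whatever the timings say.
stopifnot(isTRUE(all.equal(res_unsorted$mean_x, res_sorted$mean_x)))
```

A single `system.time()` run is noisy, so repeating each variant several times (for example with the bench or microbenchmark packages) would give a more trustworthy comparison. The time spent in `arrange()` itself also has to be counted if the data does not arrive pre-sorted.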

Warning: I am speculating now! I would guess that if your data is of a size where this affects performance in a noticeable way, you should be investigating more options than "dplyr: should I pre-sort or not?" and considering much more dramatic changes to the workflow.

igor · May 11, 2018, 2:04am

I was hoping I had stumbled upon a power-user trick, but I assume that if it made a significant difference, more people would be talking about it. I was mostly just curious whether this is something that is being considered, and how.