I just learned about column-major ordering in R. I haven't seen anything about it in the context of dplyr. Since the order of values within a column affects their actual position in memory, does it mean that arranging the data within a tibble can affect performance? For example, for any group_by operations, would it help if the grouping elements are in order? Can you actually achieve any noticeable improvements in real-world usage?
A real answer to this would require empirical benchmarking. Even then, you couldn't count on the result for long, because this is not a property of dplyr that is actively managed or guaranteed.
Warning: I am speculating now! I would guess that if your data is of a size where this affects performance in a noticeable way, you should be investigating more options than "dplyr: should I pre-sort or not?" and considering much more dramatic changes to the workflow.
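For anyone who wants to try the empirical benchmarking mentioned above, here is a minimal sketch. It assumes dplyr 1.0.0 or later, and the data size, group count, column names (`g`, `x`), and helper `summarise_by_group()` are all hypothetical choices made for illustration. It only shows how such a comparison could be set up; it says nothing about what the result will be on your data or dplyr version.

```r
library(dplyr)

set.seed(42)
n <- 1e6  # hypothetical size; adjust to resemble your own data

# Same rows in two orders: shuffled group keys vs. keys pre-sorted with arrange()
shuffled <- tibble(
  g = sample.int(1000L, n, replace = TRUE),
  x = rnorm(n)
)
sorted <- arrange(shuffled, g)

# Hypothetical helper: one grouped summary, applied identically to both inputs
summarise_by_group <- function(df) {
  df %>%
    group_by(g) %>%
    summarise(mean_x = mean(x), .groups = "drop")
}

# The timings cover only the grouped summary, not the cost of sorting itself
time_shuffled <- system.time(res_shuffled <- summarise_by_group(shuffled))
time_sorted   <- system.time(res_sorted   <- summarise_by_group(sorted))

print(time_shuffled)
print(time_sorted)
```

A single `system.time()` call is noisy, so repeat the runs (or use a dedicated benchmarking package) before drawing conclusions, and remember to count the time spent in `arrange()` if the data does not already arrive sorted.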
I was hoping I had stumbled upon a power-user trick, but I assume that if it made a significant difference, more people would be talking about it. I was mostly just curious whether this is something that is being considered, and how.