I work with relatively large datasets: the main tibble is a tidy, long tibble (millions of rows) with three columns: Sample, Hash, number of occurrences. Each row is an observation. In order to plot, model, etc, I need to group samples according to their metadata, and hence there is another tibble with as many rows as samples (~100 rows and ~10 columns: Sample, Site, Month,...).
So, in order to, let's say, obtain a boxplot of the number of unique sequences per sample at each site, I will have to join the two tibbles. A workflow will look like
That works and gives me the plot I am after. My question is whether it is mandatory to formally join both tibbles, or if it is possible to generate the group_by of tibble #1 with the grouping information of tibble #2.
This may seem trivial but I think with really long tibbles, any new column added should increase the size of the object dramatically, right?
So my question is which is the tidiest way of dealing with relational tibbles
If each sample is in exactly one row of factor_tibble, then isn't group_by(sample, Site) the same as group_by(sample)?
In answer to your question, you could keep the tables separate to reduce memory usage and only join the necessary columns as part of a pipeline. That way, it takes much less memory which is (loosely) "freed up" after the pipe's done.
First, fair point on the sample/site group redundancy.
Second, your solution avoids storing the join tibble in a Robject and that's an improvement. I guess what I was hoping for is a way of passing the grouping without actually merging the tibbles, even under the hood.
There is beauty in normal form of data (... so help me Codd). Hard to improve upon in an abstract way.
Of course not all constraints are as hard today as they were in the late 1960's - e.g. storage costs mean something different now and then.
Still, in your case I would suggest offloading the main and the lookup tibbles into a database. Especially if you expect your work to involve a lot of grouping / subsetting. These are the kind of operations in which relational databases shine.
You will move from an environment of available memory constraint (R) to an environment of available disk space constraint (RDBS). In most use cases the hard drive wins.
About design of relational model you can have a quick web search on star schema - which is the simplest one (and describes nicely your situation - one fact table with one or more dimensions).
To get more fancy you can look up the great Inmon vs. Kimball controversy (hint: it is not much of a controversy, but why spoil a good story?)