We have a data analytics pipeline, and we're struggling with it being almost but not quite standardised.
We go through various ingestion / cleaning / analytics steps which are often quite similar across different data analysis projects, but there's enough project-specific variation that we can't just functionalise the lot.
Each step is usually a combination of markdown and one or more code chunks.
- some steps are pretty standardised in format (e.g. given our standard data input, plot and tabulate a certain statistic) but they will only be needed in certain projects
- some steps are similar but will need some manual code editing depending on the project - exactly which bit needs editing is variable enough that it's difficult to parametrise
- some steps are completely project-specific
- some steps were developed specifically for one project and later we found we could recycle them for a different one
At the moment, we're doing a lot of copy-pasting (terrible for version control), rewriting the same code (waste of time), working off a "default template" which doesn't capture most of the chunks/modules we've developed over time, and overall being frustrated and unhappy.
We could work toward more functionalisation. However, this makes project-specific edits tricky, and (personal opinion) I think hiding a lot of processing steps behind a function call can make the pipeline code quite hard to interpret.
We have explored R Markdown child documents / Quarto includes - is this the only way? They always feel like a minor feature and I'm a bit worried about hanging our entire pipeline structure on them.
What I think I'm envisioning is a "packaging system" for Rmd child documents, with a formalised documentation structure, and the ability to use the step document as-is or with modification. Is there any (non-absurd) way to repurpose an R package to work this way? Are we better off with a child document library repo that we fork for every project? Is there another way to accomplish this that I'm not thinking of?
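To make the "packaging system" idea a bit more concrete, here is a rough sketch of what I imagine, not something we have built. Everything named here is hypothetical: a package called `ourpipeline` that ships step documents under `inst/steps/`, and the helpers `step_path()`, `use_step()` and `eject_step()`. It only relies on `system.file()` and `knitr::knit_child()`.

```r
# Hypothetical helpers for a step library shipped inside an R package.
# Assumption: the package "ourpipeline" stores one .Rmd per step in inst/steps/.

# Locate a step document by name; fail loudly if it does not exist.
step_path <- function(step,
                      lib_dir = system.file("steps", package = "ourpipeline")) {
  path <- file.path(lib_dir, paste0(step, ".Rmd"))
  if (!file.exists(path)) {
    stop("No step document named '", step, "' in ", lib_dir, call. = FALSE)
  }
  path
}

# Use a step as-is: knit the packaged document as a child of the current report.
use_step <- function(step,
                     lib_dir = system.file("steps", package = "ourpipeline"),
                     envir = knitr::knit_global()) {
  knitr::knit_child(step_path(step, lib_dir), envir = envir, quiet = TRUE)
}

# Use a step with modification: copy it into the project so it can be edited
# and tracked in the project's own version control.
eject_step <- function(step, dest_dir = "steps",
                       lib_dir = system.file("steps", package = "ourpipeline")) {
  src <- step_path(step, lib_dir)
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, paste0(step, ".Rmd"))
  if (file.exists(dest)) {
    stop("Refusing to overwrite existing file: ", dest, call. = FALSE)
  }
  file.copy(src, dest)
  invisible(dest)
}
```

In a project report, the as-is case would then be a chunk with `results = "asis"` containing something like `cat(use_step("plot_statistic"), sep = "\n")`, while `eject_step("plot_statistic")` would drop an editable copy into the project's `steps/` folder to be included as an ordinary child document.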