One or many internal packages?

In my organization we have an internal package that does a lot of things:

  • Plotting
  • Creating tables
  • Report templates
  • Connecting to databases
  • Modelling and statistics
  • Shiny-stuff

We're about to split this package into a set of smaller packages, both to make it easier to maintain and to follow the principles outlined in R Packages: "We believe that packages that have a wide audience should strive to do one thing and do it well."

However, I suppose this can go too far in the other direction as well, where we end up with too many small packages that get outdated and are harder to maintain.

So I'm curious about how organizations with a lot of internal code organize their packages.

Take plotting and making tables as an example. One might argue that there should be one package for plots and one package for tables because they do two different things. But tables and plots are often created in the same analysis, or in the same Shiny app. So you might equally argue that there should be one package for creating visualizations.

I realize that this is very context-driven but I'm just curious to hear if someone wants to share how their organizations work with organizing internal packages.

Great question @filipwastberg. As you noted, this is a very context-driven question. For you and anyone else considering breaking up a large package into multiple smaller ones, these are the questions I would recommend asking yourself to determine whether the pros will outweigh the cons.

How independent are the developers?

Is the large package developed by many individuals scattered across departments and time zones? Do certain individuals/teams only work on specific functions of the large package? If yes, I think splitting up into small packages would be beneficial because this would allow the independent developers to work faster, since they could review and merge their PRs without worrying about breaking another team's code. For example, let's imagine that a team of data engineers in time zone 1 develops the database connection code, a team of statisticians in time zone 2 develops the modelling and statistics code, and a team of data analysts in time zone 3 develops the plotting/tables/reports/shiny code. I think this arrangement would benefit from splitting into multiple smaller packages.

On the other hand, if the large package is developed by a small number of individuals in the same department and time zone, then splitting into multiple smaller packages is more of a con since it mainly just creates additional packages to maintain and release.

How independent is the code?

How often does a change in one part of the code require changes in other parts? For example, what if the studies started collecting a new variable? Presumably this would require updating the database connection code to fetch this new column, the modelling code to add it to the statistical model, and the plotting/tables/reports/shiny code to include it in the report artifacts. If this type of dependent change happens often, then I think it is easier to edit the single larger package. Doing one big find-replace and then bumping the version number of the large package is much simpler than doing a find-replace in each separate package and then also requiring a minimum version of the dependent packages (e.g. you couldn't use the latest version of the modelling package with an outdated version of the database connection package because it wouldn't return the newly required variable).
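
To make the minimum-version point concrete, here is a rough sketch of what the DESCRIPTION file of a split-out modelling package might contain. The package names acmemodels and acmedb and the version numbers are made up for illustration; the only part that is standard R packaging syntax is the Imports field with a (>= version) constraint.

```
Package: acmemodels
Version: 2.1.0
Imports:
    acmedb (>= 1.4.0)
```

With a constraint like this, R should refuse to load acmemodels 2.1.0 when the installed acmedb is older than 1.4.0. That protects users from mismatched versions, but it also means every coordinated change requires bumping the constraint in each downstream package.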

On the other hand, if the data structure rarely changes, and most PRs to the large repo only affect a single piece of functionality, then splitting into separate packages would be more beneficial.

How robust is your internal CI/CD?

Is it easy to run CI/CD pipelines on your internal R packages? Would it be possible to set up integration pipelines that test the latest versions of the multiple smaller packages together? How painful is it to release a new version of an internal R package to your end users?

If you have the infrastructure to conveniently test and deploy your R packages, then having more packages isn't a big deal. But if testing a package is cumbersome, and deploying is a painful, multi-step process, then releasing new versions of multiple separate packages is going to be much more work.