Formatting data for a data package - long vs wide

Hi there,

I'm developing my first R package to learn more about the process. I'm using data published in annual reports from a local government agency, and I'm going to split it into a few data frames to group related data together. I've found some great tutorials that cover the technical elements of creating packages, but my question is: how should I format these data frames?

I have two options:

A wider format where each row is a year, and each variable is a column.

year var1 var2
2025 5 10
2024 4 9
2023 3 8

A longer format where the variable names are listed under 'category' and the values go in a single 'count' column.

year category count
2025 var1 5
2025 var2 10
2024 var1 4
2024 var2 9

What is best practice for including data in packages? When doing my own analyses I often work with ggplot2, which is designed to work with long format data. But I've looked at some built-in data packages and they tend to use wide format. I can't find any discussion or recommendations on this topic, so I'm grateful for any advice.

In your first example, each variable does not have its own column. var1 and var2 each combine two variables, category and count: the counts are spread across two columns, and the category is encoded in the column names. Your second example is an actual case of tidy data, where each variable has its own column.
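To make the difference concrete, here is a minimal sketch of reshaping the wide example from the question into the long one. It assumes the tidyr package is installed; the object names wide and long are just placeholders.

```r
library(tidyr)  # assumption: tidyr is installed

# The wide example from the question: one row per year, one column per variable.
wide <- data.frame(
  year = c(2025, 2024, 2023),
  var1 = c(5, 4, 3),
  var2 = c(10, 9, 8)
)

# Move the column names into 'category' and the cell values into 'count'.
long <- pivot_longer(
  wide,
  cols = c(var1, var2),
  names_to = "category",
  values_to = "count"
)

print(long)
```

The result has one row per year and category, which is the shape ggplot2 expects when you map category to an aesthetic such as colour.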

There's no "best practice" per se for how to structure data in packages, but it would be poor practice to use a data structure that doesn't suit your package. Simply use the structure that fits your package and its exported functions. If your package is designed to be used with the tidyverse, then you should definitely use a tidy data structure, because that's the tidyverse standard.
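If you do ship the long format, users who prefer the wide layout can still get it back in one call. A rough sketch, again assuming tidyr is installed and using placeholder object names:

```r
library(tidyr)  # assumption: tidyr is installed

# The long example from the question.
long <- data.frame(
  year = c(2025, 2025, 2024, 2024),
  category = c("var1", "var2", "var1", "var2"),
  count = c(5, 10, 4, 9)
)

# Spread 'category' back out into one column per variable.
wide <- pivot_wider(long, names_from = category, values_from = count)

print(wide)
```

So the choice is not a one-way door; pick the shape your own functions and documentation examples work with most naturally.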