Formatting data for a data package - long vs wide

tesaunders · December 11, 2025, 7:48pm

Hi there,

I'm developing my first R package to learn more about the process. I'm using data published in annual reports from a local government agency. I'm going to split the data up into a few data frames to group related data together. I've found some great tutorials that cover the technical elements of creating packages, but my question is: how should I format these data frames?

I have two options:

A wider format where each row is a year, and each variable is a column.

year	var1	var2
2025	5	10
2024	4	9
2023	3	8

A longer format where the variable names are listed in a 'category' column and the values sit in a single 'count' column:

year	category	count
2025	var1	5
2025	var2	10
2024	var1	4
2024	var2	9

What is best practice for including data in packages? When doing my own analyses I often work with ggplot2, which is designed to work with long format data. But I've looked at some built-in data packages and they tend to use wide format. I can't find any discussion or recommendations about this topic, so I'd be grateful for any advice.

arangaca · December 12, 2025, 2:00pm

In your first example, each variable does not actually have its own column. The var1 and var2 columns each combine two variables, category and count. Your second example is an actual case of tidy data, where each variable (year, category, count) has its own column.
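
To make the difference concrete, here is a minimal sketch of reshaping the first (wide) example into the second (long) one. It assumes the tidyr package is installed; the object names `wide` and `long` are made up for illustration, and the values are the ones from the question.

```r
library(tidyr)

# Wide layout from the question: one row per year, one column per variable.
wide <- data.frame(
  year = c(2025, 2024, 2023),
  var1 = c(5, 4, 3),
  var2 = c(10, 9, 8)
)

# Move the column names into 'category' and the cell values into 'count',
# so that year, category and count each get their own column.
long <- pivot_longer(
  wide,
  cols = c(var1, var2),
  names_to = "category",
  values_to = "count"
)

print(long)
```

Each year now contributes one row per category, which is the shape ggplot2 usually expects when mapping `category` to colour or facets.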

There's no "best practice" per se on how to structure data in packages. A poor practice would be to use a data structure that's not suitable for your package though. Simply use the data structure that suits your package and exported functions. If your package is designed to be used with the tidyverse, then you should definitely use a tidy data structure because it's the tidyverse standard.
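
If some users are likely to want the other shape, one possible approach is to store the tidy version and offer a small helper that returns the wide view. The sketch below is only an illustration under that assumption: `agency_counts` and `agency_counts_wide()` are hypothetical names, not part of any existing package, and it again assumes tidyr is available.

```r
library(tidyr)

# Hypothetical dataset as it might be stored in the package (long / tidy).
agency_counts <- data.frame(
  year = c(2025, 2025, 2024, 2024),
  category = c("var1", "var2", "var1", "var2"),
  count = c(5, 10, 4, 9)
)

# Hypothetical exported helper: one row per year, one column per category.
agency_counts_wide <- function(data = agency_counts) {
  pivot_wider(data, names_from = category, values_from = count)
}

wide <- agency_counts_wide()
print(wide)
```

This keeps a single copy of the data in the package while still serving users who prefer one column per variable.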