If wide is the preferred data format in the tidyverse, why do some packages require “long” format data frames?

I'm having some conceptual issues around data formats, even within the tidyverse.

My data is in what I understand is called "wide" format, with each variable in its own column. This is, I understand, the "preferred" data format for R packages, as described in r4ds ([tidy-data](https://r4ds.had.co.nz/tidy-data.html)).

What I'm failing to grok is the need to transform this into "long" format when I wish to invoke ggplot (or at least its boxplot function).

I understand (I think!) how to do this, but I'm totally failing to see why, or indeed to know what other packages might require their data transformed in this way.
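
For concreteness, here is a minimal sketch of the kind of reshaping I mean. The data frame, its column names, and its values are all made up for illustration; only `pivot_longer()`, `ggplot()`, `aes()`, and `geom_boxplot()` are real functions from tidyr and ggplot2.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one row per subject, one column per treatment.
wide <- data.frame(
  subject = 1:4,
  control = c(5.1, 4.9, 5.4, 5.0),
  drug_a  = c(5.8, 6.1, 5.9, 6.3),
  drug_b  = c(4.2, 4.6, 4.4, 4.1)
)

# Stack the three measurement columns into a treatment/response pair.
long <- pivot_longer(
  wide,
  cols = c(control, drug_a, drug_b),
  names_to = "treatment",
  values_to = "response"
)

# geom_boxplot() maps one column to x and one column to y, so the
# grouping variable has to live in a single column.
p <- ggplot(long, aes(x = treatment, y = response)) +
  geom_boxplot()
```

Printing `p` draws one box per treatment. In the wide version there is no single column I could hand to `aes(x = ...)`, which seems to be where the need for reshaping comes from.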

I think that the tidyverse prefers that data be tidy. Depending on the data's original format, it may be necessary to make it wider or longer to make it tidy. Can you give an example of data that is tidy but needs to be made longer to be used in ggplot?

I have seen cases where my understanding of the data changes so that what seemed tidy at first may not be.

Long is the preferred data format, not wide.

What is long or wide can depend on the context though.

I think FJCC is right on the mark here. Tidy datasets are those that meet the following criteria, regardless of whether they are in long or wide form (which, again, are relative terms).

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
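
As a rough illustration of these criteria, and of the point that tidying sometimes means going wider rather than longer, here is a small sketch using `table2`, an example dataset that ships with tidyr. In `table2` each country-year observation is split across two rows (one for `cases`, one for `population`), so it breaks rule 2. The object name `tidy` is just a label chosen for this example.

```r
library(tidyr)

# table2 has columns: country, year, type, count.
# The `type` column holds variable names ("cases", "population"),
# so each observation is spread over two rows.
tidy <- pivot_wider(table2, names_from = type, values_from = count)

# After widening, each variable has its own column and each
# country-year observation has its own row.
```

The result has the same layout as tidyr's `table1`. So the same three rules can call for `pivot_wider()` in one dataset and `pivot_longer()` in another.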

I think this quote from Hadley's book is the key. Tidy data is simply easier to manipulate into other forms.

> There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
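
A minimal sketch of what that vectorised advantage looks like in practice, assuming tidyr's built-in `table1` (columns `country`, `year`, `cases`, `population`). The `rate` column and the per-10,000 scaling are just choices made for this example.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Because cases and population are each a whole column (a vector),
# a single arithmetic expression computes the rate for every row.
rates <- mutate(table1, rate = cases / population * 10000)
```

With the variables spread across rows instead, as in `table2`, the same calculation would first need a reshape or a join.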
