This is a companion discussion topic for the original entry at https://www.tidyverse.org/blog/2022/02/tidyr-1-2-0
We’re chuffed to announce the release of tidyr 1.2.0. tidyr provides a set of tools for transforming data frames to and from tidy data, where each variable is a column and each observation is a row. Tidy data is a convention for matching the semantics and structure of your data that makes using the rest of the tidyverse (and many other R packages) much easier.
You can install it from CRAN with:
install.packages("tidyr")
This blog post will go over the main new features, which include four new arguments to
pivot_wider()
, the ability to unnest multiple columns at once in
unnest_wider()
and
unnest_longer()
, an enhanced
complete()
function, and some updates to our tools for handling missing values.
You can see a full list of changes in the release notes, where you’ll also find details on the ~50 bugs that were fixed in this release!
New author
First off, we are very excited to welcome
Maximilian Girlich as a new tidyr author in recognition of his significant and sustained contributions. In particular, he played a large part in speeding up a number of core functions, including:
unchop()
,
unnest()
,
unnest_wider()
, and
unnest_longer()
. Additionally, he provided proof-of-concept implementations for a few new features, like the unused_fn
argument to
pivot_wider()
discussed below.
Pivoting
Value expansion
pivot_wider()
has gained two new arguments related to the expansion of values. These arguments are similar to drop = FALSE
from
spread()
, but are a bit more fine grained. As you’ll see, these are mostly useful when you have factors in either names_from
or id_cols
and want to ensure that all of the factor levels are retained.
weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
daily <- tibble(
day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
value = c(2, 3, 1, 5)
)
daily
#> # A tibble: 4 × 2
#> day value
#>
#> 1 Tue 2
#> 2 Thu 3
#> 3 Fri 1
#> 4 Mon 5
Imagine you’d like to pivot the values from day
into columns, filling the cells with value
. By default,
pivot_wider()
only generates columns from the data that is actually there, and will retain the ordering that was present in the data.
pivot_wider(daily, names_from = day, values_from = value)
#> # A tibble: 1 × 4
#> Tue Thu Fri Mon
#>
#> 1 2 3 1 5
When you know the full set of possible values and have encoded them as factor levels (as we have done here), you might want to retain those levels in the pivot, even if there isn’t any data. Additionally, it would probably be nice if they were sorted to match the levels found in the factor. The new names_expand
argument handles both of these.
pivot_wider(daily, names_from = day, values_from = value, names_expand = TRUE)
#> # A tibble: 1 × 7
#> Mon Tue Wed Thu Fri Sat Sun
#>
#> 1 5 2 NA 3 1 NA NA
A related problem can occur when there are implicit missing factor levels in the id_cols
. When this happens, there are missing rows (rather than columns) that you’d like to explicitly represent. To demonstrate, we’ll modify daily
with a type
column, and pivot on that instead, keeping day
as an identifier column.
daily <- daily %>% mutate(type = c("A", "B", "B", "A"))
daily
#> # A tibble: 4 × 3
#> day value type
#>
#> 1 Tue 2 A
#> 2 Thu 3 B
#> 3 Fri 1 B
#> 4 Mon 5 A
In the pivot below, we are missing some rows corresponding to the missing factor levels of day
. Again, by default
pivot_wider()
will only use data that already exists in the id_cols
.
pivot_wider(
daily,
names_from = type,
values_from = value
)
#> # A tibble: 4 × 3
#> day A B
#>
#> 1 Tue 2 NA
#> 2 Thu NA 3
#> 3 Fri NA 1
#> 4 Mon 5 NA
To explicitly expand (and sort) these missing rows, we can use id_expand
, which works much the same way as names_expand
. We will also go ahead and fill the unrepresented values with zeros.
pivot_wider(
daily,
id_expand = TRUE,
names_from = type,
values_from = value,
values_fill = 0
)
#> # A tibble: 7 × 3
#> day A B
#>
#> 1 Mon 5 0
#> 2 Tue 2 0
#> 3 Wed 0 0
#> 4 Thu 0 3
#> 5 Fri 0 1
#> 6 Sat 0 0
#> 7 Sun 0 0
Varying names
When you specify multiple values_from
columns, the resulting column names that get generated from the combination of names_from
values and values_from
names default to varying the names_from
values fastest. This means that all of the columns related to the first values_from
column will be at the front, followed by the columns related to the second values_from
column, and so on. For example, if we wanted to flatten daily
all the way out to a single row by specifying values_from = c(value, type)
, then we would end up with all the columns related to value
followed by those related to type
.
pivot_wider(
daily,
names_from = day,
values_from = c(value, type),
names_expand = TRUE
)
#> # A tibble: 1 × 14
#> value_Mon value_Tue value_Wed value_Thu value_Fri value_Sat value_Sun type_Mon
#>
#> 1 5 2 NA 3 1 NA NA A
#> # … with 6 more variables: type_Tue , type_Wed , type_Thu ,
#> # type_Fri , type_Sat , type_Sun
Depending on your data, you might instead want to group all of the columns related to a particular names_from
value together. In this example, that would mean grouping all of the columns related to Monday together, followed by Tuesday, Wednesday, etc. You can accomplish this with the new names_vary
argument, which allows you to vary the names_from
values slowest.
pivot_wider(
daily,
names_from = day,
values_from = c(value, type),
names_expand = TRUE,
names_vary = "slowest"
)
#> # A tibble: 1 × 14
#> value_Mon type_Mon value_Tue type_Tue value_Wed type_Wed value_Thu type_Thu
#>
#> 1 5 A 2 A NA NA 3 B
#> # … with 6 more variables: value_Fri , type_Fri , value_Sat ,
#> # type_Sat , value_Sun , type_Sun
Unused columns
Occasionally you’ll find yourself in a situation where you have columns in your data that are unrelated to the pivoting process itself, but you’d still like to retain some information about them. Consider this data set that records values returned by various systems across multiple counties.
readouts <- tibble( county = c("Wake", "Wake", "Wake", "Guilford", "Guilford"), date = c(as.Date("2020-01-01") + 0:2, as.Date("2020-01-03") + 0:1), system = c("A", "B", "C", "A", "C"), value = c(3.2, 4, 5.5, 2, 1.2) )
readouts
#> # A tibble: 5 × 4
#> county date system value
#>
#> 1 Wake 2020-01-01 A 3.2
#> 2 Wake 2020-01-02 B 4
#> 3 Wake 2020-01-03 C 5.5
#> 4 Guilford 2020-01-03 A 2
#> 5 Guilford 2020-01-04 C 1.2
You might want to pivot this into a view containing one row per county
, with the system
types across the columns. You might do something like:
pivot_wider(
readouts,
id_cols = county,
names_from = system,
values_from = value
)
#> # A tibble: 2 × 4
#> county A B C
#>
#> 1 Wake 3.2 4 5.5
#> 2 Guilford 2 NA 1.2
This worked, but in the process we’ve lost all of the information from the date
column about when the values were recorded. To fix this, we can use the new unused_fn
argument to retain a summary of the unused date
column. In our case, we’ll retain the most recent date a value was recorded across all systems.
pivot_wider(
readouts,
id_cols = county,
names_from = system,
values_from = value,
unused_fn = list(date = max)
)
#> # A tibble: 2 × 5
#> county A B C date
#>
#> 1 Wake 3.2 4 5.5 2020-01-03
#> 2 Guilford 2 NA 1.2 2020-01-04
If you want to retain the unused columns but delay the summarization entirely, you can use
list()
to wrap up the value into a list column.
pivot_wider(
readouts,
id_cols = county,
names_from = system,
values_from = value,
unused_fn = list
)
#> # A tibble: 2 × 5
#> county A B C date
#>
#> 1 Wake 3.2 4 5.5
#> 2 Guilford 2 NA 1.2
Note that for unused_fn
to work, you must supply id_cols
explicitly, as otherwise all of the remaining columns are assumed to be id_cols
.
More informative errors
We’ve improved on a number of the error messages throughout tidyr, but the error you get from
pivot_wider()
when you encounter values that aren’t uniquely identified is now especially nice. Let’s “accidentally” add a duplicate row to readouts
.
readouts2 <- readouts %>% slice(seq_len(n()), n())
readouts2
#> # A tibble: 6 × 4
#> county date system value
#>
#> 1 Wake 2020-01-01 A 3.2
#> 2 Wake 2020-01-02 B 4
#> 3 Wake 2020-01-03 C 5.5
#> 4 Guilford 2020-01-03 A 2
#> 5 Guilford 2020-01-04 C 1.2
#> 6 Guilford 2020-01-04 C 1.2
Pivoting on system
warns us that the values from value
are not uniquely identified.
pivot_wider(
readouts2,
id_cols = county,
names_from = system,
values_from = value
)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = {summary_fun}` to summarise duplicates.
#> * Use the following dplyr code to identify duplicates.
#> {data} %>%
#> dplyr::group_by(county, system) %>%
#> dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
#> dplyr::filter(n > 1L)
#> # A tibble: 2 × 4
#> county A B C
#>
#> 1 Wake
#> 2 Guilford
This provides us with a number of options, but the last one is particularly useful if we weren’t expecting duplicates. This prints out a block of dplyr code that you can use to quickly identify duplication issues. Replacing {data}
with readouts2
, we get:
readouts2 %>%
dplyr::group_by(county, system) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
#> # A tibble: 1 × 3
#> county system n
#>
#> 1 Guilford C 2
(Un)nesting
unnest_longer()
and
unnest_wider()
have both gained the ability to unnest multiple columns at once. This is particularly useful with
unnest_longer()
, where sequential unnesting would instead result in a Cartesian product, which isn’t typically desired.
# Sequential unnesting df %>% unnest_longer(x) %>% unnest_longer(y) #> # A tibble: 5 × 2 #> x y #> #> 1 1 1 #> 2 1 1 #> 3 1 2 #> 4 2 1 #> 5 2 2
# Joint unnesting
df %>%
unnest_longer(c(x, y))
#> # A tibble: 3 × 2
#> x y
#>
#> 1 1 1
#> 2 1 1
#> 3 2 2
Grids
When
complete()
-ing a data frame, it’s often useful to immediately fill the newly generated missing values with a value that better represents their intention. For example, with the daily
data we could complete on the day
factor column and insert zeros for value
in any row that wasn’t previously represented.
daily #> # A tibble: 4 × 3 #> day value type #> #> 1 Tue 2 A #> 2 Thu 3 B #> 3 Fri 1 B #> 4 Mon 5 A
daily %>%
complete(day, fill = list(value = 0))
#> # A tibble: 7 × 3
#> day value type
#>
#> 1 Mon 5 A
#> 2 Tue 2 A
#> 3 Wed 0 NA
#> 4 Thu 3 B
#> 5 Fri 1 B
#> 6 Sat 0 NA
#> 7 Sun 0 NA
But what if there were already missing values before completing? By default,
complete()
will still fill those explicit missing values too.
daily2 <- daily daily2$value[nrow(daily2)] <- NA daily2 #> # A tibble: 4 × 3 #> day value type #> #> 1 Tue 2 A #> 2 Thu 3 B #> 3 Fri 1 B #> 4 Mon NA A
daily2 %>%
complete(day, fill = list(value = 0))
#> # A tibble: 7 × 3
#> day value type
#>
#> 1 Mon 0 A
#> 2 Tue 2 A
#> 3 Wed 0 NA
#> 4 Thu 3 B
#> 5 Fri 1 B
#> 6 Sat 0 NA
#> 7 Sun 0 NA
To avoid this, you can now retain pre-existing explicit missing values with the new explicit
argument:
daily2 %>%
complete(day, fill = list(value = 0), explicit = FALSE)
#> # A tibble: 7 × 3
#> day value type
#>
#> 1 Mon NA A
#> 2 Tue 2 A
#> 3 Wed 0 NA
#> 4 Thu 3 B
#> 5 Fri 1 B
#> 6 Sat 0 NA
#> 7 Sun 0 NA
Missing values
The three core missing values functions,
drop_na()
,
replace_na()
, and
fill()
, have all been updated to utilize
vctrs. This allows them to work properly with a wider variety of types, and makes them safer to use with some of the existing types that they already supported.
As an example,
fill()
now works properly with the Period types from
lubridate:
library(lubridate, warn.conflicts = FALSE)
df %>%
fill(x, .direction = "down")
#> # A tibble: 5 × 1
#> x
#>
#> 1 1S
#> 2 2S
#> 3 2S
#> 4 4S
#> 5 4S
And it now treats NaN
like any other missing value:
df <- tibble(x = c(NaN, 2, NA, 3))
df %>%
fill(x, .direction = "up")
#> # A tibble: 4 × 1
#> x
#>
#> 1 2
#> 2 2
#> 3 3
#> 4 3
The most drastic improvement in safety comes to
replace_na()
. Previously, this relied on [<-
to replace missing values with a replacement value, which is much laxer than vctrs in terms of what the replacement value can be. This resulted in the possibility for your column type to change depending on what your replacement value was.
# Notice that this is an integer column
df <- tibble(x = c(1L, NA, 3L))
df
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 NA
#> 3 3
# Previous behavior without vctrs:
# Integer column changed to character column
df %>% replace_na(list(x = "missing"))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 missing
#> 3 3
# Integer column changed to double column
df %>% replace_na(list(x = 1))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 1
#> 3 3
With vctrs, we now ensure that the replacement value is always cast to the type of the column you are replacing in. This ensures that the column types remain the same before and after you replace any missing values.
# New behavior with vctrs:
# Error, because "missing" can't be converted to an integer
df %>% replace_na(list(x = "missing"))
#> Error: Can't convertreplace$x
to match type ofdata$x
.
# Integer column type is retained, and the double value of
1
is
# converted to an integer replacement value of1L
df %>% replace_na(list(x = 1))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 1
#> 3 3
Acknowledgements
Thanks to the 25 people who contributed to this version of tidyr by discussing ideas and suggesting new features! @aliaamiri, @allenbaron, @bersbersbers, @cjburgess, @DanChaltiel, @edzer, @eshom, @gaborcsardi, @gergness, @ggrothendieck, @iago-pssjd, @issactoast, @joiharalds, @LuiNov, @LukasWallrich, @mgirlich, @MichaelChirico, @NFA, @olehost, @psads-git, @psychelzh, @ramiromagno, @romainfrancois, @TimTaylor, and @xiangpin.