Many times I get the following error message when I use the spread function: "Error: Duplicate identifiers for rows" and inevitably have to use dcast. Why does dcast work and spread kick back errors when I use the same data set?
There's a really nice explanation by @alistaire here (just a quick excerpt below)
spread
doesn't apply a function to combine multiple values (à ladcast
), so rows must be indexed so there's one or zero values for a location
This would be a minimal example of code that raises this error.
library(dplyr)
library(tidyr)
wide <- data_frame(
number = c(1:3, 1:3),
value = letters[1:6])
wide
#> # A tibble: 6 x 2
#> number value
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 1 d
#> 5 2 e
#> 6 3 f
long <- wide %>%
gather(variable, value, -number)
long
#> # A tibble: 6 x 3
#> number variable value
#> <int> <chr> <chr>
#> 1 1 value a
#> 2 2 value b
#> 3 3 value c
#> 4 1 value d
#> 5 2 value e
#> 6 3 value f
long %>%
spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6)
One way to think about the problem is that there is more than one way to spread the data when there are repeated identifiers. For example, which of these is correct?
#> # A tibble: 6 x 2
#> number value
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 1 d
#> 5 2 e
#> 6 3 f
#> # A tibble: 6 x 2
#> number value
#> <int> <chr>
#> 1 1 d
#> 2 2 e
#> 3 3 f
#> 4 1 a
#> 5 2 b
#> 6 3 c
I'm not sure if this is the exact reason why rows must be uniquely identifiable, but it makes sense. tidyr is a little stricter in some ways than reshape2.
Fortunately, we can make rows uniquely identifiable by just numbering them with something like tibble::rowid_to_column()
.
long <- wide %>%
tibble::rowid_to_column() %>%
gather(variable, value, -number, -rowid)
long %>%
spread(variable, value)
It may be worth putting in a pull request to tidyr to have a more friendly error message that states:
long %>%
spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6).
#> Consider adding a unique ID with tibble::rowid_to_column()
At it's core, tidyr is assuming that your dataset is in 3NF, or third normal form where there is a unique identifier, or at least a combination of values that could form a unique identifier for each row(which could easily be created via a simple mutate call), but when it cannot find an identifier, it simply tells you it is not possible.
...except that's rarely the identifier column actually needed, as spreading afterwards just shifts values sideways and inserts a lot of NA
s. What actually is needed seems to vary a lot depending on the data in question.
As a side note, I'd really like tibble::rownames_to_column
to be able to run type.convert
in case there are no names or row names are otherwise numeric. It wouldn't be type-safe, but it could be put in as a convert = TRUE
parameter like that of spread
and gather
. I'd write the PR if the maintainers would accept it, so I guess I should open an issue, but that's a lot of typing and I'm tired.
On its own, tibble::rowid_to_column()
assigns a unique id to every single row, which actually cancels out most of the spreading. New columns are created, but there are as many rows as before, and there are more NA
s than I want. Instead, I use group_by(...) %>% do(tibble::rowid_to_column(.)) %>% spread(...)
. That way, id's are only unique where they disambiguate rows, and spread()
seems to do what I want.
Also, in my situation, the following are equivalent and row order does not matter.
#> # A tibble: 6 x 2
#> number value
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 1 d
#> 5 2 e
#> 6 3 f
#> # A tibble: 6 x 2
#> number value
#> <int> <chr>
#> 1 1 d
#> 2 2 e
#> 3 3 f
#> 4 1 a
#> 5 2 b
#> 6 3 c
I am not convinced that the current solution is acceptable. I find it very difficult to explain to newcomers why spread
has difficulties spreading what gather
has just gathered.
Lets take example from SO question @mara quoted above:
library(tidyverse)
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
df
#> # A tibble: 5 x 2
#> age gender
#> <chr> <fct>
#> 1 21 Male
#> 2 17 Female
#> 3 32 Female
#> 4 29 Male
#> 5 15 Male
df %>%
spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
Ok, so for some strange reason tidyr
needs another index to do the spread
. Fine! Why can't it intercept the call and do something along the lines of:
df %>%
group_by_at(vars(-age)) %>% # group by everything other than the value column.
mutate(row_id=1:n()) %>% ungroup() %>% # build group index
spread(key=gender, value=age) %>% # spread
select(-row_id) # drop the index
#> # A tibble: 3 x 2
#> Female Male
#> <chr> <chr>
#> 1 17 21
#> 2 32 29
#> 3 NA 15
Am I missing something?
Thanks so much dmi3k! Your solution for my tidyr "spread" problem worked like a champ on the data set I had in mind in my original October 2017 posting. Absolute keeper!!
Yup, thank you. Came here from an error in scanstatistics::df_to_matrix which was using spread like this and both a) the super helpful minimal example and b) the solution were useful in identifying and fixing the problem. Somebody should really get that pull request to TidyR though
Very glad that you were able to get squared away. @dmi3k's solutuion was perfect. Have never had to worry about those nasty "Duplicate identifiers" since.
This isn't the default behavior because it assumes a significance to row order within groups that is not explicit in the data itself. (That's the row_id
index it creates.) In this case it's pretty dubious there's a connection between 17 and 21 and thus that the rows in wide form mean anything at all. This data is better left in long form.
There are cases where the assumption may be good, but that means the data is not tidy, as there is a variable which is not explicit.