tidyr::pivot_wider_spec with selection gives unexpected results

HanOostdijk · March 19, 2020, 2:47pm

I can use tidyr::pivot_wider and tidyr::pivot_wider_spec and both give the same result (see df1 and df2 in the reprex below). However when I apply a filter on the specification specw I only get a good result (see df3 ) if I specify an additional summarize function (e.g. values_fn = list(seen = max) ). When I do not do this I get warnings (see df4) and a result with lists instead of numbers.

This happens both in the current CRAN version of tidyr as in the development version that I downloaded today. See for the R environment the session info attached below. I also use a recent version of RStudio (1.3.904) but this is probably not relevant.

I might be a bug, because I think it should be possible to apply a filter like done here, but maybe I am wrong. Any ideas?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

df1= tidyr::pivot_wider(fish_encounters,id_cols=c('fish'),
   names_from = station,values_from = seen,
   values_fill = c('seen'=0L) 
   )
head(df1)
#> # A tibble: 6 x 12
#>   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#> 1 4842        1     1      1     1       1     1     1     1     1     1     1
#> 2 4843        1     1      1     1       1     1     1     1     1     1     1
#> 3 4844        1     1      1     1       1     1     1     1     1     1     1
#> 4 4845        1     1      1     1       1     0     0     0     0     0     0
#> 5 4847        1     1      1     0       0     0     0     0     0     0     0
#> 6 4848        1     1      1     1       0     0     0     0     0     0     0
warnings()

specw = tidyr::build_wider_spec(fish_encounters,
   names_from = station,values_from = seen
   )

df2 = tidyr::pivot_wider_spec(fish_encounters,specw,id_cols=c('fish'),
   values_fill = c('seen'=0L)  
   )
head(df2)
#> # A tibble: 6 x 12
#>   fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>   <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#> 1 4842        1     1      1     1       1     1     1     1     1     1     1
#> 2 4843        1     1      1     1       1     1     1     1     1     1     1
#> 3 4844        1     1      1     1       1     1     1     1     1     1     1
#> 4 4845        1     1      1     1       1     0     0     0     0     0     0
#> 5 4847        1     1      1     0       0     0     0     0     0     0     0
#> 6 4848        1     1      1     1       0     0     0     0     0     0     0
warnings()

specw = specw %>%
  filter (!.name %in% c('Release','Lisbon') )


df3 = tidyr::pivot_wider_spec(fish_encounters,specw,id_cols=c('fish'),
   values_fill = c('seen'=0L),
   values_fn = list(seen = max)
)
head(df3)
#> # A tibble: 6 x 10
#>   fish  I80_1  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>   <fct> <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#> 1 4842      1     1       1     1     1     1     1     1     1
#> 2 4843      1     1       1     1     1     1     1     1     1
#> 3 4844      1     1       1     1     1     1     1     1     1
#> 4 4845      1     1       1     0     0     0     0     0     0
#> 5 4847      1     0       0     0     0     0     0     0     0
#> 6 4848      1     1       0     0     0     0     0     0     0
warnings() 


df4 = tidyr::pivot_wider_spec(fish_encounters,specw,id_cols=c('fish'),
   values_fill = c('seen'=0L) 
)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
head(df4)
#> # A tibble: 6 x 10
#>   fish  I80_1   Rstr    Base_TD  BCE     BCW     BCE2    BCW2    MAE     MAW    
#>   <fct> <list>  <list>  <list>   <list>  <list>  <list>  <list>  <list>  <list> 
#> 1 4842  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
#> 2 4843  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
#> 3 4844  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
#> 4 4845  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
#> 5 4847  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
#> 6 4848  <int [~ <int [~ <int [1~ <int [~ <int [~ <int [~ <int [~ <int [~ <int [~
warnings()

^{Created on 2020-03-19 by the reprex package (v0.3.0)}

Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.0 (2019-04-26)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       Europe/Berlin               
#>  date     2020-03-19                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                          
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                  
#>  backports     1.1.5      2019-10-02 [1] CRAN (R 3.6.1)                  
#>  callr         3.3.2      2019-09-22 [1] CRAN (R 3.6.1)                  
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 3.6.3)                  
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                  
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                  
#>  devtools      2.2.1      2019-09-24 [1] CRAN (R 3.6.0)                  
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 3.6.3)                  
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 3.6.0)                  
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 3.6.1)                  
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.1)                  
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.2)                  
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)                  
#>  glue          1.3.2      2020-03-12 [1] CRAN (R 3.6.3)                  
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.5.3)                  
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 3.6.1)                  
#>  knitr         1.28       2020-02-06 [1] CRAN (R 3.6.0)                  
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 3.6.3)                  
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                  
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                  
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 3.6.2)                  
#>  pkgbuild      1.0.6      2019-10-09 [1] CRAN (R 3.6.1)                  
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 3.6.1)                  
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.1)                  
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                  
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.6.1)                  
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.2)                  
#>  purrr         0.3.3      2019-10-18 [1] CRAN (R 3.6.1)                  
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.1)                  
#>  Rcpp          1.0.4      2020-03-17 [1] CRAN (R 3.6.0)                  
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.1)                  
#>  rlang         0.4.5      2020-03-01 [1] CRAN (R 3.6.3)                  
#>  rmarkdown     2.1        2020-01-20 [1] CRAN (R 3.6.2)                  
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                  
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                  
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 3.6.2)                  
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                  
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 3.6.3)                  
#>  tibble        2.1.3      2019-06-06 [1] CRAN (R 3.6.1)                  
#>  tidyr       * 1.0.2.9000 2020-03-19 [1] Github (tidyverse/tidyr@bcfc5f0)
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 3.6.3)                  
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.6.1)                  
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.5.1)                  
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 3.6.0)                  
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                  
#>  xfun          0.10       2019-10-01 [1] CRAN (R 3.6.1)                  
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)                  
#> 
#> [1] D:/tools/R/Packages
#> [2] D:/tools/R/R-3.6.0/library

siddharthprabhu · March 19, 2020, 3:44pm

@HanOostdijk I don't think it's a bug. The issue is that after removing the values Release and Lisbon from specw, pivot_wider() can no longer uniquely identify each observation during the pivoting operation.

That's why you're getting list columns in your output and need to specify an aggregation function via values_fn to fix it.

dromano · March 19, 2020, 4:47pm

Hi@HanOostdijk, my sense is that the pivot*_spec() commands are intended to be used with spec tables made with the companion build*_spec() commands, rather than with user-built spec tables, since there doesn't seem to be any guidance on their expected structure. In the source code, NA's replace the filtered-out variable names in your data, so as @siddharthprabhu suggests, rows are no longer unique. Of course, that's a choice made, and maybe NA'd variables could simply be excluded before building the returned table, but that's not the case in the current code.

HanOostdijk · March 19, 2020, 5:55pm

In the development version (that I used to see if the 'problem' from the standard version was solved) there is a little more guidance (with the description of pivot_wider_spec and build_wider_spec 'hidden' in that of pivot_wider). In it is given e.g. the example at the end that uses no build*_spec function at all.

But I think that it is not consistent of the developers to on the one hand say that the removed two values make the set no longer unique and on the other hand don't show these values in the result.
And of course it is no big deal: it is always possible to remove the observations before the call to the pivot_wider* function.

I will have to study the code (excellent suggestion by @dromano) or look at cdata package to get a feeling for the use of the pivot_*_spec function.

Thank you @dromano and @siddharthprabhu for your quick and insightful replies.
I think you both are right: this is not meant to be used as a filter for the columns (to keep).

Example from help:

# `pivot-wider_spec()` provides more control over column names and output format
# instead of creating columns with estimate_ and moe_ prefixes,
# keep original variable name for estimates and attach _moe as suffix
spec2 <- tibble(
  .name = c("income", "rent", "income_moe", "rent_moe"),
  .value = c("estimate", "estimate", "moe", "moe"),
  variable = c("income", "rent", "income", "rent")
)
us_rent_income %>%
  pivot_wider_spec(spec2)

dromano · March 19, 2020, 6:04pm

I agree that it seems strange to impose uniqueness requirements on data that is thrown away, and from walking through the code with debug(), it looks like it wouldn't too difficult find the right place to insert a line that filters out any variables set to NA before the output-building process starts. (Maybe an exercise for later!).

nirgrahamuk · March 19, 2020, 6:19pm

my understanding is that 99% of users who are doing analysis, or composing higher level software would reach for pivot_wider typically, and the 1% of users who are doing hardcore programming or building out packages for other users might delve into the pivot_wider_spec area.
Do you fit more in the latter category than the former category ?
I get that your example is simplified just by nature of being a reprex
that said, applying a filter to the data, rather than the spec made from the data, achieves the same end, and works more naturally with pivot_wider as you've shown.

I'm wondering if its an oversight, which is always possible, or if it might be about the ability to 'guarantee' its unique, without necessarily checking ? perhaps there is some efficiency gain from this inconsistency ? speculative

HanOostdijk · March 19, 2020, 8:14pm

@nirgrahamuk: you are right. My use case was just study.
I was always having trouble with the 'gather' and 'spread' and this morning I was following some 'tidyr' lessons by Jane Wall. After lesson 2 about pivot_wider and pivot_longer I felt equipped to tackle the 'spec' versions. Apparently not ready yet

But enough to feel inspired to study more. I also installed the pivot vignette from the development version on github. I am not sure if it is already available in the production version. A lot of nice examples!
Study continues.

system · April 9, 2020, 8:14pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.