grouped data splitting - observations

MikeKSmith · March 3, 2020, 3:37pm

I've been investigating how group_by and grouped data behave with splitting, nesting, etc. I was interested to see that the output is sorted alphanumerically for splitting, but not for purrr::nest. In the example dataset I've arranged by descending homeworld to put Coruscant before Alderaan. After group_by, the group_keys are sorted alphabetically, and group_split works the same way as base R split: both order the pieces by the grouping variable. However, purrr::nest on grouped data does not sort; the groups come back in the order they first appear in the data.

I just want to check whether this is expected behaviour and whether it's reasonable. It caught me by surprise the first time I saw it, but having seen that base R split does essentially the same, I don't think it's a particular issue. Maybe just something to be aware of.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

by_homeworld <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  arrange(desc(homeworld)) %>%
  group_by(homeworld)

by_homeworld %>%
  group_keys()
#> # A tibble: 2 x 1
#>   homeworld
#>   <chr>    
#> 1 Alderaan 
#> 2 Coruscant

by_homeworld %>%
  ungroup()
#> # A tibble: 6 x 2
#>   homeworld name               
#>   <chr>     <chr>              
#> 1 Coruscant Finis Valorum      
#> 2 Coruscant Adi Gallia         
#> 3 Coruscant Jocasta Nu         
#> 4 Alderaan  Leia Organa        
#> 5 Alderaan  Bail Prestor Organa
#> 6 Alderaan  Raymus Antilles

by_homeworld %>%
  tidyr::nest()
#> # A tibble: 2 x 2
#> # Groups:   homeworld [2]
#>   homeworld data            
#>   <chr>     <list>          
#> 1 Coruscant <tibble [3 x 1]>
#> 2 Alderaan  <tibble [3 x 1]>

starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  split(.$homeworld)
#> $Alderaan
#> # A tibble: 3 x 2
#>   homeworld name               
#>   <chr>     <chr>              
#> 1 Alderaan  Leia Organa        
#> 2 Alderaan  Bail Prestor Organa
#> 3 Alderaan  Raymus Antilles    
#> 
#> $Coruscant
#> # A tibble: 3 x 2
#>   homeworld name         
#>   <chr>     <chr>        
#> 1 Coruscant Finis Valorum
#> 2 Coruscant Adi Gallia   
#> 3 Coruscant Jocasta Nu

by_homeworld %>%
  group_split()
#> [[1]]
#> # A tibble: 3 x 2
#>   homeworld name               
#>   <chr>     <chr>              
#> 1 Alderaan  Leia Organa        
#> 2 Alderaan  Bail Prestor Organa
#> 3 Alderaan  Raymus Antilles    
#> 
#> [[2]]
#> # A tibble: 3 x 2
#>   homeworld name         
#>   <chr>     <chr>        
#> 1 Coruscant Finis Valorum
#> 2 Coruscant Adi Gallia   
#> 3 Coruscant Jocasta Nu   
#> 
#> attr(,"ptype")
#> # A tibble: 0 x 2
#> # ... with 2 variables: homeworld <chr>, name <chr>

by_homeworld %>%
  group_split() %>%
  map_df(I)
#> # A tibble: 6 x 2
#>   homeworld name               
#>   <chr>     <chr>              
#> 1 Alderaan  Leia Organa        
#> 2 Alderaan  Bail Prestor Organa
#> 3 Alderaan  Raymus Antilles    
#> 4 Coruscant Finis Valorum      
#> 5 Coruscant Adi Gallia         
#> 6 Coruscant Jocasta Nu

Created on 2020-03-03 by the reprex package (v0.3.0)
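A note on why base R split sorts: split() coerces a character grouping vector to a factor, and factor levels default to the sorted unique values. The minimal base R sketch below uses small made-up vectors (borrowing names from the output above, not the full starwars data) to show that supplying a factor with levels in order of appearance changes the order of the pieces:

```r
# Illustrative vectors, not the full starwars data
homeworld <- c("Coruscant", "Coruscant", "Alderaan")
name <- c("Finis Valorum", "Adi Gallia", "Leia Organa")

# Character grouping vector: split() converts it with as.factor(),
# whose levels are the sorted unique values
sorted_names <- names(split(name, homeworld))
sorted_names
#> [1] "Alderaan"  "Coruscant"

# Factor with levels in order of first appearance: pieces follow the levels
in_order <- factor(homeworld, levels = unique(homeworld))
appearance_names <- names(split(name, in_order))
appearance_names
#> [1] "Coruscant" "Alderaan"
```

So the sorting in split() comes from the default factor levels rather than from any explicit sort of the rows.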

Session info

sessionInfo()
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 16299)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United Kingdom.1252 
#> [2] LC_CTYPE=English_United Kingdom.1252   
#> [3] LC_MONETARY=English_United Kingdom.1252
#> [4] LC_NUMERIC=C                           
#> [5] LC_TIME=English_United Kingdom.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] purrr_0.3.3 tidyr_1.0.2 dplyr_0.8.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3       knitr_1.28       magrittr_1.5     tidyselect_1.0.0
#>  [5] R6_2.4.1         rlang_0.4.4      fansi_0.4.1      stringr_1.4.0   
#>  [9] highr_0.8        tools_3.6.2      xfun_0.12        utf8_1.1.4      
#> [13] cli_2.0.1        htmltools_0.4.0  yaml_2.2.1       assertthat_0.2.1
#> [17] digest_0.6.25    tibble_2.1.3     lifecycle_0.1.0  crayon_1.3.4    
#> [21] vctrs_0.2.3      glue_1.3.1       evaluate_0.14    rmarkdown_2.1   
#> [25] stringi_1.4.6    compiler_3.6.2   pillar_1.4.3     pkgconfig_2.0.3

mara · March 3, 2020, 4:58pm

I think you're comparing tibbles and lists here. Yes, in the last example, where you use map_df(), you end up with a tibble, but it's essentially reassembled from the list that group_split() returns (n.b. group_split() is experimental). With tidyr::nest()* and ungroup() you're inside a tibble the whole time.

That said, if you think this is incorrect, now would be a good time to file an issue, since the dev version is using vctrs_list_of.

library(dplyr)
library(tidyr)
library(purrr)

by_homeworld <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  arrange(desc(homeworld)) %>%
  group_by(homeworld)

reg_split <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  split(.$homeworld)

class(reg_split)
#> [1] "list"

# named grp_split so the object doesn't shadow dplyr::group_split()
grp_split <- by_homeworld %>%
  group_split()

class(grp_split)
#> [1] "vctrs_list_of" "vctrs_vctr"

Created on 2020-03-03 by the reprex package (v0.3.0.9001)

* Your code uses tidyr::nest(), so I think you meant tidyr::nest() where you wrote purrr::nest() in the question.
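Since group_split() returns an unnamed list, one way to keep track of which piece is which is to pair it with group_keys(), which lists the keys in the same order. A rough sketch, rebuilding by_homeworld as in the question (keys, pieces and named are just illustrative names):

```r
library(dplyr)

by_homeworld <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  arrange(desc(homeworld)) %>%
  group_by(homeworld)

# group_keys() and group_split() follow the same (sorted) group order
keys <- group_keys(by_homeworld)
pieces <- group_split(by_homeworld)

# as.list() gives a plain list, in case the result is a vctrs list_of
named <- setNames(as.list(pieces), keys$homeworld)
names(named)
#> [1] "Alderaan"  "Coruscant"
```

With names attached, named[["Coruscant"]] can be looked up directly instead of relying on list position.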

dromano · March 3, 2020, 7:20pm

If you add a slice() operation to your dplyr version, you'll likely see the sorting change:

library(dplyr)

starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  arrange(desc(homeworld)) %>%
  select(homeworld, name) %>%
  group_by(homeworld) %>%
  slice(1)

This is the case with SQL in general -- sorting is unpredictable unless requested explicitly -- and probably happens in dplyr for the same reason, namely (as I understand it) that optimizing performance is prioritized over maintaining a fixed ordering of rows. (In relational databases, the rows and columns of a table can be thought of as sets in the mathematical sense, which carry no inherent order.)
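In practice this means asking for the order you want after the grouped step rather than relying on whatever comes out. A small sketch of the slice() example with an explicit arrange() added at the end (first_per_world is an illustrative name):

```r
library(dplyr)

first_per_world <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  group_by(homeworld) %>%
  slice(1) %>%                 # one row per group
  ungroup() %>%
  arrange(desc(homeworld))     # request the row order explicitly

first_per_world$homeworld
#> [1] "Coruscant" "Alderaan"
```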

MikeKSmith · March 4, 2020, 3:56pm

Thanks for your clarification, Mara. That helps me understand some of the differences I was seeing.

It has left me pondering, though, whether sorting by the group_by variable is desirable, inevitable or unexpected. My going-in position would be that {dplyr} verbs shouldn't do things that we're not explicitly asking them to do. Because there's no arrange call, I'm surprised to see the output in alphabetical order. HOWEVER, I'm sitting here wondering what I would expect the alternative to be... Assuming that group_by keeps the order of the data also seems somewhat unreasonable, especially given @dromano's point.
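One possible alternative, if "data order" is what you want, is to make the grouping variable a factor whose levels follow the order of appearance, since grouping on a factor follows its levels rather than alphabetical order. A sketch under that assumption (by_appearance and key_order are illustrative names):

```r
library(dplyr)

by_appearance <- starwars %>%
  filter(homeworld %in% c("Coruscant", "Alderaan")) %>%
  select(homeworld, name) %>%
  arrange(desc(homeworld)) %>%
  # levels in order of first appearance: Coruscant, then Alderaan
  mutate(homeworld = factor(homeworld, levels = unique(homeworld))) %>%
  group_by(homeworld)

# group keys now follow the factor levels
key_order <- as.character(group_keys(by_appearance)$homeworld)
key_order
#> [1] "Coruscant" "Alderaan"

# group_split() follows the same order, so the first piece is Coruscant
first_piece <- group_split(by_appearance)[[1]]
```

This makes the ordering an explicit choice in the data rather than a side effect of group_by.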
