Why does group_indices use alphabetical ordering?

kylevoyto · February 21, 2018, 1:37pm

I'd like to number each group in a data frame so that the groups are ordered according to the order they appear in the data frame. This is the code that I have so far:

library(tibble)
library(dplyr)

df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)

df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()

df %>% mutate(id = group_indices(., category))
#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     2
#> 2 c         4.00       3.00     3
#> 3 c         2.00       3.00     3
#> 4 a         7.00       7.00     1

I'd like the id variable to be ordered like this:

#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     1
#> 2 c         4.00       3.00     2
#> 3 c         2.00       3.00     2
#> 4 a         7.00       7.00     3

I ordered the data frame according to the criterion that I wanted to use (mean_value), and now I'd like to number the groups in the order in which each category appears.

Why does the group_indices function order alphabetically by default? Is there a simple way for me to achieve my goal?

mara · February 21, 2018, 2:24pm

Hi @kylevoyto,

FYI, there's a related issue open in the dplyr repo:

github.com/tidyverse/dplyr

group_by is sorting and does not maintain original order

opened 11:35AM - 02 Jan 18 UTC

closed 10:23AM - 30 May 18 UTC

ghaarsma

performance

It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:

```R
set.seed(4)
char <- sample(LETTERS[1:20],40,replace = TRUE)
int <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)
x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])

# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]
```

Not sure why group_by is sorting. It seems like it's unnecessary including the additional computational effort. This would make the behavior more like the base function `unique` or dplyr function `distinct`, which does not sort either. Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.

See for older discussion (but with incorrect finding/conclusion) #2159
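To make the character-versus-factor difference described in the issue concrete, here is a small base R sketch. The vector `x` is made up for illustration (it mirrors the row order of the arranged data in the question). It only shows how factor levels are built, not dplyr's internals.

```r
# Hypothetical vector, in the same row order as the question's arranged data
x <- c("b", "c", "c", "a")

# By default, factor() sorts the unique values to build its levels
levels(factor(x))
#> [1] "a" "b" "c"
as.integer(factor(x))
#> [1] 2 3 3 1

# Supplying levels = unique(x) keeps first-appearance order instead
levels(factor(x, levels = unique(x)))
#> [1] "b" "c" "a"
as.integer(factor(x, levels = unique(x)))
#> [1] 1 2 2 3
```

The sorted codes match the ids that group_indices returned in the question, while the codes built from `unique(x)` match the ids the question asks for.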

mishabalyasin · February 21, 2018, 2:33pm

I don't know if it can be considered simple, but I would write my own function for that:

respect_sort <- function(df, category = "category", id = "id"){
  df[[id]] <- NA
  lvls <- df[[category]] %>% unique()
  mapping <- seq_along(lvls)  # one integer id per group, in order of first appearance
  purrr::walk2(lvls, mapping, function(x, y){
    df[[id]][df[[category]] == x] <<- y
  })
  df
}

> df %>% respect_sort()
# A tibble: 4 x 4
  category value mean_value    id
  <chr>    <dbl>      <dbl> <int>
1 b         1.00       1.00     1
2 c         4.00       3.00     2
3 c         2.00       3.00     2
4 a         7.00       7.00     3

It's a little hacky, but it does what you want.

Frank · February 21, 2018, 4:51pm

You can wrap group_indices in another function.

grpid = function(x) match(x, unique(x))
df %>% mutate(id = group_indices(., category) %>% grpid)

# A tibble: 4 x 4
  category value mean_value    id
     <chr> <dbl>      <dbl> <int>
1        b     1          1     1
2        c     4          3     2
3        c     2          3     2
4        a     7          7     3
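As a possible simplification, the same `match(x, unique(x))` idea can probably be applied to the column directly, without calling group_indices first. This is a minimal sketch: the tibble below is typed in by hand to match the arranged data from the question, and `out` is just an illustrative name.

```r
library(tibble)
library(dplyr, warn.conflicts = FALSE)

# Same rows as the question's data after arrange(mean_value, category)
df <- tibble(
  category = c("b", "c", "c", "a"),
  value = c(1, 4, 2, 7),
  mean_value = c(1, 3, 3, 7)
)

# unique(category) lists the values in order of first appearance;
# match() returns each row's position in that list
out <- df %>% mutate(id = match(category, unique(category)))
out$id
#> [1] 1 2 2 3
```

This assumes the data frame is already in the row order you want the ids to follow, since the numbering depends only on where each category first appears.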

For what it's worth, the result you want is provided by default with data.table:

library(data.table)
DT = data.table(df)

DT[, id := .GRP, by=.(category)][]

   category value mean_value id
1:        b     1          1  1
2:        c     4          3  2
3:        c     2          3  2
4:        a     7          7  3

cderv · February 21, 2018, 6:42pm

From the issue that @mara linked, we can see that group order is preserved for factors. So you can do this:

library(tibble)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)

df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()

df %>%
  mutate(id = group_indices(., factor(category, levels = unique(category))))
#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     1
#> 2 c         4.00       3.00     2
#> 3 c         2.00       3.00     2
#> 4 a         7.00       7.00     3

Created on 2018-02-21 by the reprex package (v0.2.0).