I'd like to number each group in a data frame so that the groups are ordered according to the order they appear in the data frame. This is the code that I have so far:
library(tibble)
library(dplyr)
df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)
df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()
df %>% mutate(id = group_indices(., category))
#> # A tibble: 4 x 4
#> category value mean_value id
#> <chr> <dbl> <dbl> <int>
#> 1 b 1.00 1.00 2
#> 2 c 4.00 3.00 3
#> 3 c 2.00 3.00 3
#> 4 a 7.00 7.00 1
I'd like the id variable to be ordered like this:
#> # A tibble: 4 x 4
#> category value mean_value id
#> <chr> <dbl> <dbl> <int>
#> 1 b 1.00 1.00 1
#> 2 c 4.00 3.00 2
#> 3 c 2.00 3.00 2
#> 4 a 7.00 7.00 3
I sorted the data frame by the criterion that I wanted to use (mean_value), and now I'd like to number the groups in the order that the categories appear in the sorted data frame.
Why does the group_indices function order alphabetically by default? Is there a simple way for me to achieve my goal?
mara
February 21, 2018, 2:24pm
2
Hi @kylevoyto,
FYI, there's a related issue open in the dplyr repo:
opened 11:35AM - 02 Jan 18 UTC
closed 10:23AM - 30 May 18 UTC
performance
It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:
```R
set.seed(4)
char <- sample(LETTERS[1:20],40,replace = TRUE)
int <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)
x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))
# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])
# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]
```
Not sure why group_by is sorting. It seems unnecessary, and it adds computational effort. Not sorting would make the behavior more like the base function `unique` or the dplyr function `distinct`, neither of which sorts.
Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a note about sorting to the group_by documentation.
For an older discussion (but with an incorrect finding/conclusion), see #2159.
I don't know if it can be considered simple, but I would write my own function for that:
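respect_sort <- function(df, category = "category", id = "id") {
  # Start with an empty id column
  df[[id]] <- NA
  # Unique categories, in order of first appearance
  lvls <- df[[category]] %>% unique()
  # One integer id per category: 1, 2, 3, ...
  mapping <- seq_along(lvls)
  # Fill in the id for every row of each category;
  # `<<-` modifies the df that lives in respect_sort()
  purrr::walk2(lvls, mapping, function(x, y) {
    df[[id]][df[[category]] == x] <<- y
  })
  df
}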
> df %>% respect_sort()
# A tibble: 4 x 4
category value mean_value id
<chr> <dbl> <dbl> <int>
1 b 1.00 1.00 1
2 c 4.00 3.00 2
3 c 2.00 3.00 2
4 a 7.00 7.00 3
It's a little hacky, but it does what you want.
Frank
February 21, 2018, 4:51pm
4
You can wrap group_indices in another function.
grpid = function(x) match(x, unique(x))
df %>% mutate(id = group_indices(., category) %>% grpid)
# A tibble: 4 x 4
category value mean_value id
<chr> <dbl> <dbl> <int>
1 b 1 1 1
2 c 4 3 2
3 c 2 3 2
4 a 7 7 3
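As a minimal sketch of why this works (base R only, with the category column of the sorted data frame copied into a plain vector for illustration): `unique()` keeps values in order of first appearance, and `match()` returns each value's position in that list. Under that assumption the same ids could also be computed from the column directly, without going through group_indices.

```r
# Category column as it appears after arrange() in the question
category <- c("b", "c", "c", "a")

# unique() preserves order of first appearance: "b" "c" "a"
lvls <- unique(category)

# match() gives the position of each value within lvls
id <- match(category, lvls)
print(id)
```

In a pipeline this would presumably read `df %>% mutate(id = match(category, unique(category)))`.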
For what it's worth, the result you want is provided by default with data.table:
library(data.table)
DT = data.table(df)
DT[, id := .GRP, by=.(category)][]
category value mean_value id
1: b 1 1 1
2: c 4 3 2
3: c 2 3 2
4: a 7 7 3
cderv
February 21, 2018, 6:42pm
5
From the issue that @mara linked, we can see that group_by keeps the order of the levels for factors. So you can do this:
library(tibble)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)
df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()
df %>%
  mutate(id = group_indices(., factor(category, levels = unique(category))))
#> # A tibble: 4 x 4
#> category value mean_value id
#> <chr> <dbl> <dbl> <int>
#> 1 b 1.00 1.00 1
#> 2 c 4.00 3.00 2
#> 3 c 2.00 3.00 2
#> 4 a 7.00 7.00 3
Created on 2018-02-21 by the reprex package (v0.2.0).
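To illustrate what the factor trick relies on, here is a small base R sketch (the vector below is just the sorted category column written out by hand, and the variable names are only for illustration). A factor stores integer codes that follow the order of its levels, so setting the levels to `unique(category)` makes the codes follow the order of appearance, while the default levels are sorted.

```r
# Category column in the sorted row order from the example
category <- c("b", "c", "c", "a")

# Default factor levels are sorted, so the codes follow alphabetical order
alphabetical_id <- as.integer(factor(category))

# Levels taken in order of first appearance give the desired numbering
appearance_id <- as.integer(factor(category, levels = unique(category)))

print(alphabetical_id)
print(appearance_id)
```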