Dplyr order by values, not labels

Slavek · September 8, 2023, 3:08pm

Hi, I have this simple df with recoded variables:

source <- data.frame(
  stringsAsFactors = FALSE,
  Mileage = c(600, 10, 50, 100, 1200, 500, 1000, 2000, 1500, 300, 750),
  Model = c("aaa","aaa","aaa","aaa",
            "aaa","bbb","bbb","bbb","bbb","bbb","bbb"),
  RoTotal = c(0.632652484490701,
              0.185310835928467,0.577469486919075,0.68712277811571,
              0.886954372144862,0.47168592416002,0.488694716005684,
              0.863730894475365,0.366193956213799,0.695052242862401,
              0.714609567675162)
)


library(dplyr)
source <- source %>% mutate(
  Mileage.Bucket = case_when(
    Mileage >= 0 & Mileage <=100 ~ 1,
    Mileage > 100 & Mileage <=500 ~ 2,
    Mileage > 500 & Mileage <=1000 ~ 3,
    Mileage > 1000 & Mileage <=1500 ~ 4,
    Mileage > 1500 ~ 5))

source$Mileage.Bucket <- factor(as.numeric(source$Mileage.Bucket),
                                                  levels = c(1, 2, 3, 4, 5), 
                                                  labels = c("0-100", "100-500", "500-1000", "1000-1500", "More than 1500")) 

table(source$Mileage.Bucket)

I used numbers instead of texts in the mutate deliberately to have values in order (0-100, 100-500, 500-1000, 1000-1500 More than 1500).
Unfortunately, when I run tables using dplyr, the order of Mileage.Bucket is treated as text (I think) and, as a result, I have illogical order of Mileage buckets.

result.table <- source %>%
  bind_rows(mutate(.data = ., Mileage.Bucket = "Total")) %>%
  group_by(Model, Mileage.Bucket) %>%
  summarise_at(.vars = vars(ends_with(match = "RoTotal")),.funs = list(Aver = ~mean(.,na.rm=TRUE), Count = ~sum(!is.na(.)))) 
result.table

Is there a way of ordering Mileage.Bucket by values ( so 1, 2, 3, 4 and 5) without using this manual arrange command?

arrange(match(Mileage.Bucket, c("0-100", "100-500", "500-1000", "More than 1500")))

I am lack of further ideas...

AlexisW · September 8, 2023, 9:13pm

To explain the behavior, note that dplyr::arrange() does handle factors correctly (keeping the right order). However, when your run bind_rows(Mileage.Bucket = "Total"), you are adding text, the whole column becomes text and looses its factor characteristics:

library(dplyr)

source <- data.frame(
  stringsAsFactors = FALSE,
  Mileage = c(600, 10, 50, 100, 1200, 500, 1000, 2000, 1500, 300, 750),
  Model = c("aaa","aaa","aaa","aaa",
            "aaa","bbb","bbb","bbb","bbb","bbb","bbb"),
  RoTotal = c(0.632652484490701,
              0.185310835928467,0.577469486919075,0.68712277811571,
              0.886954372144862,0.47168592416002,0.488694716005684,
              0.863730894475365,0.366193956213799,0.695052242862401,
              0.714609567675162)
)

source <- source %>% mutate(
  Mileage.Bucket = case_when(
    Mileage >= 0 & Mileage <=100 ~ 1,
    Mileage > 100 & Mileage <=500 ~ 2,
    Mileage > 500 & Mileage <=1000 ~ 3,
    Mileage > 1000 & Mileage <=1500 ~ 4,
    Mileage > 1500 ~ 5))

source$Mileage.Bucket <- factor(as.numeric(source$Mileage.Bucket),
                                levels = c(1, 2, 3, 4, 5), 
                                labels = c("0-100", "100-500", "500-1000", "1000-1500", "More than 1500")) 

# correct ordering, note that the column is <fct>
source %>%
  as_tibble() %>%
  arrange(Mileage.Bucket)
#> # A tibble: 11 × 4
#>    Mileage Model RoTotal Mileage.Bucket
#>      <dbl> <chr>   <dbl> <fct>         
#>  1      10 aaa     0.185 0-100         
#>  2      50 aaa     0.577 0-100         
#>  3     100 aaa     0.687 0-100         
#>  4     500 bbb     0.472 100-500       
#>  5     300 bbb     0.695 100-500       
#>  6     600 aaa     0.633 500-1000      
#>  7    1000 bbb     0.489 500-1000      
#>  8     750 bbb     0.715 500-1000      
#>  9    1200 aaa     0.887 1000-1500     
#> 10    1500 bbb     0.366 1000-1500     
#> 11    2000 bbb     0.864 More than 1500

# incorrect ordering, note that the column is now <chr>
source %>%
  bind_rows(mutate(.data = ., Mileage.Bucket = "Total")) %>%
  as_tibble() %>%
  arrange()
#> # A tibble: 22 × 4
#>    Mileage Model RoTotal Mileage.Bucket
#>      <dbl> <chr>   <dbl> <chr>         
#>  1     600 aaa     0.633 500-1000      
#>  2      10 aaa     0.185 0-100         
#>  3      50 aaa     0.577 0-100         
#>  4     100 aaa     0.687 0-100         
#>  5    1200 aaa     0.887 1000-1500     
#>  6     500 bbb     0.472 100-500       
#>  7    1000 bbb     0.489 500-1000      
#>  8    2000 bbb     0.864 More than 1500
#>  9    1500 bbb     0.366 1000-1500     
#> 10     300 bbb     0.695 100-500       
#> # ℹ 12 more rows

^{Created on 2023-09-08 with reprex v2.0.2}

So the question is how to add "Total" to a factor while preserving it? It's not a very standard use of data.frames, I would say the easiest is simply to make sure "Total" is a factor when adding it:

source %>%
  bind_rows(mutate(.data = ., Mileage.Bucket = factor("Total"))) %>%
  group_by(Model, Mileage.Bucket) %>%
  summarise_at(.vars = vars(ends_with(match = "RoTotal")),
               .funs = list(Aver = ~mean(.,na.rm=TRUE), 
                            Count = ~sum(!is.na(.)))) %>%
  arrange(Mileage.Bucket)
#> # A tibble: 9 × 4
#> # Groups:   Model [2]
#>   Model Mileage.Bucket  Aver Count
#>   <chr> <fct>          <dbl> <int>
#> 1 aaa   0-100          0.483     3
#> 2 bbb   100-500        0.583     2
#> 3 aaa   500-1000       0.633     1
#> 4 bbb   500-1000       0.602     2
#> 5 aaa   1000-1500      0.887     1
#> 6 bbb   1000-1500      0.366     1
#> 7 bbb   More than 1500 0.864     1
#> 8 aaa   Total          0.594     5
#> 9 bbb   Total          0.600     6

system · September 15, 2023, 9:14pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.