Sort order discrepancy between arrange and sort

Is there a way to alter how arrange sorts strings? I ran into a discrepancy between arrange and sort, illustrated by the following example.

# Demonstrate sorting discrepancy between `arrange` and `sort`.

# Create sample data.
df <- data.frame(Label = c("bama", "mama", "1000x", "BAnn", "10:00x"), Index = 1:5)

# Sort the rows into ascending label order using `dplyr::arrange`.
df |> dplyr::arrange(Label) |> print()
#>    Label Index
#> 1  1000x     3
#> 2 10:00x     5
#> 3   BAnn     4
#> 4   bama     1
#> 5   mama     2
# Order: 1000x, 10:00x, BAnn, bama, mama.

# Sort the rows into ascending label order using `sort`.
df[sort(df$Label, index.return = TRUE)$ix, ] |> print()
#>    Label Index
#> 5 10:00x     5
#> 3  1000x     3
#> 1   bama     1
#> 4   BAnn     4
#> 2   mama     2
# Order: 10:00x, 1000x, bama, BAnn, mama.

Created on 2024-06-26 with reprex v2.1.0

The context of the original problem involves reading data from a spreadsheet and presenting results in the same order as the spreadsheet rows. Both Google Sheets and LibreOffice Calc use the same sort order that sort does, making arrange the outlier. I'm asking whether there is a setting somewhere (option to arrange, global setting for dplyr) that would convince it to use the same sorting rules used by sort.

I think this would be a good feature request over at the dplyr github issues page.
sort() is I think relying on xtfrm for sort order; I was curious so I searched the dplyr source code for mentions (or discussions) of xtfrm, interestingly I found only one source inclusion; specifically it is the meat and potatoes for arrange reversing a sort order, as -xtfrm() is used within the desc() helper function.

Therefore you can throw in xtfrm() to wrap your variables and it will work the same I think.

df |> dplyr::arrange(xtfrm(Label)) |> print()

or the slightly less esoteric

df |> dplyr::arrange(-desc(Label)) |> print() # i.e. Not descending.

Like I say an elegant option style solution would need to come from a change in the dplyr implementation. you could in the short term make your own arrange function that wraps it ; my code is rather ugly ...

# l_arrange ( for legacy_arrange based on xtfrm)
l_arrange <- function (.data, ..., .by_group = FALSE) {
  require(rlang)
  dots <- enquos(...)
  
  altered_dots <- map(dots,\(q){
    expr <- quo_get_expr(q)
    env <- quo_get_env(q)
    new_expr <- quo(xtfrm(!!expr))
    
    new_quosure(new_expr, env)
  })
  dplyr::arrange(.data = .data,
                 !!!(altered_dots),.by_group)
}

df |> l_arrange(Label) |> print()
2 Likes

Thanks! Both xtfrm and -desc appear to work fine in my application (and are less clunky than using sort). I've adopted -desc, although it frankly shocks me that saying "not descending order" produces a different result than "ascending order". I'll take a look at the GitHub issues page and maybe raise this there.

I raised this on GitHub and got a quick response. Apparently the problem is that arrange uses locale = C by default. Adding the option .locale = "en" to arrange produces the desired sort order.

The person who responded to my GitHub ticket opened a new one regarding the fact that desc apparently does not default to locale = C. Depending on how that gets resolved, the -desc() trick may stop working at some point.

Very interesting. Thanks for the info on that