Sort order discrepancy between arrange and sort

prubin · June 26, 2024, 4:21pm

Is there a way to alter how arrange sorts strings? I ran into a discrepancy between arrange and sort, illustrated by the following example.

# Demonstrate sorting discrepancy between `arrange` and `sort`.

# Create sample data.
df <- data.frame(Label = c("bama", "mama", "1000x", "BAnn", "10:00x"), Index = 1:5)

# Sort the rows into ascending label order using `dplyr::arrange`.
df |> dplyr::arrange(Label) |> print()
#>    Label Index
#> 1  1000x     3
#> 2 10:00x     5
#> 3   BAnn     4
#> 4   bama     1
#> 5   mama     2

# Order: 1000x, 10:00x, BAnn, bama, mama.

# Sort the rows into ascending label order using `sort`.
df[sort(df$Label, index.return = TRUE)$ix, ] |> print()
#>    Label Index
#> 5 10:00x     5
#> 3  1000x     3
#> 1   bama     1
#> 4   BAnn     4
#> 2   mama     2

# Order: 10:00x, 1000x, bama, BAnn, mama.

^{Created on 2024-06-26 with reprex v2.1.0}

The context of the original problem involves reading data from a spreadsheet and presenting results in the same order as the spreadsheet rows. Both Google Sheets and LibreOffice Calc use the same sort order that sort does, making arrange the outlier. I'm asking whether there is a setting somewhere (option to arrange, global setting for dplyr) that would convince it to use the same sorting rules used by sort.

nirgrahamuk · June 26, 2024, 5:06pm

I think this would be a good feature request over at the dplyr github issues page.
sort() is I think relying on xtfrm for sort order; I was curious so I searched the dplyr source code for mentions (or discussions) of xtfrm, interestingly I found only one source inclusion; specifically it is the meat and potatoes for arrange reversing a sort order, as -xtfrm() is used within the desc() helper function.

Therefore you can throw in xtfrm() to wrap your variables and it will work the same I think.

df |> dplyr::arrange(xtfrm(Label)) |> print()

or the slightly less esoteric

df |> dplyr::arrange(-desc(Label)) |> print() # i.e. Not descending.

Like I say an elegant option style solution would need to come from a change in the dplyr implementation. you could in the short term make your own arrange function that wraps it ; my code is rather ugly ...

# l_arrange ( for legacy_arrange based on xtfrm)
l_arrange <- function (.data, ..., .by_group = FALSE) {
  require(rlang)
  dots <- enquos(...)
  
  altered_dots <- map(dots,\(q){
    expr <- quo_get_expr(q)
    env <- quo_get_env(q)
    new_expr <- quo(xtfrm(!!expr))
    
    new_quosure(new_expr, env)
  })
  dplyr::arrange(.data = .data,
                 !!!(altered_dots),.by_group)
}

df |> l_arrange(Label) |> print()

prubin · June 26, 2024, 6:27pm

Thanks! Both xtfrm and -desc appear to work fine in my application (and are less clunky than using sort). I've adopted -desc, although it frankly shocks me that saying "not descending order" produces a different result than "ascending order". I'll take a look at the GitHub issues page and maybe raise this there.

prubin · June 26, 2024, 8:28pm

I raised this on GitHub and got a quick response. Apparently the problem is that arrange uses locale = C by default. Adding the option .locale = "en" to arrange produces the desired sort order.

The person who responded to my GitHub ticket opened a new one regarding the fact that desc apparently does not default to locale = C. Depending on how that gets resolved, the -desc() trick may stop working at some point.

nirgrahamuk · June 27, 2024, 1:04am

Very interesting. Thanks for the info on that

mithileshbwankhede · July 2, 2024, 4:06am

The discrepancy you observed between dplyr::arrange and sort is due to differences in the default collation settings used by these functions. By default, dplyr::arrange uses locale-specific collation rules, while sort uses the locale's collation rules but can be customized with the method argument.

To make dplyr::arrange use the same sorting rules as sort, you can use the stringr::str_sort function with the locale argument, or use base R functions to customize the sorting.

Here's how you can achieve consistent sorting using dplyr::mutate along with stringr::str_sort:

library(dplyr)
library(stringr)

# Create sample data.
df <- data.frame(Label = c("bama", "mama", "1000x", "BAnn", "10:00x"), Index = 1:5)

# Sort the rows into ascending label order using custom sorting rules.
df |> 
  mutate(Label = factor(Label, levels = str_sort(unique(Label), locale = "C"))) |> 
  arrange(Label) |> 
  print()
#>    Label Index
#> 1 10:00x     5
#> 2  1000x     3
#> 3   bama     1
#> 4   BAnn     4
#> 5   mama     2
# Order: 10:00x, 1000x, bama, BAnn, mama.

In this example, str_sort with locale = "C" is used to sort the factor levels according to the same collation rules as sort. The mutate function is used to convert the Label column into a factor with the sorted levels. Finally, arrange sorts the data frame based on this factor.

If you want to use base R functions to achieve the same result without stringr, you can do the following:

library(dplyr)

# Create sample data.
df <- data.frame(Label = c("bama", "mama", "1000x", "BAnn", "10:00x"), Index = 1:5)

# Sort the rows into ascending label order using custom sorting rules.
df <- df %>%
  mutate(Label = factor(Label, levels = sort(unique(Label), method = "radix"))) %>%
  arrange(Label)

print(df)
#>    Label Index
#> 1 10:00x     5
#> 2  1000x     3
#> 3   bama     1
#> 4   BAnn     4
#> 5   mama     2
# Order: 10:00x, 1000x, bama, BAnn, mama.

In this version, sort with method = "radix" is used to sort the factor levels, ensuring consistent collation with the base R sort function.

By customizing the sorting in this way, you can ensure that dplyr::arrange uses the same sorting rules as sort, aligning with the order used in Google Sheets and LibreOffice Calc.

system · July 9, 2024, 4:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.