Sorting column that is a character type but has numbers in it.

Just1234 · November 24, 2023, 9:20pm

Hello, can you please help me sorting the column?

I have continuous variable cut in bands, it often has a scientific notation (at least that's what I see in ggplot)

col1 <- c(
    "[-2,-1)", "[-1,412.5)", "[412.5,1188)", "[1188,1244)", "[1244,1556)",
    "[1556,1628)", "[1628,1631)", "[1631,1775)", "[1775,1834)", "[1834,1950)",
    "[5438,5.729e+04)")
col2 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")
count <- c(1,1,1,1,1,1,1,1,1,1,1)
my_dataframe <- data.frame(col1, col2, count)


soted_df<- my_dataframe %>%
            group_by(col1) %>%
            mutate(Count_tot= sum(Count))  %>%
            mutate(col1 = as.factor(col1)) %>%
            arrange(gtools::mixedsort(col1, decreasing = FALSE))

After using this code, my banded labels are still not sorted well. This is what I see in my ggplot. Im not sure why using the unique(df$col1) gives OK values and the ggplot uses scientific notation.

Thanks

FJCC · November 24, 2023, 11:00pm

Your values in col1 look like they were generated by the cut() function or some similar function. If so, they should be factor levels to begin with and they will sort correctly using the arrange() function. Here is an example of that.

library(dplyr)

my_dataframe <- data.frame(A= c(5,8,1,4,99999,8,2,10,3,3,7,8,2),
                 Count = 1)
my_dataframe$col1 <- cut(my_dataframe$A, breaks = c(0,3,6,10,1e5))
my_dataframe
#>        A Count       col1
#> 1      5     1      (3,6]
#> 2      8     1     (6,10]
#> 3      1     1      (0,3]
#> 4      4     1      (3,6]
#> 5  99999     1 (10,1e+05]
#> 6      8     1     (6,10]
#> 7      2     1      (0,3]
#> 8     10     1     (6,10]
#> 9      3     1      (0,3]
#> 10     3     1      (0,3]
#> 11     7     1     (6,10]
#> 12     8     1     (6,10]
#> 13     2     1      (0,3]
class(my_dataframe$col1)
#> [1] "factor"

soted_df<- my_dataframe %>%
  group_by(col1) %>%
  mutate(Count_tot= sum(Count))  %>%
  #mutate(col1 = as.factor(col1)) %>%
  arrange(col1, decreasing = FALSE)
soted_df
#> # A tibble: 13 × 4
#> # Groups:   col1 [4]
#>        A Count col1       Count_tot
#>    <dbl> <dbl> <fct>          <dbl>
#>  1     1     1 (0,3]              5
#>  2     2     1 (0,3]              5
#>  3     3     1 (0,3]              5
#>  4     3     1 (0,3]              5
#>  5     2     1 (0,3]              5
#>  6     5     1 (3,6]              2
#>  7     4     1 (3,6]              2
#>  8     8     1 (6,10]             5
#>  9     8     1 (6,10]             5
#> 10    10     1 (6,10]             5
#> 11     7     1 (6,10]             5
#> 12     8     1 (6,10]             5
#> 13 99999     1 (10,1e+05]         1

^{Created on 2023-11-24 with reprex v2.0.2}
How was your col1 generated? Is it really character values or is it a factor?

technocrat · November 25, 2023, 1:15am

gtools::mixedsort() does what it's supposed to

library(gtools)
col1 <- c(
     "[-2,-1)", "[-1,412.5)", "[412.5,1188)", "[1188,1244)", "[1244,1556)",
     "[1556,1628)", "[1628,1631)", "[1631,1775)", "[1775,1834)", "[1834,1950)",
     "[5438,5.729e+04)")

# help(mixedsort) examples show this is expected behavior
(with_e <- mixedsort(col1))
#>  [1] "[-2,-1)"          "[-1,412.5)"       "[412.5,1188)"     "[1188,1244)"     
#>  [5] "[1244,1556)"      "[1556,1628)"      "[1628,1631)"      "[1631,1775)"     
#>  [9] "[1775,1834)"      "[1834,1950)"      "[5438,5.729e+04)"

Although, mixedsort() has a scientific argument

scientific logical. Should exponential notation be allowed for numeric values

it doesn't change this output.

So, col1 starts out as typeof character and ends up as typeof charactor and therefore can be modified with regex tools to convert the scientific notation. Here's an outline with some partial code. Come back if you need help implementing this.

Determine which elements of col1 have scientific notation

has_e <- function(x) which(isTRUE(grepl("e",x)))
`# need to loop if has_e() returns more than one index
attributes(which(sapply(col1,has_e) == 1))$names

save out delimiters

get_delims <- function(x){
  delims = "^(.).*(.)$"
  delims = unlist(regmatches(x,regexec(delims,x)))[2:3]
  return(delims)
  }

pick apart the string left over after the delimiters removed,

strip_delims <- function(x) {
  return(substr(x, 2, nchar(x) - 1))
}

There is now the string "5438,5.729e+04", which needs to be split on , and each part tested for an e

components = strsplit(x,",")
left =  components[[1]][1]
right = components[[1]][2]
split_exp = unlist(strsplit(right,"e"))
re_exp = as.numeric(paste0("1","e",split_exp[2]))
#> 10000

Convert forepart to numeric and multiply

as.numeric(left) * re_exp

This, in turn, needs to be converted to character and tested for scientific notation
If so,

format(x,scientific = FALSE)

That returns a string, that can be re-united with left with `paste()1
Finally, reunite that with the opening and closing delimiters

With

options(scipen=999)

it's possible to suppress scientific notation displays, but mixedsort() doesn't seem to respect that.

Just1234 · November 25, 2023, 8:53am

You are right, I am using the cut() function to generate these binned values and it comes out as a factor. However, before doing my final summary tables, that I want to plot, I still have to do quite a bit other manipulations to the dataset which makes this column to become a character. Hence I was using the as.factor() to convert it back to a factor. The arrange() alone or together with the as.factor() does nothing basically.

EDIT:
So, I managed to get it mostly sorted converting the variable into a factor at earlier stages. either arrange() or mixedsort() works fine. There is a strange behavior happening, that one level is above the other.

soted_df<- my_dataframe %>%
  group_by(col1) %>%
  mutate(Count_tot= sum(Count))  %>%
  arrange(col1, decreasing = FALSE)

and the result is

sorted_v <- c(soted_df$col1)

27 Levels: [3258,3521) [-2,-1) [-1,412.5) [412.5,1188) [1188,1244) [1244,1556) [1556,1628) [1628,1631) [1631,1775) [1775,1834)

I am puzzled, how on earth 3258 is lower than -2. Everything else works fine.

Thanks

Just1234 · November 25, 2023, 10:09am

Thank you, this is looking extremely sophisticated. Ill try to sort it with arrange() and mixedsort() and then get to this approach if i fail the first two.

technocrat · November 25, 2023, 10:31am

It's more brute force than sophisticated—intervals turn out to be tricky little devils as character strings.

system · January 6, 2024, 10:32am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.