Sorting column that is a character type but has numbers in it.

Hello, can you please help me sorting the column?

I have continuous variable cut in bands, it often has a scientific notation (at least that's what I see in ggplot)

col1 <- c(
    "[-2,-1)", "[-1,412.5)", "[412.5,1188)", "[1188,1244)", "[1244,1556)",
    "[1556,1628)", "[1628,1631)", "[1631,1775)", "[1775,1834)", "[1834,1950)",
    "[5438,5.729e+04)")
col2 <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K")
count <- c(1,1,1,1,1,1,1,1,1,1,1)
my_dataframe <- data.frame(col1, col2, count)


soted_df<- my_dataframe %>%
            group_by(col1) %>%
            mutate(Count_tot= sum(Count))  %>%
            mutate(col1 = as.factor(col1)) %>%
            arrange(gtools::mixedsort(col1, decreasing = FALSE))

After using this code, my banded labels are still not sorted well. This is what I see in my ggplot. Im not sure why using the unique(df$col1) gives OK values and the ggplot uses scientific notation.

Thanks

Your values in col1 look like they were generated by the cut() function or some similar function. If so, they should be factor levels to begin with and they will sort correctly using the arrange() function. Here is an example of that.

library(dplyr)

my_dataframe <- data.frame(A= c(5,8,1,4,99999,8,2,10,3,3,7,8,2),
                 Count = 1)
my_dataframe$col1 <- cut(my_dataframe$A, breaks = c(0,3,6,10,1e5))
my_dataframe
#>        A Count       col1
#> 1      5     1      (3,6]
#> 2      8     1     (6,10]
#> 3      1     1      (0,3]
#> 4      4     1      (3,6]
#> 5  99999     1 (10,1e+05]
#> 6      8     1     (6,10]
#> 7      2     1      (0,3]
#> 8     10     1     (6,10]
#> 9      3     1      (0,3]
#> 10     3     1      (0,3]
#> 11     7     1     (6,10]
#> 12     8     1     (6,10]
#> 13     2     1      (0,3]
class(my_dataframe$col1)
#> [1] "factor"

soted_df<- my_dataframe %>%
  group_by(col1) %>%
  mutate(Count_tot= sum(Count))  %>%
  #mutate(col1 = as.factor(col1)) %>%
  arrange(col1, decreasing = FALSE)
soted_df
#> # A tibble: 13 Ă— 4
#> # Groups:   col1 [4]
#>        A Count col1       Count_tot
#>    <dbl> <dbl> <fct>          <dbl>
#>  1     1     1 (0,3]              5
#>  2     2     1 (0,3]              5
#>  3     3     1 (0,3]              5
#>  4     3     1 (0,3]              5
#>  5     2     1 (0,3]              5
#>  6     5     1 (3,6]              2
#>  7     4     1 (3,6]              2
#>  8     8     1 (6,10]             5
#>  9     8     1 (6,10]             5
#> 10    10     1 (6,10]             5
#> 11     7     1 (6,10]             5
#> 12     8     1 (6,10]             5
#> 13 99999     1 (10,1e+05]         1

Created on 2023-11-24 with reprex v2.0.2
How was your col1 generated? Is it really character values or is it a factor?

gtools::mixedsort() does what it's supposed to

library(gtools)
col1 <- c(
     "[-2,-1)", "[-1,412.5)", "[412.5,1188)", "[1188,1244)", "[1244,1556)",
     "[1556,1628)", "[1628,1631)", "[1631,1775)", "[1775,1834)", "[1834,1950)",
     "[5438,5.729e+04)")

# help(mixedsort) examples show this is expected behavior
(with_e <- mixedsort(col1))
#>  [1] "[-2,-1)"          "[-1,412.5)"       "[412.5,1188)"     "[1188,1244)"     
#>  [5] "[1244,1556)"      "[1556,1628)"      "[1628,1631)"      "[1631,1775)"     
#>  [9] "[1775,1834)"      "[1834,1950)"      "[5438,5.729e+04)"

Although, mixedsort() has a scientific argument

scientific logical. Should exponential notation be allowed for numeric values

it doesn't change this output.

So, col1 starts out as typeof character and ends up as typeof charactor and therefore can be modified with regex tools to convert the scientific notation. Here's an outline with some partial code. Come back if you need help implementing this.

  1. Determine which elements of col1 have scientific notation
has_e <- function(x) which(isTRUE(grepl("e",x)))
`# need to loop if has_e() returns more than one index
attributes(which(sapply(col1,has_e) == 1))$names
  1. save out delimiters
get_delims <- function(x){
  delims = "^(.).*(.)$"
  delims = unlist(regmatches(x,regexec(delims,x)))[2:3]
  return(delims)
  }
  1. pick apart the string left over after the delimiters removed,
strip_delims <- function(x) {
  return(substr(x, 2, nchar(x) - 1))
}
  1. There is now the string "5438,5.729e+04", which needs to be split on , and each part tested for an e
components = strsplit(x,",")
left =  components[[1]][1]
right = components[[1]][2]
split_exp = unlist(strsplit(right,"e"))
re_exp = as.numeric(paste0("1","e",split_exp[2]))
#> 10000
  1. Convert forepart to numeric and multiply
as.numeric(left) * re_exp
  1. This, in turn, needs to be converted to character and tested for scientific notation

  2. If so,

format(x,scientific = FALSE)
  1. That returns a string, that can be re-united with left with `paste()1

  2. Finally, reunite that with the opening and closing delimiters

With

options(scipen=999)

it's possible to suppress scientific notation displays, but mixedsort() doesn't seem to respect that.

1 Like

You are right, I am using the cut() function to generate these binned values and it comes out as a factor. However, before doing my final summary tables, that I want to plot, I still have to do quite a bit other manipulations to the dataset which makes this column to become a character. Hence I was using the as.factor() to convert it back to a factor. The arrange() alone or together with the as.factor() does nothing basically.

EDIT:
So, I managed to get it mostly sorted converting the variable into a factor at earlier stages. either arrange() or mixedsort() works fine. There is a strange behavior happening, that one level is above the other.

soted_df<- my_dataframe %>%
  group_by(col1) %>%
  mutate(Count_tot= sum(Count))  %>%
  arrange(col1, decreasing = FALSE)

and the result is

sorted_v <- c(soted_df$col1)

27 Levels: [3258,3521) [-2,-1) [-1,412.5) [412.5,1188) [1188,1244) [1244,1556) [1556,1628) [1628,1631) [1631,1775) [1775,1834)

I am puzzled, how on earth 3258 is lower than -2. Everything else works fine.

Thanks

Thank you, this is looking extremely sophisticated. Ill try to sort it with arrange() and mixedsort() and then get to this approach if i fail the first two.

1 Like

It's more brute force than sophisticated—intervals turn out to be tricky little devils as character strings.

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.