Arranging and Sorting Columns based on Missing values

GreyMerchant · January 10, 2024, 1:05pm

Hii,

I have a bit of a weird one. I have this reprex below showing a toy version of my df. My actual dataframe looks very different and has a lot more columns.

I want to find a way to arrange my columns by keeping ID first followed by the columns that have the most missing values to least. I then want to sort across all columns simultaneously trying to have the missing values first followed by ascending order for numeric or ascending order for text. How can I do this?

# Set seed for reproducibility
set.seed(123)

# Create ID column
ID <- 1:50

# Create numeric columns with random values and missing data
numeric_col1 <- runif(50, 0, 10)
numeric_col1[sample(1:50, 5)] <- NA

numeric_col2 <- runif(50, 0, 10)
numeric_col2[sample(1:50, 10)] <- NA

numeric_col3 <- runif(50, 0, 10)
numeric_col3[sample(1:50, 15)] <- NA

# Create text columns with random values and missing data
text_col1 <- sample(letters, 50, replace = TRUE)
text_col1[sample(1:50, 5)] <- NA

text_col2 <- sample(letters, 50, replace = TRUE)
text_col2[sample(1:50, 10)] <- NA

text_col3 <- sample(letters, 50, replace = TRUE)
text_col3[sample(1:50, 15)] <- NA

# Create the dataframe
df <- data.frame(ID, numeric_col1, numeric_col2, numeric_col3, text_col1, text_col2, text_col3)

# Display the resulting dataframe
print(df)
#>    ID numeric_col1 numeric_col2 numeric_col3 text_col1 text_col2 text_col3
#> 1   1    2.8757752  8.950453592    3.1982062         s         g         d
#> 2   2    7.8830514  3.744627759    3.0772001         x      <NA>      <NA>
#> 3   3    4.0897692  6.651151946    2.1976763         t         z      <NA>
#> 4   4    8.8301740  0.948406609    3.6948887         o         j         e
#> 5   5    9.4046728  3.839696378    9.8421920         g         x         u
#> 6   6    0.4555650  2.743836446    1.5420230         d         v         y
#> 7   7           NA  8.146400389           NA         a         w         h
#> 8   8    8.9241904  4.485163414    1.4190691      <NA>         z         w
#> 9   9    5.5143501  8.100643530           NA      <NA>         k      <NA>
#> 10 10    4.5661474  8.123895095    6.1925648      <NA>         a         u
#> 11 11    9.5683335  7.943423211           NA         p         y         m
#> 12 12    4.5333416           NA    6.7299909         x         s         r
#> 13 13    6.7757064  7.544751586           NA         v         j         j
#> 14 14    5.7263340           NA    5.2113573         k         u         f
#> 15 15    1.0292468  7.101824014    6.5983845         p         v         g
#> 16 16    8.9982497  0.006247733           NA         t         m         i
#> 17 17    2.4608773  4.753165741    7.8628155         h      <NA>         p
#> 18 18    0.4205953  2.201188852    9.7982192      <NA>         k         q
#> 19 19    3.2792072  3.798165377           NA         d         y         u
#> 20 20    9.5450365  6.127710033    3.1170220         t      <NA>         m
#> 21 21    8.8953932  3.517979092           NA         l         z         u
#> 22 22    6.9280341  1.111354243    0.1046711         v         g         h
#> 23 23    6.4050681  2.436194727    1.8384952         q         y      <NA>
#> 24 24    9.9426978  6.680555874           NA         j         w         g
#> 25 25           NA  4.176467797           NA      <NA>      <NA>         t
#> 26 26    7.0853047  7.881958340           NA         k         t         r
#> 27 27           NA  1.028646443    0.7669117         y         x      <NA>
#> 28 28    5.9414202  4.348927415    2.4572368         w         y         a
#> 29 29    2.8915974           NA    7.3213521         h      <NA>         b
#> 30 30    1.4711365           NA    8.4745317         n      <NA>      <NA>
#> 31 31    9.6302423           NA    4.9752727         u      <NA>      <NA>
#> 32 32           NA  1.750526503           NA         m         w         m
#> 33 33    6.9070528  1.306956916    2.4644899         b      <NA>         x
#> 34 34    7.9546742  6.531019250           NA         k         n         q
#> 35 35    0.2461368           NA    3.8999444         m      <NA>      <NA>
#> 36 36    4.7779597  6.567581280           NA         n         a         r
#> 37 37    7.5845954  3.203732425    2.1689276         f      <NA>         b
#> 38 38           NA  1.876911193    4.4476800         y         j         e
#> 39 39    3.1818101  7.822943013           NA         h         q      <NA>
#> 40 40    2.3162579           NA    5.0229956         l         z      <NA>
#> 41 41    1.4280002  4.667790416    3.5390457         z         q      <NA>
#> 42 42    4.1454634  5.115054599    6.4998516         d         z         j
#> 43 43    4.1372433  5.999889593    3.7471396         m         u         f
#> 44 44    3.6884545  3.328235403           NA         n         g         j
#> 45 45    1.5244475  4.886130337    5.3368795         u         u      <NA>
#> 46 46    1.3880606  9.544738275    7.4033436         p         z      <NA>
#> 47 47    2.3303410           NA    2.2110294         w         i      <NA>
#> 48 48    4.6596245           NA    4.1274612         a         t         p
#> 49 49    2.6597264  9.144381869    2.6568669         h         f         c
#> 50 50    8.5782772           NA    6.2997305         h         r      <NA>

Created on 2024-01-10 by the reprex package (v2.0.1)

FJCC · January 10, 2024, 3:15pm

I think this does the first part of what you want. I'm not sure about the part
" I then want to sort across all columns simultaneously trying to have the missing values first followed by ascending order for numeric or ascending order for text. "
Do you want to sort the columns independently so you lose the correlation to the ID?

set.seed(123)

# Create ID column
ID <- 1:50

# Create numeric columns with random values and missing data
numeric_col1 <- runif(50, 0, 10)
numeric_col1[sample(1:50, 5)] <- NA

numeric_col2 <- runif(50, 0, 10)
numeric_col2[sample(1:50, 10)] <- NA

numeric_col3 <- runif(50, 0, 10)
numeric_col3[sample(1:50, 15)] <- NA

# Create text columns with random values and missing data
text_col1 <- sample(letters, 50, replace = TRUE)
text_col1[sample(1:50, 5)] <- NA

text_col2 <- sample(letters, 50, replace = TRUE)
text_col2[sample(1:50, 10)] <- NA

text_col3 <- sample(letters, 50, replace = TRUE)
text_col3[sample(1:50, 15)] <- NA

# Create the dataframe
df <- data.frame(ID, numeric_col1, numeric_col2, numeric_col3, text_col1, text_col2, text_col3)
NAcounts <- sapply(df[, 2:ncol(df)], function(Col) sum(is.na(Col)))
Col_positions <- order(NAcounts, decreasing = TRUE) + 1
dfNew <- df[,(c(1, Col_positions))]
dfNew
#>    ID numeric_col3 text_col3 numeric_col2 text_col2 numeric_col1 text_col1
#> 1   1    3.1982062         d  8.950453592         g    2.8757752         s
#> 2   2    3.0772001      <NA>  3.744627759      <NA>    7.8830514         x
#> 3   3    2.1976763      <NA>  6.651151946         z    4.0897692         t
#> 4   4    3.6948887         e  0.948406609         j    8.8301740         o
#> 5   5    9.8421920         u  3.839696378         x    9.4046728         g
#> 6   6    1.5420230         y  2.743836446         v    0.4555650         d
#> 7   7           NA         h  8.146400389         w           NA         a
#> 8   8    1.4190691         w  4.485163414         z    8.9241904      <NA>
#> 9   9           NA      <NA>  8.100643530         k    5.5143501      <NA>
#> 10 10    6.1925648         u  8.123895095         a    4.5661474      <NA>
#> 11 11           NA         m  7.943423211         y    9.5683335         p
#> 12 12    6.7299909         r           NA         s    4.5333416         x
#> 13 13           NA         j  7.544751586         j    6.7757064         v
#> 14 14    5.2113573         f           NA         u    5.7263340         k
#> 15 15    6.5983845         g  7.101824014         v    1.0292468         p
#> 16 16           NA         i  0.006247733         m    8.9982497         t
#> 17 17    7.8628155         p  4.753165741      <NA>    2.4608773         h
#> 18 18    9.7982192         q  2.201188852         k    0.4205953      <NA>
#> 19 19           NA         u  3.798165377         y    3.2792072         d
#> 20 20    3.1170220         m  6.127710033      <NA>    9.5450365         t
#> 21 21           NA         u  3.517979092         z    8.8953932         l
#> 22 22    0.1046711         h  1.111354243         g    6.9280341         v
#> 23 23    1.8384952      <NA>  2.436194727         y    6.4050681         q
#> 24 24           NA         g  6.680555874         w    9.9426978         j
#> 25 25           NA         t  4.176467797      <NA>           NA      <NA>
#> 26 26           NA         r  7.881958340         t    7.0853047         k
#> 27 27    0.7669117      <NA>  1.028646443         x           NA         y
#> 28 28    2.4572368         a  4.348927415         y    5.9414202         w
#> 29 29    7.3213521         b           NA      <NA>    2.8915974         h
#> 30 30    8.4745317      <NA>           NA      <NA>    1.4711365         n
#> 31 31    4.9752727      <NA>           NA      <NA>    9.6302423         u
#> 32 32           NA         m  1.750526503         w           NA         m
#> 33 33    2.4644899         x  1.306956916      <NA>    6.9070528         b
#> 34 34           NA         q  6.531019250         n    7.9546742         k
#> 35 35    3.8999444      <NA>           NA      <NA>    0.2461368         m
#> 36 36           NA         r  6.567581280         a    4.7779597         n
#> 37 37    2.1689276         b  3.203732425      <NA>    7.5845954         f
#> 38 38    4.4476800         e  1.876911193         j           NA         y
#> 39 39           NA      <NA>  7.822943013         q    3.1818101         h
#> 40 40    5.0229956      <NA>           NA         z    2.3162579         l
#> 41 41    3.5390457      <NA>  4.667790416         q    1.4280002         z
#> 42 42    6.4998516         j  5.115054599         z    4.1454634         d
#> 43 43    3.7471396         f  5.999889593         u    4.1372433         m
#> 44 44           NA         j  3.328235403         g    3.6884545         n
#> 45 45    5.3368795      <NA>  4.886130337         u    1.5244475         u
#> 46 46    7.4033436      <NA>  9.544738275         z    1.3880606         p
#> 47 47    2.2110294      <NA>           NA         i    2.3303410         w
#> 48 48    4.1274612         p           NA         t    4.6596245         a
#> 49 49    2.6568669         c  9.144381869         f    2.6597264         h
#> 50 50    6.2997305      <NA>           NA         r    8.5782772         h

^{Created on 2024-01-10 with reprex v2.0.2}

scottyd22 · January 10, 2024, 3:35pm

Hi @GreyMerchant. Here is another approach that attempts to get at the sorting you described.

# idenify missing values
missing = unlist(lapply(1:length(df), function(i) {sum(is.na(df[,i]))}))
names(missing) = names(df)
missing_arranged = sort(missing, decreasing = T) |> names()
missing_arranged = c('ID', missing_arranged[missing_arranged != 'ID'])

# set new column order
df = df[,missing_arranged]

# sort
library(tidyverse)

df |>
  # force NAs to -1 or '0' for sorting purposes
  mutate_if(is.numeric, ~replace_na(.,-1)) |>
  mutate_if(is.character, ~replace_na(.,'0')) |>
  arrange_at(2:length(df), ~.) |>
  # turn -1 and aaaaaaa back to NA
  mutate_if(is.numeric, ~ifelse(. == -1, NA, .)) |>
  mutate_if(is.character, ~ifelse(. == '0', NA, .))
#>    ID numeric_col3 text_col3 numeric_col2 text_col2 numeric_col1 text_col1
#> 1  39           NA      <NA>  7.822943013         q    3.1818101         h
#> 2   9           NA      <NA>  8.100643530         k    5.5143501      <NA>
#> 3  24           NA         g  6.680555874         w    9.9426978         j
#> 4   7           NA         h  8.146400389         w           NA         a
#> 5  16           NA         i  0.006247733         m    8.9982497         t
#> 6  44           NA         j  3.328235403         g    3.6884545         n
#> 7  13           NA         j  7.544751586         j    6.7757064         v
#> 8  32           NA         m  1.750526503         w           NA         m
#> 9  11           NA         m  7.943423211         y    9.5683335         p
#> 10 34           NA         q  6.531019250         n    7.9546742         k
#> 11 36           NA         r  6.567581280         a    4.7779597         n
#> 12 26           NA         r  7.881958340         t    7.0853047         k
#> 13 25           NA         t  4.176467797      <NA>           NA      <NA>
#> 14 21           NA         u  3.517979092         z    8.8953932         l
#> 15 19           NA         u  3.798165377         y    3.2792072         d
#> 16 22    0.1046711         h  1.111354243         g    6.9280341         v
#> 17 27    0.7669117      <NA>  1.028646443         x           NA         y
#> 18  8    1.4190691         w  4.485163414         z    8.9241904      <NA>
#> 19  6    1.5420230         y  2.743836446         v    0.4555650         d
#> 20 23    1.8384952      <NA>  2.436194727         y    6.4050681         q
#> 21 37    2.1689276         b  3.203732425      <NA>    7.5845954         f
#> 22  3    2.1976763      <NA>  6.651151946         z    4.0897692         t
#> 23 47    2.2110294      <NA>           NA         i    2.3303410         w
#> 24 28    2.4572368         a  4.348927415         y    5.9414202         w
#> 25 33    2.4644899         x  1.306956916      <NA>    6.9070528         b
#> 26 49    2.6568669         c  9.144381869         f    2.6597264         h
#> 27  2    3.0772001      <NA>  3.744627759      <NA>    7.8830514         x
#> 28 20    3.1170220         m  6.127710033      <NA>    9.5450365         t
#> 29  1    3.1982062         d  8.950453592         g    2.8757752         s
#> 30 41    3.5390457      <NA>  4.667790416         q    1.4280002         z
#> 31  4    3.6948887         e  0.948406609         j    8.8301740         o
#> 32 43    3.7471396         f  5.999889593         u    4.1372433         m
#> 33 35    3.8999444      <NA>           NA      <NA>    0.2461368         m
#> 34 48    4.1274612         p           NA         t    4.6596245         a
#> 35 38    4.4476800         e  1.876911193         j           NA         y
#> 36 31    4.9752727      <NA>           NA      <NA>    9.6302423         u
#> 37 40    5.0229956      <NA>           NA         z    2.3162579         l
#> 38 14    5.2113573         f           NA         u    5.7263340         k
#> 39 45    5.3368795      <NA>  4.886130337         u    1.5244475         u
#> 40 10    6.1925648         u  8.123895095         a    4.5661474      <NA>
#> 41 50    6.2997305      <NA>           NA         r    8.5782772         h
#> 42 42    6.4998516         j  5.115054599         z    4.1454634         d
#> 43 15    6.5983845         g  7.101824014         v    1.0292468         p
#> 44 12    6.7299909         r           NA         s    4.5333416         x
#> 45 29    7.3213521         b           NA      <NA>    2.8915974         h
#> 46 46    7.4033436      <NA>  9.544738275         z    1.3880606         p
#> 47 17    7.8628155         p  4.753165741      <NA>    2.4608773         h
#> 48 30    8.4745317      <NA>           NA      <NA>    1.4711365         n
#> 49 18    9.7982192         q  2.201188852         k    0.4205953      <NA>
#> 50  5    9.8421920         u  3.839696378         x    9.4046728         g

Created on 2024-01-10 with reprex v2.0.2

nirgrahamuk · January 10, 2024, 3:35pm

I anticipate that they intend to sort by this column order. arrange(everything() on dfNew, its only that arrange is inflexible about NA's insisting on putting them last; so thats the part that would need some attention.

An easy hack might be to convert NA's in this set all to -Inf and then convert them back if needed.

a more complete / general approach migh be like

order_with_na_last <- purrr::partial(.f=order,na.last=FALSE)
df_fin <- dfNew[do.call(order_with_na_last ,dfNew),]

jrkrideau · January 10, 2024, 4:13pm

Is base::order as inflexible?

nirgrahamuk · January 10, 2024, 4:30pm

I would say no, as it gives an na.last paramater by which it could be sent to the back, or sent to the front as the original poster asked for.

jrkrideau · January 10, 2024, 4:35pm

I had that feeling but hoped.

dromano · January 10, 2024, 6:13pm

An alternative might be to use is.na() in arrange():

library(tidyverse)
tibble(x = 1:3, y = 1) |> 
  bind_rows(tibble(y = 1:2))
#> # A tibble: 5 × 2
#>       x     y
#>   <int> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     3     1
#> 4    NA     1
#> 5    NA     2

tibble(x = 1:3, y = 1) |> 
  bind_rows(tibble(y = 1:2)) |> 
  arrange(desc(is.na(x)), x)
#> # A tibble: 5 × 2
#>       x     y
#>   <int> <dbl>
#> 1    NA     1
#> 2    NA     2
#> 3     1     1
#> 4     2     1
#> 5     3     1

^{Created on 2024-01-10 with reprex v2.0.2}

system · January 31, 2024, 6:13pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.