I have a bit of a weird one. I have this reprex below showing a toy version of my df. My actual dataframe looks very different and has a lot more columns.
I want to find a way to arrange my columns by keeping ID first followed by the columns that have the most missing values to least. I then want to sort across all columns simultaneously trying to have the missing values first followed by ascending order for numeric or ascending order for text. How can I do this?
# Set seed for reproducibility
set.seed(123)
# Create ID column
ID <- 1:50
# Create numeric columns with random values and missing data
numeric_col1 <- runif(50, 0, 10)
numeric_col1[sample(1:50, 5)] <- NA
numeric_col2 <- runif(50, 0, 10)
numeric_col2[sample(1:50, 10)] <- NA
numeric_col3 <- runif(50, 0, 10)
numeric_col3[sample(1:50, 15)] <- NA
# Create text columns with random values and missing data
text_col1 <- sample(letters, 50, replace = TRUE)
text_col1[sample(1:50, 5)] <- NA
text_col2 <- sample(letters, 50, replace = TRUE)
text_col2[sample(1:50, 10)] <- NA
text_col3 <- sample(letters, 50, replace = TRUE)
text_col3[sample(1:50, 15)] <- NA
# Create the dataframe
df <- data.frame(ID, numeric_col1, numeric_col2, numeric_col3, text_col1, text_col2, text_col3)
# Display the resulting dataframe
print(df)
#> ID numeric_col1 numeric_col2 numeric_col3 text_col1 text_col2 text_col3
#> 1 1 2.8757752 8.950453592 3.1982062 s g d
#> 2 2 7.8830514 3.744627759 3.0772001 x <NA> <NA>
#> 3 3 4.0897692 6.651151946 2.1976763 t z <NA>
#> 4 4 8.8301740 0.948406609 3.6948887 o j e
#> 5 5 9.4046728 3.839696378 9.8421920 g x u
#> 6 6 0.4555650 2.743836446 1.5420230 d v y
#> 7 7 NA 8.146400389 NA a w h
#> 8 8 8.9241904 4.485163414 1.4190691 <NA> z w
#> 9 9 5.5143501 8.100643530 NA <NA> k <NA>
#> 10 10 4.5661474 8.123895095 6.1925648 <NA> a u
#> 11 11 9.5683335 7.943423211 NA p y m
#> 12 12 4.5333416 NA 6.7299909 x s r
#> 13 13 6.7757064 7.544751586 NA v j j
#> 14 14 5.7263340 NA 5.2113573 k u f
#> 15 15 1.0292468 7.101824014 6.5983845 p v g
#> 16 16 8.9982497 0.006247733 NA t m i
#> 17 17 2.4608773 4.753165741 7.8628155 h <NA> p
#> 18 18 0.4205953 2.201188852 9.7982192 <NA> k q
#> 19 19 3.2792072 3.798165377 NA d y u
#> 20 20 9.5450365 6.127710033 3.1170220 t <NA> m
#> 21 21 8.8953932 3.517979092 NA l z u
#> 22 22 6.9280341 1.111354243 0.1046711 v g h
#> 23 23 6.4050681 2.436194727 1.8384952 q y <NA>
#> 24 24 9.9426978 6.680555874 NA j w g
#> 25 25 NA 4.176467797 NA <NA> <NA> t
#> 26 26 7.0853047 7.881958340 NA k t r
#> 27 27 NA 1.028646443 0.7669117 y x <NA>
#> 28 28 5.9414202 4.348927415 2.4572368 w y a
#> 29 29 2.8915974 NA 7.3213521 h <NA> b
#> 30 30 1.4711365 NA 8.4745317 n <NA> <NA>
#> 31 31 9.6302423 NA 4.9752727 u <NA> <NA>
#> 32 32 NA 1.750526503 NA m w m
#> 33 33 6.9070528 1.306956916 2.4644899 b <NA> x
#> 34 34 7.9546742 6.531019250 NA k n q
#> 35 35 0.2461368 NA 3.8999444 m <NA> <NA>
#> 36 36 4.7779597 6.567581280 NA n a r
#> 37 37 7.5845954 3.203732425 2.1689276 f <NA> b
#> 38 38 NA 1.876911193 4.4476800 y j e
#> 39 39 3.1818101 7.822943013 NA h q <NA>
#> 40 40 2.3162579 NA 5.0229956 l z <NA>
#> 41 41 1.4280002 4.667790416 3.5390457 z q <NA>
#> 42 42 4.1454634 5.115054599 6.4998516 d z j
#> 43 43 4.1372433 5.999889593 3.7471396 m u f
#> 44 44 3.6884545 3.328235403 NA n g j
#> 45 45 1.5244475 4.886130337 5.3368795 u u <NA>
#> 46 46 1.3880606 9.544738275 7.4033436 p z <NA>
#> 47 47 2.3303410 NA 2.2110294 w i <NA>
#> 48 48 4.6596245 NA 4.1274612 a t p
#> 49 49 2.6597264 9.144381869 2.6568669 h f c
#> 50 50 8.5782772 NA 6.2997305 h r <NA>
I think this does the first part of what you want. I'm not sure about the part
" I then want to sort across all columns simultaneously trying to have the missing values first followed by ascending order for numeric or ascending order for text. "
Do you want to sort the columns independently so you lose the correlation to the ID?
set.seed(123)
# Create ID column
ID <- 1:50
# Create numeric columns with random values and missing data
numeric_col1 <- runif(50, 0, 10)
numeric_col1[sample(1:50, 5)] <- NA
numeric_col2 <- runif(50, 0, 10)
numeric_col2[sample(1:50, 10)] <- NA
numeric_col3 <- runif(50, 0, 10)
numeric_col3[sample(1:50, 15)] <- NA
# Create text columns with random values and missing data
text_col1 <- sample(letters, 50, replace = TRUE)
text_col1[sample(1:50, 5)] <- NA
text_col2 <- sample(letters, 50, replace = TRUE)
text_col2[sample(1:50, 10)] <- NA
text_col3 <- sample(letters, 50, replace = TRUE)
text_col3[sample(1:50, 15)] <- NA
# Create the dataframe
df <- data.frame(ID, numeric_col1, numeric_col2, numeric_col3, text_col1, text_col2, text_col3)
NAcounts <- sapply(df[, 2:ncol(df)], function(Col) sum(is.na(Col)))
Col_positions <- order(NAcounts, decreasing = TRUE) + 1
dfNew <- df[,(c(1, Col_positions))]
dfNew
#> ID numeric_col3 text_col3 numeric_col2 text_col2 numeric_col1 text_col1
#> 1 1 3.1982062 d 8.950453592 g 2.8757752 s
#> 2 2 3.0772001 <NA> 3.744627759 <NA> 7.8830514 x
#> 3 3 2.1976763 <NA> 6.651151946 z 4.0897692 t
#> 4 4 3.6948887 e 0.948406609 j 8.8301740 o
#> 5 5 9.8421920 u 3.839696378 x 9.4046728 g
#> 6 6 1.5420230 y 2.743836446 v 0.4555650 d
#> 7 7 NA h 8.146400389 w NA a
#> 8 8 1.4190691 w 4.485163414 z 8.9241904 <NA>
#> 9 9 NA <NA> 8.100643530 k 5.5143501 <NA>
#> 10 10 6.1925648 u 8.123895095 a 4.5661474 <NA>
#> 11 11 NA m 7.943423211 y 9.5683335 p
#> 12 12 6.7299909 r NA s 4.5333416 x
#> 13 13 NA j 7.544751586 j 6.7757064 v
#> 14 14 5.2113573 f NA u 5.7263340 k
#> 15 15 6.5983845 g 7.101824014 v 1.0292468 p
#> 16 16 NA i 0.006247733 m 8.9982497 t
#> 17 17 7.8628155 p 4.753165741 <NA> 2.4608773 h
#> 18 18 9.7982192 q 2.201188852 k 0.4205953 <NA>
#> 19 19 NA u 3.798165377 y 3.2792072 d
#> 20 20 3.1170220 m 6.127710033 <NA> 9.5450365 t
#> 21 21 NA u 3.517979092 z 8.8953932 l
#> 22 22 0.1046711 h 1.111354243 g 6.9280341 v
#> 23 23 1.8384952 <NA> 2.436194727 y 6.4050681 q
#> 24 24 NA g 6.680555874 w 9.9426978 j
#> 25 25 NA t 4.176467797 <NA> NA <NA>
#> 26 26 NA r 7.881958340 t 7.0853047 k
#> 27 27 0.7669117 <NA> 1.028646443 x NA y
#> 28 28 2.4572368 a 4.348927415 y 5.9414202 w
#> 29 29 7.3213521 b NA <NA> 2.8915974 h
#> 30 30 8.4745317 <NA> NA <NA> 1.4711365 n
#> 31 31 4.9752727 <NA> NA <NA> 9.6302423 u
#> 32 32 NA m 1.750526503 w NA m
#> 33 33 2.4644899 x 1.306956916 <NA> 6.9070528 b
#> 34 34 NA q 6.531019250 n 7.9546742 k
#> 35 35 3.8999444 <NA> NA <NA> 0.2461368 m
#> 36 36 NA r 6.567581280 a 4.7779597 n
#> 37 37 2.1689276 b 3.203732425 <NA> 7.5845954 f
#> 38 38 4.4476800 e 1.876911193 j NA y
#> 39 39 NA <NA> 7.822943013 q 3.1818101 h
#> 40 40 5.0229956 <NA> NA z 2.3162579 l
#> 41 41 3.5390457 <NA> 4.667790416 q 1.4280002 z
#> 42 42 6.4998516 j 5.115054599 z 4.1454634 d
#> 43 43 3.7471396 f 5.999889593 u 4.1372433 m
#> 44 44 NA j 3.328235403 g 3.6884545 n
#> 45 45 5.3368795 <NA> 4.886130337 u 1.5244475 u
#> 46 46 7.4033436 <NA> 9.544738275 z 1.3880606 p
#> 47 47 2.2110294 <NA> NA i 2.3303410 w
#> 48 48 4.1274612 p NA t 4.6596245 a
#> 49 49 2.6568669 c 9.144381869 f 2.6597264 h
#> 50 50 6.2997305 <NA> NA r 8.5782772 h
Hi @GreyMerchant. Here is another approach that attempts to get at the sorting you described.
# idenify missing values
missing = unlist(lapply(1:length(df), function(i) {sum(is.na(df[,i]))}))
names(missing) = names(df)
missing_arranged = sort(missing, decreasing = T) |> names()
missing_arranged = c('ID', missing_arranged[missing_arranged != 'ID'])
# set new column order
df = df[,missing_arranged]
# sort
library(tidyverse)
df |>
# force NAs to -1 or '0' for sorting purposes
mutate_if(is.numeric, ~replace_na(.,-1)) |>
mutate_if(is.character, ~replace_na(.,'0')) |>
arrange_at(2:length(df), ~.) |>
# turn -1 and aaaaaaa back to NA
mutate_if(is.numeric, ~ifelse(. == -1, NA, .)) |>
mutate_if(is.character, ~ifelse(. == '0', NA, .))
#> ID numeric_col3 text_col3 numeric_col2 text_col2 numeric_col1 text_col1
#> 1 39 NA <NA> 7.822943013 q 3.1818101 h
#> 2 9 NA <NA> 8.100643530 k 5.5143501 <NA>
#> 3 24 NA g 6.680555874 w 9.9426978 j
#> 4 7 NA h 8.146400389 w NA a
#> 5 16 NA i 0.006247733 m 8.9982497 t
#> 6 44 NA j 3.328235403 g 3.6884545 n
#> 7 13 NA j 7.544751586 j 6.7757064 v
#> 8 32 NA m 1.750526503 w NA m
#> 9 11 NA m 7.943423211 y 9.5683335 p
#> 10 34 NA q 6.531019250 n 7.9546742 k
#> 11 36 NA r 6.567581280 a 4.7779597 n
#> 12 26 NA r 7.881958340 t 7.0853047 k
#> 13 25 NA t 4.176467797 <NA> NA <NA>
#> 14 21 NA u 3.517979092 z 8.8953932 l
#> 15 19 NA u 3.798165377 y 3.2792072 d
#> 16 22 0.1046711 h 1.111354243 g 6.9280341 v
#> 17 27 0.7669117 <NA> 1.028646443 x NA y
#> 18 8 1.4190691 w 4.485163414 z 8.9241904 <NA>
#> 19 6 1.5420230 y 2.743836446 v 0.4555650 d
#> 20 23 1.8384952 <NA> 2.436194727 y 6.4050681 q
#> 21 37 2.1689276 b 3.203732425 <NA> 7.5845954 f
#> 22 3 2.1976763 <NA> 6.651151946 z 4.0897692 t
#> 23 47 2.2110294 <NA> NA i 2.3303410 w
#> 24 28 2.4572368 a 4.348927415 y 5.9414202 w
#> 25 33 2.4644899 x 1.306956916 <NA> 6.9070528 b
#> 26 49 2.6568669 c 9.144381869 f 2.6597264 h
#> 27 2 3.0772001 <NA> 3.744627759 <NA> 7.8830514 x
#> 28 20 3.1170220 m 6.127710033 <NA> 9.5450365 t
#> 29 1 3.1982062 d 8.950453592 g 2.8757752 s
#> 30 41 3.5390457 <NA> 4.667790416 q 1.4280002 z
#> 31 4 3.6948887 e 0.948406609 j 8.8301740 o
#> 32 43 3.7471396 f 5.999889593 u 4.1372433 m
#> 33 35 3.8999444 <NA> NA <NA> 0.2461368 m
#> 34 48 4.1274612 p NA t 4.6596245 a
#> 35 38 4.4476800 e 1.876911193 j NA y
#> 36 31 4.9752727 <NA> NA <NA> 9.6302423 u
#> 37 40 5.0229956 <NA> NA z 2.3162579 l
#> 38 14 5.2113573 f NA u 5.7263340 k
#> 39 45 5.3368795 <NA> 4.886130337 u 1.5244475 u
#> 40 10 6.1925648 u 8.123895095 a 4.5661474 <NA>
#> 41 50 6.2997305 <NA> NA r 8.5782772 h
#> 42 42 6.4998516 j 5.115054599 z 4.1454634 d
#> 43 15 6.5983845 g 7.101824014 v 1.0292468 p
#> 44 12 6.7299909 r NA s 4.5333416 x
#> 45 29 7.3213521 b NA <NA> 2.8915974 h
#> 46 46 7.4033436 <NA> 9.544738275 z 1.3880606 p
#> 47 17 7.8628155 p 4.753165741 <NA> 2.4608773 h
#> 48 30 8.4745317 <NA> NA <NA> 1.4711365 n
#> 49 18 9.7982192 q 2.201188852 k 0.4205953 <NA>
#> 50 5 9.8421920 u 3.839696378 x 9.4046728 g
I anticipate that they intend to sort by this column order. arrange(everything() on dfNew, its only that arrange is inflexible about NA's insisting on putting them last; so thats the part that would need some attention.
An easy hack might be to convert NA's in this set all to -Inf and then convert them back if needed.