Sorting ordinal Data

Der_Blauebaer · August 1, 2020, 12:39pm

Hi here is my second try, this time with a REPREX (I hope it works).

I want to sort ordinal data (a factor) from 104 observations in 43 variables (first is the Name of the Observation and 42 variables) coded as 1, 2, and 3. Naturally R thinks it is a numerical data. So I changed the Data typ.

The question would be if R can arrange the data in ascending order. It is biological data so we have overlapping and missing data. R should arrange the observations with as little overlap as possible.

Which package should be used and which commands?

My REPREX would look like this:

Ind. <- c("a", "b", "c", "d", "e", "f")
obs.1 <- c(1, 1, 2, NA, 2, 3)
obs.2 <- c(2, 2, 3, 3, 2, NA)
obs.3 <- c(1, 2, 2, NA, 3, NA)
obs.4 <- c(1, 2, 3, 3, 2, 3)
DATA <- data.frame(Ind., obs.1, obs.2, obs.3, obs.4)   

library(dplyr)
DATA <- DATA %>%
  mutate_if(is.numeric, ~as.factor(as.character(.)))

The Solution would look like row as a, b, c, f, d, e, an columns as Ind., obs.1, obs.3. Obs.2 and obs.4, obs.3.

toryn_stat · August 1, 2020, 1:56pm

Hello, could you provide a little more information on your arranging criteria? What does it mean with as little overlap as possible? Also, your solution has a row g, but your example does not.

Der_Blauebaer · August 1, 2020, 2:04pm

You are right there is no row g, thank you. But the only criteria is ascending and little overlap, all observations are equivalent importent. There is overlap in the data, like in Ind. f where the obs.3 is higher than obs.2 and obs.4 unlike any other cases.

toryn_stat · August 1, 2020, 2:16pm

Ok. It's still not clear what you mean be little overlap. Ind. f has an NA for obs.3. Could you provide a quantitative description of little overlap? Is it the variability in obs.1-obs.4? So you want to sort columns by variability and rows by number of missing observations?

Der_Blauebaer · August 1, 2020, 2:32pm

It is developmental data, all places I observed ascend from 1 to 3 during maturation. I want the order in which they ascend and how well developed the Individual is compared to the population. But because it is biological data we have places where the individual develops faster or slower depending on abiological factors. It is not possible to just filter these places or individuals out, because every observation and most Individuals have these Problems. Missing data is no problem in the evaluation but shouldn't be filtered out because nearly every Individual has missing data. I hope that makes things clearer.

toryn_stat · August 1, 2020, 2:48pm

I appreciate the insight into the data, however, it still doesn't clarify the task at hand. If you could provide a quantifiable measure on which you wish to sort the data such as arrange rows by increasing number of missing observations, then I could attempt to provide a solution.

Der_Blauebaer · August 1, 2020, 3:46pm

I want to sort the Rows by the over all ascension of the columns and vice versa.

|2|3|2| → |1|2|3|
|3|3|3| → |2|2|3|
|1|3|2| → |3|3|3|

The problem is that I cant just calculate an average because it is ordinal data, data is missing and some observation per Individual are not in order.
Im not sure if there is even a comand for that or Im just asking at the wrong place. If the data wouldn't be that complex it would be rather simpel.

mrmallironmaker · August 1, 2020, 5:52pm

I think the trouble is that (if I understand it) the process you have described is not well-defined enough for a computer to process it. There are a couple of ways I might formalize your intuition:

These use notation M(x,y) to get the item in the matrix at column x and row y.

Arrange the rows and columns such that EVERY row and EVERY column is increasing. If they cant't be put in that way, then throw an error.
Arrange the rows and columns such that there are fewest instances in which M(x,y) > M(x,y+1) or M(x,y) > M(x+1, y)
Arrange the rows and columns such that there are fewest instances in which M(x,y) > M(x,z) for y < z or M(x,y) > M(z, y) for x < z.

These definitions certainly lead to different solutions, so it is important to be sure you have a precise knowledge of what you want.

#1 is easiest to code, but may give you an error if the results aren't perfectly clear.

Does that make sense? Hope that helps.

Der_Blauebaer · August 1, 2020, 6:09pm

Thank you that helps al lot. #1 would always lead to an error with around 90 % of my data and i think that would be easly possibel in Excel. #2 and #3 are what I want. What would be needed for #2 and #3? #2 would be best bcause most of the deviations are "defects" at a singel point.

mrmallironmaker · August 2, 2020, 5:25am

I'm not aware of an algorithm that does 2 or 3, unfortunately. My intuition is that it's an NP-hard problem, which means in the worst case you can't do much better than trying all combinations and seeing which is the best.

I think it would be easier (still not super easy) if you only cared about rows, or columns.

If you're OK with approximate results, that can be done, but it might some cleverness to program it.

Unless the levels 1,2,3 are dramatically nonlinear, I'd be pretty comfortable averaging each row and column, and sorting them.

Of course, I'm no expert in algorithms, so I may be wrong. Always a plausible option

If you can tell me what context you're using these results in, I might be able to tell you which method I'd recommend.

Der_Blauebaer · August 2, 2020, 8:41am

In Excel I already experimented with averaging rows and columns and sorting them that way. Because it is ordinal data it is not quit right to do so but the results looked good. I hoped there was an already programed solution in R. Programming an algorithmus is beyond my knowledge and I think nobody will do it for fun so i guss it is the average for me. Thank you all!

nirgrahamuk · August 2, 2020, 9:54am

It seems like you want to reorder the obs columns (differently for each indicator) so that the values ascend, and then apply an ordering of the rows that reflects the rising minimal obs values. something like

Ind. <- c("a", "b", "c", "d", "e", "f")
obs.1 <- c(1, 1, 2, NA, 2, 3)
obs.2 <- c(2, 2, 3, 3, 2, NA)
obs.3 <- c(1, 2, 2, NA, 3, NA)
obs.4 <- c(1, 2, 3, 3, 2, 3)
DATA <- data.frame(Ind., obs.1, obs.2, obs.3, obs.4)   

library(tidyverse)

DATA <- mutate_if(DATA,
               is.numeric, ~if_else(is.na(.),999,.))
# for a given ind, we dont care which obs gave which value, we just want the values in ascending order...
 
DATA1 <- nest_by(DATA,Ind.)
DATA2 <- mutate(DATA1,
               datav = list(sort(as.matrix(data)))) %>% select(-data)

DATA2$datap <- purrr::map_chr(DATA2$datav,
                    ~paste0(as.character(.),collapse = ","))

DATA3 <- tidyr::separate(DATA2,datap,paste0("var_",1:4)) %>% select(-datav) 
DATA4 <- mutate_at(DATA3,
                   paste0("var_",1:4),~factor(.,levels=c("1","2","3")))

DATA5 <- arrange(DATA4,var_1,var_2,var_3,var_4)
# A tibble: 6 x 5
# Ind.  var_1 var_2 var_3 var_4
# <chr> <fct> <fct> <fct> <fct>
# 1 a     1     1     1     2    
# 2 b     1     2     2     2    
# 3 e     2     2     2     3    
# 4 c     2     2     3     3    
# 5 d     3     3     NA    NA   
# 6 f     3     3     NA    NA

Der_Blauebaer · August 2, 2020, 10:41am

Thank you, but that is not what I want. The rows and colums must stay the same but their order should be changed. The argument would be like mrmallironmaker said:
Use notation M(x,y) to get the item in the matrix at column x and row y.

Arrange the rows and columns such that there are fewest instances in which M(x,y) > M(x,y+1) or M(x,y) > M(x+1, y)

nirgrahamuk · August 2, 2020, 11:05am

That's fine, though it would seem to conflict with the given example.

Im curious as to the context behind, this, whynit would be desirable to come up with a rough column ordering with the requested properties. Is it intended to simply be pleasing to the eye?

Der_Blauebaer · August 2, 2020, 11:42am

I'm in historical anthropology, these are the epiphysis of a population (they are importend for bone growth). I want to order the rows that would tell me how well developed an Individuum is compared to the population and sorting the rows would tell me the order in which the epiphysis develop. I could compare the order of epiphysis development to modern populations with known age. And yeah by coloring the Boxes of 1, 2 or 3 it gives a nice graphic.

nirgrahamuk · August 2, 2020, 12:06pm

Thanks for explaining.
I understand if you'd prefer to skip over my subsequent questions, as they may be a distraction from your work but now I'm curious about this data...
So the Indi. a,b,c represent people that were studied, "jane","john" , etc.
the values are measurments of the size of a particular part of a particular bone. Why would these be factors / ordered levels rather than a measure of distance , cm/inches ?
Also, are the obs.1 , obs.2 columns representing fixed regular intervals of observation ?
Would observations be expected to significantly fluctaute (i .e. a 2 -> 3 -> 2 pattern) may be common ?

I think if observations were taken with regularity, and if the value measures represented bone size in a numeric way, then it might be reasonable to fit simple linear models , to decomopose each individuals pattern into slope and intercept and analyse them through such a lense.

Der_Blauebaer · August 2, 2020, 12:35pm

That was just an exampel my data looks more like this picture below. They are france soldiers that died in Rödelheim, the data is a snapshot. The Epihysis are a cartilage plate that ossify (1 = not ossified, 2 = beginning ossification and 3 = completly ossified). These Observation are 42 points at the body from from shoulder to toe. Some places are out of order, a good exampel is a point at the scapula where the acromion not fuses, in other context this symptom is associated with archery. I am very fond with my reaserch field so i'm happy if someone is interested.

nirgrahamuk · August 2, 2020, 12:38pm

Thanks for explaining how the measurements relate to a qualititive factor ( degree of ossification), also understand that the observations are not temporal but corresponding to locations on the body..I'll think about this some more.

nirgrahamuk · August 7, 2020, 8:06am

If the goal is summarisation and visual representation, then simply summarising each sample into 4 numbers (rather than 42) might be effective. 4 variables which are the count of how many sites were at a given grade of ossification (or undetermined).

Der_Blauebaer · August 10, 2020, 10:46am

For visualization I have taken the average of every row and every column and sorted the data like this. For visualization this approach seems to be okay, but it shows that the data is a mess. I will try your idea but I don't know how to sort by this 4 variables either. Thank you for your input and sorry for my late reply.