What's the difference between an ordered variable and a factor?

davidhodge931 · April 25, 2024, 3:50am

I noticed these ordered variables in the diamonds practice dataset.

Any ideas much appreciated

Thanks
David

FJCC · April 25, 2024, 5:25am

Imagine analyzing the life expectancy of a population based on three characteristics: the province where the person was born, their level of education, and their income. The analysis by place of birth uses an unordered variable. You can't say one province comes before another province in any meaningful way. They are simply labels for the data. Education could be an ordered variable if the information you have is whether each person attended no more than primary school, finished secondary, attended some college, etc. You can place the categories in order, but you can't really say how much "bigger" one category is than another. Finally, using income, you can quantitatively order the samples. You can say how much bigger one income is than another.
The label "ordered" is generally used for cases like education, where an order exists but it isn't quantitative.
In my experience, it is very common to pretend that qualitatively ordered variables are actually quantitatively ordered. For example, the different levels of education could get mapped to integers, 1, 2, 3 etc. and those "numbers" are used to calculate regressions or other statistical quantities. A lot of meaningless analysis is the result.
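
The three kinds of variable described above can be sketched in base R. The data here are made up purely for illustration: the province names, education levels, and incomes are all hypothetical.

```r
# Hypothetical toy data: three people, three kinds of variable
province <- factor(c("North", "South", "North"))   # unordered (nominal)
education <- factor(
  c("primary", "college", "secondary"),
  levels  = c("primary", "secondary", "college"),
  ordered = TRUE
)                                                   # ordered (ordinal)
income <- c(21000, 54000, 33000)                    # quantitative

# Comparisons are defined for ordered factors...
education[1] < education[2]
#> [1] TRUE

# ...but not for unordered ones: this returns NA (with a warning)
suppressWarnings(province[1] < province[2])
#> [1] NA

# Integer codes record only the position in the level order, not a distance
as.integer(education)
#> [1] 1 3 2
```

Treating the codes 1, 2, 3 as numbers would assume that each step between adjacent levels is the same size, which is information the ordered factor does not carry.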

davidhodge931 · April 25, 2024, 6:49am

Thanks @FJCC. In that case, is an ordered variable the same as a factor that has levels assigned?

So in the code below, what is the difference between the color and color2 variables?

library(tidyverse)

d <- diamonds |> 
  distinct(color) |> 
  mutate(color2 = factor(as.character(color), levels = LETTERS[4:10])) 

class(d$color)
#> [1] "ordered" "factor"
class(d$color2)
#> [1] "factor"

Created on 2024-04-25 with reprex v2.1.0

margusl · April 25, 2024, 9:26am

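Factors always have levels ("D", "E", ...). For ordered factors, those levels are also ordered ("D" < "E" < ...). For example, max(d$color) works, but max(d$color2) raises an error, because the levels of an unordered factor have no defined order even when you list them in a particular sequence.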

diamonds$cut is perhaps a better example:

data("diamonds", package = "ggplot2")
str(diamonds$cut)
#>  Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
range(diamonds$cut)
#> [1] Fair  Ideal
#> Levels: Fair < Good < Very Good < Premium < Ideal

# drop order
cut2 <- diamonds$cut |> as.character() |> factor()
str(cut2)
#>  Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
range(cut2)
#> Error in Summary.factor(structure(c(3L, 4L, 2L, 4L, 2L, 5L, 5L, 5L, 1L, : 'range' not meaningful for factors

Created on 2024-04-25 with reprex v2.1.0
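
To connect this back to the color vs. color2 question, here is a minimal base R sketch of how to check whether a factor is ordered and how to convert between the two. The vector `x` is a hypothetical toy example, not the diamonds data.

```r
# Hypothetical toy vector, not the diamonds data
x <- c("D", "F", "E", "D")

f <- factor(x, levels = c("D", "E", "F"))                  # levels set, but unordered
o <- factor(x, levels = c("D", "E", "F"), ordered = TRUE)  # same levels, ordered

is.ordered(f)
#> [1] FALSE
is.ordered(o)
#> [1] TRUE

# The stored level sequence is identical; only the class differs
identical(levels(f), levels(o))
#> [1] TRUE

# Convert in either direction while keeping the level sequence
is.ordered(as.ordered(f))
#> [1] TRUE
is.ordered(factor(o, ordered = FALSE))
#> [1] FALSE

# Order-based summaries work only on the ordered version
as.character(max(o))
#> [1] "F"
```

So passing `levels =` to factor() fixes the sequence in which levels are stored and displayed, but it does not by itself make the factor ordered.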

strengejacke · April 25, 2024, 10:02am

Probably the most obvious difference appears when you use a factor vs. an ordered factor in a regression model:

data(diamonds, package = "ggplot2")
diamonds$cut_f <- factor(diamonds$cut, ordered = FALSE)
m1 <- lm(price ~ cut, data = diamonds)
m2 <- lm(price ~ cut_f, data = diamonds)

parameters::compare_parameters(m1, m2)
#> Parameter         |                         m1 |                          m2
#> ----------------------------------------------------------------------------
#> (Intercept)       | 4062.24 (4012.45, 4112.02) | 4358.76 ( 4165.13, 4552.38)
#> cut (linear)      | -362.73 (-496.08, -229.37) |                            
#> cut (quadratic)   | -225.58 (-344.45, -106.71) |                            
#> cut (cubic)       | -699.50 (-802.94, -596.05) |                            
#> cut (4th degree)  | -280.36 (-363.77, -196.94) |                            
#> cut f (Good)      |                            | -429.89 ( -653.04, -206.75)
#> cut f (Very Good) |                            | -377.00 ( -583.12, -170.88)
#> cut f (Premium)   |                            |  225.50 (   20.88,  430.12)
#> cut f (Ideal)     |                            | -901.22 (-1101.94, -700.49)
#> ----------------------------------------------------------------------------
#> Observations      |                      53940 |                       53940

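The coefficients differ because R, by default, codes ordered factors with polynomial contrasts (linear, quadratic, ...) and unordered factors with treatment contrasts, but both models fit the same group means. That's another reason why adjusted predictions (estimated marginal means, ...) are useful when it comes to interpreting model results: the predicted outcome is the same whether or not the factor is ordered.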

ggeffects::predict_response(m1, "cut")
#> # Predicted values of price
#> 
#> cut       | Predicted |           95% CI
#> ----------------------------------------
#> Fair      |   4358.76 | 4165.13, 4552.38
#> Good      |   3928.86 | 3817.94, 4039.78
#> Very Good |   3981.76 | 3911.08, 4052.44
#> Premium   |   4584.26 | 4518.10, 4650.41
#> Ideal     |   3457.54 | 3404.62, 3510.46

ggeffects::predict_response(m2, "cut_f")
#> # Predicted values of price
#> 
#> cut_f     | Predicted |           95% CI
#> ----------------------------------------
#> Fair      |   4358.76 | 4165.13, 4552.38
#> Good      |   3928.86 | 3817.94, 4039.78
#> Very Good |   3981.76 | 3911.08, 4052.44
#> Premium   |   4584.26 | 4518.10, 4650.41
#> Ideal     |   3457.54 | 3404.62, 3510.46
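
The identical predictions above can also be checked without extra packages. The following is a rough sketch on simulated toy data (the `toy` data frame and its column names are hypothetical), assuming R's default contrasts option has not been changed.

```r
# Simulated toy data (hypothetical), base R only
set.seed(1)
grade <- factor(
  rep(c("low", "mid", "high"), times = 20),
  levels = c("low", "mid", "high")
)
toy <- data.frame(
  y       = rnorm(60),
  grade_f = grade,              # unordered factor
  grade_o = as.ordered(grade)   # ordered factor, same levels
)

# Default contrasts: treatment for unordered, polynomial for ordered factors
getOption("contrasts")

m_f <- lm(y ~ grade_f, data = toy)
m_o <- lm(y ~ grade_o, data = toy)

# Coefficient names reflect the two different codings
names(coef(m_f))
#> [1] "(Intercept)" "grade_fmid"  "grade_fhigh"
names(coef(m_o))
#> [1] "(Intercept)" "grade_o.L"   "grade_o.Q"

# Same fitted values: only the parameterization differs
all.equal(unname(fitted(m_f)), unname(fitted(m_o)))
#> [1] TRUE
```

The two models span the same column space, so choosing between an ordered and an unordered factor changes how the coefficients are read, not what the model predicts.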
