I am working with a forced-ranking task in which 22 participants ranked 6 scenarios from easiest to hardest (ranks 1–6, no ties allowed). Each participant ranked all 6 scenarios (fully crossed design).
My goal is to validate a relative difficulty ordering among the scenarios (e.g., whether certain request types are perceived as more difficult than others), not to estimate an interval-scale difficulty measure.
Because the data are ordinal ranks, I computed Kendall’s W as the primary agreement measure. In one condition W is moderate; in the other it is weaker but still statistically significant.
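For reference, this is essentially how I computed W (a minimal sketch on an illustrative rank matrix, not my actual data):

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m raters x n items)
    matrix of untied ranks."""
    m, n = ranks.shape
    col_sums = ranks.sum(axis=0)                      # total rank per item
    s = ((col_sums - col_sums.mean()) ** 2).sum()     # dispersion of rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Sanity check: 22 raters all producing the identical ordering gives W = 1
perfect = np.tile(np.arange(1, 7), (22, 1))
print(kendalls_w(perfect))  # -> 1.0
```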
Out of curiosity, and as possible triangulation, I also computed the intraclass correlation coefficient (ICC; two-way model, absolute agreement) by treating the ranks (1–6) as numeric scores. The results show:
- relatively low single-measure ICC,
- but quite high average-measure ICC in one condition (around .90),
- and moderate average-measure ICC (around .70) in the other.
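Concretely, I used the standard two-way ANOVA decomposition with scenarios as targets and participants as raters; the sketch below uses made-up ranks, not my data. One thing I noticed while setting this up: under forced ranking every rater's column is a permutation of 1–6, so every rater mean is exactly 3.5 and the rater (column) mean square is zero, which makes "absolute agreement" coincide with "consistency" here.

```python
import numpy as np

def icc_two_way(scores: np.ndarray):
    """ICC(2,1) and ICC(2,k) (two-way random effects, absolute agreement)
    for an (n targets x k raters) matrix, following Shrout & Fleiss (1979)."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-target (scenario) means
    col_means = scores.mean(axis=0)   # per-rater means (all 3.5 for forced ranks)
    ssr = k * ((row_means - grand) ** 2).sum()
    ssc = n * ((col_means - grand) ** 2).sum()
    sst = ((scores - grand) ** 2).sum()
    sse = sst - ssr - ssc
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Illustrative forced ranks: rows = 6 scenarios, columns = 22 raters,
# each column a permutation of 1..6 (two alternating orderings).
ranks = np.column_stack([np.roll(np.arange(1, 7), s % 2) for s in range(22)])
icc1, icck = icc_two_way(ranks.astype(float))
```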
My questions are:
- Is it theoretically defensible to compute ICC on forced-ranking data where ranks are strictly ordinal and equidistance between ranks is not guaranteed?
- Does the forced-ranking constraint (each rater must use each rank exactly once) tend to inflate ICC, especially for average measures?
- How should one interpret a situation where Kendall’s W suggests only weak-to-moderate agreement, but the average-measures ICC appears relatively high?
- Would it be more appropriate to rely exclusively on rank-based agreement measures in this context?
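To make the single- vs average-measures gap in my third question concrete: the two are linked by the Spearman–Brown formula, so with k = 22 raters even a modest single-measure ICC is pushed very high. A quick arithmetic check (the .10 input is purely illustrative, not my estimate):

```python
def spearman_brown(icc_single: float, k: int) -> float:
    """Average-measures reliability implied by a single-measure ICC
    for k raters (Spearman-Brown prophecy formula)."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# A single-measure ICC of only .10 already implies ~.71 with 22 raters:
print(round(spearman_brown(0.10, 22), 3))  # -> 0.71
```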
I am mainly interested in whether using ICC here adds meaningful information, or whether it introduces assumptions that make it misleading.