Is ICC appropriate for fully crossed forced-ranking data (6 items, no ties), or should agreement be assessed only with rank-based measures such as Kendall’s W?

I am working with a forced-ranking task in which 22 participants ranked 6 scenarios from easiest to hardest (ranks 1–6, no ties allowed). Each participant ranked all 6 scenarios (fully crossed design).

My goal is to validate a relative difficulty ordering among the scenarios (e.g., whether certain request types are perceived as more difficult than others), not to estimate an interval-scale difficulty measure.

Because the data are ordinal ranks, I computed Kendall's coefficient of concordance (W) as the primary agreement measure. In one condition W is moderate; in the other it is weaker but still statistically significant.
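For concreteness, here is how I compute W for a complete ranking matrix with no ties, directly from the per-item rank sums (a minimal sketch; the layout, one row per participant and one column per scenario, is an assumption about how the data are stored, and the random data below just stands in for the real rankings):

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for complete rankings without ties.

    ranks: (m_raters, n_items) array; each row is a permutation of 1..n_items.
    Returns a value in [0, 1]: 0 = no agreement, 1 = perfect agreement.
    """
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                    # per-item rank sums
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # spread of the rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Example shaped like the design above: 22 raters, 6 scenarios
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(6) + 1 for _ in range(22)])
print(kendalls_w(ranks))
```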

Out of curiosity, and as a possible triangulation check, I also computed the intraclass correlation coefficient (ICC; two-way model, absolute agreement), treating the ranks 1–6 as numeric scores. The results show:

  • relatively low single-measure ICC,
  • but quite high average-measure ICC in one condition (around .90),
  • and moderate average-measure ICC (around .70) in the other.
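For reproducibility, the single- and average-measure figures can be recomputed from the two-way ANOVA mean squares (a minimal sketch using the McGraw & Wong ICC(A,1)/ICC(A,k) absolute-agreement formulas; the function name is mine, not from any particular package):

```python
import numpy as np

def icc_absolute(data: np.ndarray) -> tuple[float, float]:
    """ICC(A,1) and ICC(A,k): two-way model, absolute agreement.

    data: (n_targets, k_raters) array of numeric scores
    (here: 6 scenarios as targets, 22 participants as raters).
    """
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)   # per-target means
    col_means = data.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between targets
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    resid = data - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))        # residual
    icc_a1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_ak = (msr - mse) / (msr + (msc - mse) / n)
    return icc_a1, icc_ak

# Example with simulated forced rankings (transposed so scenarios are rows)
rng = np.random.default_rng(0)
data = np.array([rng.permutation(6) + 1 for _ in range(22)]).T
single, average = icc_absolute(data)
```

One side effect of the forced-ranking constraint is visible here: every rater's mean is fixed at 3.5, so the between-rater mean square is exactly zero, which is part of why I am unsure how to interpret these ICCs.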

My questions are:

  1. Is it theoretically defensible to compute ICC on forced-ranking data where ranks are strictly ordinal and equidistance between ranks is not guaranteed?
  2. Does the forced-ranking constraint (each rater must use each rank exactly once) tend to inflate ICC, especially for average measures?
  3. How should one interpret a situation where Kendall’s W suggests weak/moderate agreement, but ICC (average measures) appears relatively high?
  4. Would it be more appropriate to rely exclusively on rank-based agreement measures in this context?

I am mainly interested in whether using ICC here adds meaningful information, or whether it introduces assumptions that make it misleading.
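For reference, on complete untied rankings W has an exact relationship to the mean pairwise Spearman correlation, r̄ = (mW − 1)/(m − 1), which is part of why I am unsure whether a correlation-type measure like ICC adds anything beyond W. A quick numerical check of that identity on simulated permutations (not the real data):

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
m, n = 22, 6
ranks = np.array([rng.permutation(n) + 1 for _ in range(m)])

# Kendall's W from the per-item rank sums
rank_sums = ranks.sum(axis=0)
s = ((rank_sums - rank_sums.mean()) ** 2).sum()
w = 12 * s / (m ** 2 * (n ** 3 - n))

# Mean Spearman correlation over all rater pairs
rhos = [spearmanr(ranks[i], ranks[j])[0] for i, j in combinations(range(m), 2)]
mean_rho = float(np.mean(rhos))

# Exact identity for complete rankings without ties
assert np.isclose(mean_rho, (m * w - 1) / (m - 1))
```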