How to keep the top n levels of a factor?

von_olaf · June 14, 2019, 6:56pm

Hello there,

I am struggling with something that is perhaps very simple.
Consider this factor:

> factor(c('a','b','c','d','a','b'))
[1] a b c d a b
Levels: a b c d

This factor is already sorted by order of importance.

That is, a is better than b, and so on. I would like to keep the top 2 levels and put the rest in some other category. This is very much like fct_lump, except that the lumping should follow the level order rather than the frequency.

Can I do that with forcats?
Thanks!

Yarnabrina · June 14, 2019, 7:24pm

You may be mistaken about the ordering. The levels are displayed in alphabetical order, but the factor itself is not an ordered factor. Note the difference below:

> a <- factor(x = c('a','b','c','d','a','b'))
> a
[1] a b c d a b
Levels: a b c d
> is.ordered(x = a)
[1] FALSE
> b <- factor(x = c('a','b','c','d','a','b'), ordered = TRUE)
> b
[1] a b c d a b
Levels: a < b < c < d
> is.ordered(x = b)
[1] TRUE
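
As a side note, the level order does not have to be alphabetical. A minimal sketch, assuming a made-up importance ranking (the vector `importance` is hypothetical and not from the question), shows how to set the order explicitly with the `levels` argument of `factor()`:

```r
# Hypothetical importance ranking, most important first (an assumption for illustration)
importance <- c("d", "b", "a", "c")

# Passing `levels` fixes the level order instead of the default alphabetical sort
x <- factor(c("a", "b", "c", "d", "a", "b"), levels = importance)

levels(x)      # levels follow `importance`, not the alphabet
is.ordered(x)  # still FALSE: level order is independent of `ordered = TRUE`
```

So an unordered factor still carries a well-defined level order, and that order is what `levels()` returns.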

I think you're looking for something like this:

set.seed(seed = 33122)
factor_data <- factor(x = sample(x = letters[1:5],
                                 size = 20,
                                 replace = TRUE),
                      ordered = TRUE)
factor_data
#>  [1] d c e e c e b a a b b d d e c e b e c a
#> Levels: a < b < c < d < e

forcats::fct_other(f = factor_data,
                   keep = tail(x = levels(x = factor_data),
                               n = 2))
#>  [1] d     Other e     e     Other e     Other Other Other Other Other
#> [12] d     d     e     Other e     Other e     Other Other
#> Levels: d < e < Other

Created on 2019-06-15 by the reprex package (v0.3.0)
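
If the top levels are the first ones (a better than b, as in the question), the same idea should work with `head()` in place of `tail()`. This is only a sketch: it assumes forcats is installed, and `n` and `top_levels` are illustrative names, not part of any API.

```r
library(forcats)

# The factor from the question; levels default to alphabetical order: a b c d
f <- factor(c("a", "b", "c", "d", "a", "b"))

n <- 2                             # number of leading levels to keep (illustrative)
top_levels <- head(levels(f), n)   # the first n levels, here "a" and "b"

# Everything outside `top_levels` is collapsed into the "Other" level
lumped <- fct_other(f, keep = top_levels)

levels(lumped)
as.character(lumped)
```

Here a and b are kept, while c and d become Other, which `fct_other()` places as the last level. The factor does not need to be ordered for this to work, since only the order returned by `levels()` matters.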

Hope this helps.
