Proper use of fct_lump_min

I am trying to remove data that does not appear often enough to be worth analysing. For that I want to use fct_lump_min (see R: Lump together factor levels into "other"). You tell the function the minimum number of times a value has to appear; any value below that count is relabelled, in this case as "Too Rare", which you can then search for and delete. Unfortunately, R erases everything when there is nothing to label as "Too Rare". In the example below, everything works as intended as long as there is something to label (with n = 3 the bananas are dropped, but the apples stay). If you change the value to n = 2, however, or if you concatenate the data frame with itself a couple of times (which also leaves at least 3 bananas in the data frame), everything is erased. Any idea how to fix this?


#n=3, works as intended

Fruit<-c("Banana", "Apple", "Banana", "Apple", "Apple")
Origin<-c("New Guinea", "China","Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1)     #euros

Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)
#m <- 5
#Fruits<-do.call("rbind", replicate(m, Fruits, simplify = FALSE))
Fruits<-Fruits[-c(which(fct_lump_min(
  Fruits$`Fruit`, 
  3, w = NULL, other_level = "Too Rare") == "Too Rare")),]

#n=2, erases everything
Fruit<-c("Banana", "Apple", "Banana", "Apple", "Apple")
Origin<-c("New Guinea", "China","Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1)     #euros

Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)
#m <- 5
#Fruits<-do.call("rbind", replicate(m, Fruits, simplify = FALSE))
Fruits<-Fruits[-c(which(fct_lump_min(
  Fruits$`Fruit`, 
  2, w = NULL, other_level = "Too Rare") == "Too Rare")),]

#Concatenation with n=3, erases everything

Fruit<-c("Banana", "Apple", "Banana", "Apple", "Apple")
Origin<-c("New Guinea", "China","Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1)     #euros

Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)
m <- 5
Fruits<-do.call("rbind", replicate(m, Fruits, simplify = FALSE))
Fruits<-Fruits[-c(which(fct_lump_min(
  Fruits$`Fruit`, 
  3, w = NULL, other_level = "Too Rare") == "Too Rare")),]
library(tidyverse)

Fruit<-c("Banana", "Apple", "Banana", "Apple", "Apple")
Origin<-c("New Guinea", "China","Germany", "USA", "Germany")
Quality<-c("Good", "Bad", "Good", "Very bad", "Decent")
Value<-c(50,75,80,60,30) #cents
Price<-c(1,2,1,3,1)     #euros

Fruits<-data.frame(Fruit, Origin, Quality, Value, Price)

# When no rows are labelled "Too Rare", which() returns integer(0), an empty vector
which(fct_lump_min(
  Fruits$`Fruit`, 
  2, w = NULL, other_level = "Too Rare") == "Too Rare")
#> integer(0)

# and an index of 0 (or an empty index) selects no rows, negated or not
Fruits[c(0), ]
#> [1] Fruit   Origin  Quality Value   Price  
#> <0 rows> (or 0-length row.names)
Fruits[-c(0), ]
#> [1] Fruit   Origin  Quality Value   Price  
#> <0 rows> (or 0-length row.names)

# This works for n = 3 and n = 2

Fruits |> 
  mutate(Fruit = fct_lump_min(Fruit, 3, other_level = "Too Rare")) |> 
  filter(Fruit != "Too Rare")
#>   Fruit  Origin  Quality Value Price
#> 1 Apple   China      Bad    75     2
#> 2 Apple     USA Very bad    60     3
#> 3 Apple Germany   Decent    30     1

Fruits |> 
  mutate(Fruit = fct_lump_min(Fruit, 2, other_level = "Too Rare")) |> 
  filter(Fruit != "Too Rare")
#>    Fruit     Origin  Quality Value Price
#> 1 Banana New Guinea     Good    50     1
#> 2  Apple      China      Bad    75     2
#> 3 Banana    Germany     Good    80     1
#> 4  Apple        USA Very bad    60     3
#> 5  Apple    Germany   Decent    30     1

Created on 2022-08-07 by the reprex package (v2.0.1)
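
If you would rather stay in base R, the same result can be had by keeping rows with a logical mask instead of dropping them by negative position. The original code breaks because -which(...) turns into -integer(0) when nothing is rare, and that selects no rows at all. A logical comparison does not have this problem, since it always has one TRUE/FALSE per row. The sketch below is only one way to do it; drop_rare is a made-up helper name, not a forcats function, and the data frame is a cut-down version of the one above.

```r
library(forcats)

Fruits <- data.frame(
  Fruit = c("Banana", "Apple", "Banana", "Apple", "Apple"),
  Value = c(50, 75, 80, 60, 30)
)

# Hypothetical helper (not part of forcats): keep only the rows whose
# value in `column` appears at least `min_count` times.
drop_rare <- function(df, column, min_count) {
  lumped <- fct_lump_min(df[[column]], min = min_count, other_level = "Too Rare")
  # Logical mask: one TRUE/FALSE per row, so an all-TRUE mask keeps every row
  df[lumped != "Too Rare", , drop = FALSE]
}

rows_n3 <- nrow(drop_rare(Fruits, "Fruit", 3))  # bananas dropped
rows_n2 <- nrow(drop_rare(Fruits, "Fruit", 2))  # nothing to drop, all rows kept
```

With n = 3 this leaves the three apple rows, and with n = 2 all five rows survive, matching the dplyr output above.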

Thank you very much for your answer. My R does not understand the expression "|>", but after replacing it with "%>%" it does what I want it to.

The native pipe operator |> was introduced in May 2021 as part of R 4.1 (the current version of R is 4.2). Sorry for the confusion.
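
For anyone on an older R, here is a sketch of the same pipeline written with the magrittr pipe %>%, which dplyr re-exports. The getRversion() check is just a quick way to see whether your installation supports |>; the variable names has_native_pipe and kept are my own.

```r
library(dplyr)
library(forcats)

# TRUE if this R installation understands the native pipe |>
has_native_pipe <- getRversion() >= "4.1.0"

Fruits <- data.frame(
  Fruit = c("Banana", "Apple", "Banana", "Apple", "Apple"),
  Value = c(50, 75, 80, 60, 30)
)

# Same steps as in the answer, using %>% so it also runs on R versions before 4.1
kept <- Fruits %>%
  mutate(Fruit = fct_lump_min(Fruit, 2, other_level = "Too Rare")) %>%
  filter(Fruit != "Too Rare")
```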
