How to apply fct_lump_min within group_by

How can I apply fct_lump_min() within a group_by() pipe? I have data that I want to group by two variables (year and group) and count the frequency (n). Within each year, I also want to lump together every group whose count is below 5 (n < 5) into a single "Other" level.

library(tidyverse)

# sample data 
set.seed(1234)
sample_data <- tibble(
  year = sample(seq(2001, 2005, 1), size = 100, replace = TRUE),
  group = sample(c("apple", "berry", "carrot", "fig"), size = 100, replace = TRUE),
  value = 1
)

# idea for what I want to accomplish
sample_data %>%
  group_by(
    year, 
    group = fct_lump_min(
      f = group,
      min = 5
  )) %>%
  count()
#> # A tibble: 20 × 3
#> # Groups:   year, group [20]
#>     year group      n
#>    <dbl> <fct>  <int>
#>  1  2001 apple      5
#>  2  2001 berry      5
#>  3  2001 carrot     4
#>  4  2001 fig        5
#>  5  2002 apple      3
#>  6  2002 berry      2
#>  7  2002 carrot     8
#>  8  2002 fig        6
#>  9  2003 apple      5
#> 10  2003 berry      7
#> 11  2003 carrot     5
#> 12  2003 fig        4
#> 13  2004 apple      1
#> 14  2004 berry      5
#> 15  2004 carrot    10
#> 16  2004 fig        6
#> 17  2005 apple      8
#> 18  2005 berry      3
#> 19  2005 carrot     4
#> 20  2005 fig        4

# target output 
desired_output <- tribble(
  ~"year", ~"group", ~"n",
  2001, "apple", 5,
  2001, "berry", 5,
  2001, "fig", 5,
  2001, "Other", 4,
  2002, "carrot", 8,
  2002, "fig", 6,
  2002, "Other", 5,
  2003, "apple", 5,
  2003, "berry", 7,
  2003, "carrot", 5,
  2003, "Other", 4,
  2004, "berry", 5,
  2004, "carrot", 10,
  2004, "fig", 6,
  2004, "Other", 1,
  2005, "apple", 8,
  2005, "Other", 11,
)

Created on 2023-03-10 with reprex v2.0.2

Maybe something like this. Since fct_lump_min() needs a factor to lump, first build a factor from the group-year combinations, then lump it, then count by it:

sample_data |>
  mutate(grouping_var = factor(paste0(group, "-", year)),
         grouping_var = fct_lump_min(grouping_var, min = 5)) |>
  count(grouping_var)
#> # A tibble: 13 × 2
#>    grouping_var     n
#>    <fct>        <int>
#>  1 apple-2001       5
#>  2 apple-2003       5
#>  3 apple-2005       8
#>  4 berry-2001       5
#>  5 berry-2003       7
#>  6 berry-2004       5
#>  7 carrot-2002      8
#>  8 carrot-2003      5
#>  9 carrot-2004     10
#> 10 fig-2001         5
#> 11 fig-2002         6
#> 12 fig-2004         6
#> 13 Other           25

Created on 2023-03-10 with reprex v2.0.2

A different approach would be to count first, then relabel the rows whose counts are too low as "Other", and finally sum again to get a single count for Other. This might be more efficient for big datasets: in the first approach fct_lump_min() tallies the entire dataset only for count() to redo it, whereas in the second approach the second sum runs on a much smaller, already-aggregated table. On this 100-row sample, though, that is not the case, and the first approach is about twice as fast:

bench::mark(
  factor_then_sum = {
    sample_data |>
      mutate(grouping_var = factor(paste0(group, "-", year)),
             grouping_var = fct_lump_min(grouping_var, min = 5)) |>
      count(grouping_var) %>%
      arrange(n) %>%
      mutate(grouping_var = as.character(grouping_var))
  },
  sum_then_factor_then_resum = {
    sample_data %>%
      group_by(group, year) %>%
      count() %>%
      mutate(grouping_var = if_else(n >= 5,
                                    paste0(group, "-", year),
                                    "Other")) |>
      group_by(grouping_var) %>%
      summarize(n = sum(n)) %>%
      arrange(n)
  }
)
# A tibble: 2 × 13
  expression                    min median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result  
  <bch:expr>                 <bch:> <bch:>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>  
1 factor_then_sum            10.9ms 11.6ms    85.7    37KB    4.29    40     2   467ms <tibble>
2 sum_then_factor_then_resum 20.5ms 21.5ms    46.5  67.5KB    4.43    21     2   452ms <tibble>
# … with 3 more variables: memory <list>, time <list>, gc <list>, and abbreviated variable
#   names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`, ⁴​total_time
# ℹ Use `colnames()` to see all variable names

Okay, that's close enough to work! The first option lumps everything below the threshold into a single "Other" total across all years, but I want an "Other" count for each year. The second option gives the same grand total for "Other", but it only needs a change to the if_else() so that the label keeps the year, followed by a separate() to split the key back into group and year, and then we arrive at the desired result.

Thank you for your time and attention!

sample_data %>%
  group_by(group, year) %>%
  count() %>%
  mutate(grouping_var = if_else(n >= 5,
                                paste0(group, "-", year),
                                paste0("Other-",year))) %>%
  group_by(grouping_var) %>%
  summarize(n = sum(n)) %>%
  ungroup() %>%
  separate(
      col = grouping_var,
      into = c("group", "year"),
      sep = "-"
    )
#> # A tibble: 17 × 3
#>    group  year      n
#>    <chr>  <chr> <int>
#>  1 apple  2001      5
#>  2 apple  2003      5
#>  3 apple  2005      8
#>  4 berry  2001      5
#>  5 berry  2003      7
#>  6 berry  2004      5
#>  7 carrot 2002      8
#>  8 carrot 2003      5
#>  9 carrot 2004     10
#> 10 fig    2001      5
#> 11 fig    2002      6
#> 12 fig    2004      6
#> 13 Other  2001      4
#> 14 Other  2002      5
#> 15 Other  2003      4
#> 16 Other  2004      1
#> 17 Other  2005     11

Created on 2023-03-13 with reprex v2.0.2
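
For anyone landing here later, the per-year lumping can probably also be done with fct_lump_min() itself, by calling it inside a grouped mutate() so that it only sees one year's rows at a time. This is a minimal sketch rather than a tested drop-in: toy_data and its counts are made up for illustration (they are not the sample_data above), and it assumes a dplyr version that combines factors with differing levels across groups.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy data (not the thread's sample_data): known counts per
# year/group, expanded to one row per observation with uncount()
toy_data <- tibble(
  year  = c(2001, 2001, 2001, 2002, 2002),
  group = c("apple", "berry", "carrot", "apple", "berry"),
  n     = c(5, 2, 1, 3, 6)
) %>%
  uncount(n)

# Group by year only, so fct_lump_min() tallies group levels within each year
# and lumps those seen fewer than 5 times in that year into "Other"
lumped <- toy_data %>%
  group_by(year) %>%
  mutate(group = fct_lump_min(group, min = 5)) %>%
  ungroup() %>%
  count(year, group)

lumped
```

With this shape, year keeps its original numeric type and no paste0()/separate() round trip is needed. The order of the factor levels in the result (and so where "Other" sorts within each year) depends on how the per-year factors are combined, so add a fct_relevel() afterwards if "Other" has to come last.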

And thank you for sharing the benchmarking!
