Why can't I choose 2 groups?

juandmaz · November 22, 2023, 10:50pm

I have this dataframe

head(df)

# A tibble: 6 × 3
  factor car   speed
   <int> <chr> <chr>
1     15 blue  low  
2      7 blue  fast 
3     11 blue  low  
4      8 red   fast 
5     10 blue  low  
6      7 red   fast 

dput(df)
structure(list(factor = c(15L, 7L, 11L, 8L, 10L, 7L, 4L, 12L, 
6L, 8L, 10L, 2L, 5L, 6L, 12L, 3L, 10L, 1L, 10L, 8L, 8L, 11L, 
7L, 1L, 3L, 9L, 7L, 6L, 3L, 4L, 13L, 2L, 7L, 10L, 9L, 13L, 6L, 
1L, 7L, 3L, 12L, 1L, 6L, 6L, 4L, 13L, 2L, 3L, 12L, 11L), car = c("blue", 
"blue", "blue", "red", "blue", "red", "red", "red", "red", "blue", 
"blue", "blue", "blue", "red", "blue", "red", "blue", "blue", 
"red", "blue", "blue", "red", "red", "red", "red", "blue", "red", 
"blue", "red", "blue", "blue", "blue", "red", "red", "blue", 
"red", "blue", "blue", "red", "blue", "red", "red", "blue", "blue", 
"red", "red", "red", "red", "red", "red"), speed = c("low", "fast", 
"low", "fast", "low", "fast", "fast", "fast", "low", "fast", 
"fast", "low", "fast", "fast", "fast", "fast", "fast", "fast", 
"low", "low", "low", "fast", "low", "low", "fast", "fast", "fast", 
"low", "fast", "low", "low", "fast", "low", "low", "fast", "low", 
"low", "low", "low", "fast", "low", "fast", "fast", "fast", "low", 
"low", "fast", "low", "low", "low")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -50L))

I am grouping the data frame by factor and then trying to randomly choose 2 of the groups that have exactly 3 rows.
Why doesn't my code work?

df %>%
  group_by(factor) %>%
  filter(n()==3) %>%
  arrange(factor) %>%
  slice_sample(n=2)

# Groups:   factor [4]
  factor car   speed
   <int> <chr> <chr>
1      2 red   fast 
2      2 blue  low  
3      4 red   low  
4      4 red   fast 
5     11 red   low  
6     11 red   fast 
7     13 red   low  
8     13 red   low

I expected only 6 rows and 2 distinct values of factor.

AlexisW · November 22, 2023, 11:30pm

> df %>%
+   group_by(factor) %>%
+   filter(n()==3)
# A tibble: 12 × 3
# Groups:   factor [4]
   factor car   speed
    <int> <chr> <chr>
 1     11 blue  low  
 2      4 red   fast 
 3      2 blue  low  
 4     11 red   fast 
 5      4 blue  low  
 6     13 blue  low  
 7      2 blue  fast 
 8     13 red   low  
 9      4 red   low  
10     13 red   low  
11      2 red   fast 
12     11 red   low

Up to this point the code is correct: you have kept only the rows that belong to a group of exactly 3 rows.

However, the data is still grouped by factor. So when you run slice_sample(n = 2), you are asking for 2 random rows within each group, which gives 4 groups × 2 rows = 8 rows.
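
To see the difference between grouped and ungrouped sampling in isolation, here is a minimal sketch. The `toy` tibble and its `id` column are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical toy data: three groups of three rows each
toy <- tibble(
  factor = rep(c(2L, 4L, 11L), each = 3),
  id     = 1:9
)

# Grouped: slice_sample() draws n rows within every group
per_group <- toy %>%
  group_by(factor) %>%
  slice_sample(n = 2)

nrow(per_group)               # 6 rows: 2 from each of the 3 groups
n_distinct(per_group$factor)  # all 3 groups are still present

# Ungrouped: slice_sample() draws n rows from the whole table
overall <- toy %>%
  slice_sample(n = 2)

nrow(overall)                 # 2 rows in total
```

So the grouping decides what "2" means: 2 rows per group, or 2 rows overall.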

If you want to select entire groups, you have to sample from something where each element is a single factor value, not a single row of the original data. First solution, sample the factor values separately and then filter:

> factors_to_keep <- df %>%
+   group_by(factor) %>%
+   filter(n()==3) %>%
+   pull(factor) %>%
+   unique() %>%
+   sample(2)
> 
> factors_to_keep
[1] 2 4
> 
> df %>%
+   filter(factor %in% factors_to_keep)
# A tibble: 6 × 3
  factor car   speed
   <int> <chr> <chr>
1      4 red   fast 
2      2 blue  low  
3      4 blue  low  
4      2 blue  fast 
5      4 red   low  
6      2 red   fast
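
A possible variant of the same idea stays within dplyr verbs: build an ungrouped table with one row per factor value, sample rows from it, and use semi_join() to keep the matching rows. This is only a sketch; the `toy` data and the names `picked` and `result` are assumptions for illustration.

```r
library(dplyr)

set.seed(1)  # only so the draw is reproducible

# Hypothetical toy data: factors 2, 4 and 11 have 3 rows each, factor 7 has 2
toy <- tibble(
  factor = c(2L, 2L, 2L, 4L, 4L, 4L, 7L, 7L, 11L, 11L, 11L),
  id     = 1:11
)

# One row per factor value with its row count n, then sample 2 of those rows
picked <- toy %>%
  count(factor) %>%       # ungrouped result: columns factor and n
  filter(n == 3) %>%      # keep only factors that have exactly 3 rows
  slice_sample(n = 2)     # sample 2 factor values, not 2 data rows

# Keep only the rows of toy whose factor value was picked
result <- toy %>%
  semi_join(picked, by = "factor")

nrow(result)              # 6 rows: 2 groups of 3
```

Because count() on an ungrouped table returns an ungrouped table, slice_sample() here picks whole factor values.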

Or, if you want to do it at once, you can use nesting (and make sure you do the sampling on an ungrouped tibble):

> df %>%
+   group_by(factor) %>%
+   filter(n()==3) %>%
+   nest() %>%
+   ungroup() %>%
+   slice_sample(n = 2) %>%
+   unnest(data)
# A tibble: 6 × 3
  factor car   speed
   <int> <chr> <chr>
1      4 red   fast 
2      4 blue  low  
3      4 red   low  
4     11 blue  low  
5     11 red   fast 
6     11 red   low

juandmaz · November 23, 2023, 12:48am

Hi, thanks for the answer. What does nest() do? I don't fully understand.

williaml · November 23, 2023, 1:47am

Nested data • tidyr (tidyverse.org)
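
In short, nest() on a grouped data frame collapses each group into a single row and packs the remaining columns into a list-column of tibbles. A small sketch, using made-up `toy` data rather than the original df:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups of three rows each
toy <- tibble(
  factor = c(2L, 2L, 2L, 4L, 4L, 4L),
  car    = c("red", "blue", "blue", "red", "blue", "red")
)

nested <- toy %>%
  group_by(factor) %>%
  nest()           # one row per factor; other columns go into list-column `data`

nrow(nested)       # 2 rows, one per group
names(nested)      # "factor" "data"
nested$data[[1]]   # a 3-row tibble with the car values for factor 2

# unnest() reverses the operation and restores one row per observation
restored <- nested %>%
  ungroup() %>%
  unnest(data)

nrow(restored)     # 6 rows again
```

This is why the nested pipeline works: after nest() and ungroup(), each row stands for a whole group, so slice_sample(n = 2) picks 2 groups, and unnest(data) expands them back into their rows.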
