Splitting into two groups

jcblum · November 22, 2018, 9:10pm

group_by() interpreted those bare values as references to columns, and then created columns to match. Try something like:

by_Group1 <- SampleData_trim %>% 
  mutate(group_id = if_else(SampleID %in% c(454,3,554,202,531,18,681,423), 1, 2)) %>%
  group_by(group_id)

(The code above assumes that your sample ID column is actually called SampleID and that its values are numeric. If it's a column of character values, compare against strings instead: if_else(SampleID %in% c("454","3","554","202","531","18","681","423"), 1, 2).)
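If you're not sure which case applies, one option is to check the column's class and convert the IDs to character before matching. This is only a sketch: `sample_data` and `wanted` are made-up stand-ins, not your actual data.

```r
# Hypothetical stand-in for SampleData_trim, with numeric IDs
sample_data <- data.frame(SampleID = c(454, 3, 999))

# Inspect how the IDs are stored
class(sample_data$SampleID)

# Converting both sides to character makes the comparison type explicit
wanted <- c("454", "3")
in_group <- as.character(sample_data$SampleID) %in% wanted
in_group
```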

To be clear, when I mentioned adding a group designator as a variable, I was talking about the mutate() step. After that, you can use filter() to do something with only the rows that have a certain grouping ID, or (as you tried) you can go ahead and group_by() that variable, so that summary functions will be applied per group.
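As a rough sketch of those two options side by side (the `toy` data frame and its values are invented for illustration, so adjust the names to your own data):

```r
library(dplyr)

# Hypothetical toy data, not the original SampleData_trim
toy <- tibble(
  sample_id = c("57", "183", "126", "187"),
  numeric_var = c(10, 20, 30, 40)
)
group_1 <- c("57", "126")

# Add the group designator as a variable
toy <- mutate(toy, group_id = if_else(sample_id %in% group_1, 1, 2))

# Option 1: keep only the rows with a certain group ID
only_group_1 <- filter(toy, group_id == 1)

# Option 2: group by that variable, then summarise once per group
group_means <- toy %>%
  group_by(group_id) %>%
  summarise(mean_var = mean(numeric_var))
```

Here `only_group_1` holds just the two Group 1 rows, while `group_means` has one row per group.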

One method involving splitting into separate data frames looks like this:

library(tidyverse)

# Construct an example data frame reproducibly
set.seed(42)
dfr <- data.frame(
  sample_id = as.character(sample(1:200, size = 10)),
  numeric_var = rnorm(10, mean = 10, sd = 2),
  categorical_var = factor(sample(LETTERS[1:3], size = 10, replace = TRUE)),
  stringsAsFactors = FALSE
)

dfr
#>    sample_id numeric_var categorical_var
#> 1        183    9.787751               C
#> 2        187   13.023044               C
#> 3         57    9.810682               B
#> 4        164   14.036847               C
#> 5        126    9.874572               A
#> 6        102   12.609739               C
#> 7        143   14.573291               A
#> 8         26    7.222279               A
#> 9        127    9.442422               C
#> 10       135    9.733357               B

# If we already know which samples belong in Group 1
group_1 <- c("57", "126", "143", "135")

# Adding the group as a variable, tidyverse-style
dfr <- mutate(dfr, group_id = if_else(sample_id %in% group_1, 1, 2))

dfr
#>    sample_id numeric_var categorical_var group_id
#> 1        183    9.787751               C        2
#> 2        187   13.023044               C        2
#> 3         57    9.810682               B        1
#> 4        164   14.036847               C        2
#> 5        126    9.874572               A        1
#> 6        102   12.609739               C        2
#> 7        143   14.573291               A        1
#> 8         26    7.222279               A        2
#> 9        127    9.442422               C        2
#> 10       135    9.733357               B        1

# Here's how you'd do it in base R
# dfr$group_id <- ifelse(dfr$sample_id %in% group_1, 1, 2)

Splitting into a list of data frames

# Splitting on `group_id`
split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))
#> $group1
#>    sample_id numeric_var categorical_var group_id
#> 3         57    9.810682               B        1
#> 5        126    9.874572               A        1
#> 7        143   14.573291               A        1
#> 10       135    9.733357               B        1
#> 
#> $group2
#>   sample_id numeric_var categorical_var group_id
#> 1       183    9.787751               C        2
#> 2       187   13.023044               C        2
#> 4       164   14.036847               C        2
#> 6       102   12.609739               C        2
#> 8        26    7.222279               A        2
#> 9       127    9.442422               C        2

# The output of `split()` is a list, with each element containing a data frame.
# You can leave it that way...
dfr_list <- split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))

# ...which makes it easy to apply the same code to every data frame in the list
map_dbl(dfr_list, ~mean(.$numeric_var))
#>   group1   group2 
#> 10.99798 11.02035

# base R version
# unlist(lapply(dfr_list, function(x) { mean(x$numeric_var)}))

# You can access an individual data frame in the list like:
dfr_list$group1
#>    sample_id numeric_var categorical_var group_id
#> 3         57    9.810682               B        1
#> 5        126    9.874572               A        1
#> 7        143   14.573291               A        1
#> 10       135    9.733357               B        1

# Or dfr_list[["group1"]] or dfr_list[[1]]

Created on 2018-11-22 by the reprex package (v0.2.1)
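If you keep the list and later want the per-group results back in a single data frame, one possible approach is to summarise each element and stack the results. This is a sketch under assumptions: `toy`, `toy_list`, and the summary columns are invented names, standing in for `dfr` and `dfr_list` above.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data that already has a group_id column
toy <- data.frame(
  sample_id = c("57", "183", "126", "187"),
  numeric_var = c(10, 20, 30, 40),
  group_id = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)

toy_list <- split(toy, toy$group_id) %>% set_names(c("group1", "group2"))

# Build one summary row per list element
summaries <- map(
  toy_list,
  ~ data.frame(n = nrow(.x), mean_var = mean(.x$numeric_var))
)

# Stack the rows; `.id` records which list element each row came from
combined <- bind_rows(summaries, .id = "group")
combined
```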

Which pattern makes sense for you depends on what you're doing downstream. There are other possibilities, too! A whole lot of related discussion and resources can be found in these topics: