group_by() interpreted those bare values as references to columns, and then created columns to match. Try something like:
by_Group1 <- SampleData_trim %>%
  mutate(group_id = if_else(SampleID %in% c(454, 3, 554, 202, 531, 18, 681, 423), 1, 2)) %>%
  group_by(group_id)
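(The above code assumes your sample ID column is actually called SampleID and that its values are numeric. If it's a column of character values, it's safer to compare against strings: if_else(SampleID %in% c("454","3","554","202","531","18","681","423"), 1, 2). %in% does coerce numbers to strings before matching, but writing the IDs in the column's own type avoids surprises.)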
To be clear, when I mentioned adding a group designator as a variable, I was talking about the mutate() step. After that, you can use filter() to do something with only the rows that have a certain grouping ID, or (as you tried) you can go ahead and group_by() that variable, so that summary functions will be applied per group.
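As a rough sketch of those two options, here is a small self-contained example. The data frame `samples` and its `value` column are made up for illustration (they stand in for SampleData_trim, whose real columns I can't see), so adjust the names to your data.

```r
library(dplyr)

# Hypothetical stand-in for SampleData_trim; `value` is an invented column
samples <- tibble(
  SampleID = c(454, 3, 100, 202, 77, 18),
  value    = c(1.5, 2.5, 10, 3.5, 20, 4.5)
)

# The sample IDs that belong in group 1 (same IDs as in the snippet above)
group_1_ids <- c(454, 3, 554, 202, 531, 18, 681, 423)

# Step 1: add the group designator as a variable
samples <- samples %>%
  mutate(group_id = if_else(SampleID %in% group_1_ids, 1, 2))

# Option A: work with only the rows that have a certain grouping ID
group_1_only <- samples %>%
  filter(group_id == 1)

# Option B: group by the new variable so summaries are computed per group
group_means <- samples %>%
  group_by(group_id) %>%
  summarise(mean_value = mean(value), n = n())

group_1_only
group_means
```

Option A leaves you with an ordinary data frame containing just group 1, while Option B keeps everything in one data frame and returns one summary row per group.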
One method involving splitting into separate data frames looks like this:
library(tidyverse)
# Construct an example data frame reproducibly
set.seed(42)
dfr <- data.frame(
  sample_id = as.character(sample(1:200, size = 10)),
  numeric_var = rnorm(10, mean = 10, sd = 2),
  categorical_var = factor(sample(LETTERS[1:3], size = 10, replace = TRUE)),
  stringsAsFactors = FALSE
)
dfr
#> sample_id numeric_var categorical_var
#> 1 183 9.787751 C
#> 2 187 13.023044 C
#> 3 57 9.810682 B
#> 4 164 14.036847 C
#> 5 126 9.874572 A
#> 6 102 12.609739 C
#> 7 143 14.573291 A
#> 8 26 7.222279 A
#> 9 127 9.442422 C
#> 10 135 9.733357 B
# If we already know which samples belong in Group 1
group_1 <- c("57", "126", "143", "135")
# Adding the group as a variable, tidyverse-style
dfr <- mutate(dfr, group_id = if_else(sample_id %in% group_1, 1, 2))
dfr
#> sample_id numeric_var categorical_var group_id
#> 1 183 9.787751 C 2
#> 2 187 13.023044 C 2
#> 3 57 9.810682 B 1
#> 4 164 14.036847 C 2
#> 5 126 9.874572 A 1
#> 6 102 12.609739 C 2
#> 7 143 14.573291 A 1
#> 8 26 7.222279 A 2
#> 9 127 9.442422 C 2
#> 10 135 9.733357 B 1
# Here's how you'd do it in base R
# dfr$group_id <- ifelse(dfr$sample_id %in% group_1, 1, 2)
Splitting into a list of data frames
# Splitting on `group_id`
split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))
#> $group1
#> sample_id numeric_var categorical_var group_id
#> 3 57 9.810682 B 1
#> 5 126 9.874572 A 1
#> 7 143 14.573291 A 1
#> 10 135 9.733357 B 1
#>
#> $group2
#> sample_id numeric_var categorical_var group_id
#> 1 183 9.787751 C 2
#> 2 187 13.023044 C 2
#> 4 164 14.036847 C 2
#> 6 102 12.609739 C 2
#> 8 26 7.222279 A 2
#> 9 127 9.442422 C 2
# The output of `split()` is a list, with each element containing a data frame.
# You can leave it that way...
dfr_list <- split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))
# ...which makes it easy to apply the same code to every data frame in the list
map_dbl(dfr_list, ~mean(.$numeric_var))
#> group1 group2
#> 10.99798 11.02035
# base R version
# unlist(lapply(dfr_list, function(x) { mean(x$numeric_var)}))
# You can access an individual data frame in the list like:
dfr_list$group1
#> sample_id numeric_var categorical_var group_id
#> 3 57 9.810682 B 1
#> 5 126 9.874572 A 1
#> 7 143 14.573291 A 1
#> 10 135 9.733357 B 1
# Or dfr_list[["group1"]] or dfr_list[[1]]
Created on 2018-11-22 by the reprex package (v0.2.1)
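If you split into a list and later want a single data frame again, one option is to stack the pieces back together. This is only a sketch: the tiny `dfr_list` below is a hypothetical list shaped like the output of split() above, and `group_name` and `mean_numeric` are column names I picked for the example.

```r
library(dplyr)

# Hypothetical two-group list, shaped like the output of split() above
dfr_list <- list(
  group1 = data.frame(
    sample_id = c("57", "126"),
    numeric_var = c(9.8, 9.9),
    stringsAsFactors = FALSE
  ),
  group2 = data.frame(
    sample_id = c("183", "187", "164"),
    numeric_var = c(9.8, 13.0, 14.0),
    stringsAsFactors = FALSE
  )
)

# Stack the list back into one data frame; the list names end up in `group_name`
recombined <- bind_rows(dfr_list, .id = "group_name")

# Per-group summary on the recombined data, no list needed
group_summary <- recombined %>%
  group_by(group_name) %>%
  summarise(mean_numeric = mean(numeric_var))

recombined
group_summary
```

Going back and forth like this is cheap, so you can keep the list while you need per-group objects and recombine once you want a single table.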
Which pattern makes sense for you depends on what you're doing downstream. There are other possibilities, too! A whole lot of related discussion and resources can be found in these topics: