OK, let's imagine a pilot experiment with 5 samples from 5 different people.
Two are in the test arm, and three are in the control arm.
You need to structure your data to look like this
(this is absolutely essential):
library(tidyverse)
dframe <- tribble(
~sample_id, ~clade, ~arm, ~value,
001, "Actinomyces_odontolyticus", "test", 0.47,
002, "Actinomyces_odontolyticus", "control", 1.67,
003, "Actinomyces_odontolyticus", "control", 3.42,
004, "Actinomyces_odontolyticus", "control", 4.21,
005, "Actinomyces_odontolyticus", "test", 0.89,
001, "Bifidobacterium_adolescentis", "test", 7.34,
002, "Bifidobacterium_adolescentis", "control", 1.67,
003, "Bifidobacterium_adolescentis", "control", 3.42,
004, "Bifidobacterium_adolescentis", "control", 2.29,
005, "Bifidobacterium_adolescentis", "test", 5.29,
001, "Bifidobacterium_bifidum", "test", 6.47,
002, "Bifidobacterium_bifidum", "control", 1.67,
003, "Bifidobacterium_bifidum", "control", 1.22,
004, "Bifidobacterium_bifidum", "control", 0.83,
005, "Bifidobacterium_bifidum", "test", 4.83,
001, "Bifidobacterium_longum", "test", 9.47,
002, "Bifidobacterium_longum", "control", 2.67,
003, "Bifidobacterium_longum", "control", 3.22,
004, "Bifidobacterium_longum", "control", 1.83,
005, "Bifidobacterium_longum", "test", 7.83)
Now you have tidy data that you can work with, and the task becomes easy.
You were trying to do something that is very hard.
This is an important concept: you can waste a lot of time trying to do things that are very hard, or you can restructure your data and do things easily (and get lots of help easily).
Having your data in a tidy structure makes your life much easier.
Let's take a look at the dataset.
dframe
#> # A tibble: 20 x 4
#> sample_id clade arm value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 Actinomyces_odontolyticus test 0.47
#> 2 2 Actinomyces_odontolyticus control 1.67
#> 3 3 Actinomyces_odontolyticus control 3.42
#> 4 4 Actinomyces_odontolyticus control 4.21
#> 5 5 Actinomyces_odontolyticus test 0.89
#> 6 1 Bifidobacterium_adolescentis test 7.34
#> 7 2 Bifidobacterium_adolescentis control 1.67
#> 8 3 Bifidobacterium_adolescentis control 3.42
#> 9 4 Bifidobacterium_adolescentis control 2.29
#> 10 5 Bifidobacterium_adolescentis test 5.29
#> 11 1 Bifidobacterium_bifidum test 6.47
#> 12 2 Bifidobacterium_bifidum control 1.67
#> 13 3 Bifidobacterium_bifidum control 1.22
#> 14 4 Bifidobacterium_bifidum control 0.83
#> 15 5 Bifidobacterium_bifidum test 4.83
#> 16 1 Bifidobacterium_longum test 9.47
#> 17 2 Bifidobacterium_longum control 2.67
#> 18 3 Bifidobacterium_longum control 3.22
#> 19 4 Bifidobacterium_longum control 1.83
#> 20 5 Bifidobacterium_longum test 7.83
Or you can use glimpse(dframe):
Rows: 20
Columns: 4
$ sample_id <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4…
$ clade <chr> "Actinomyces_odontolyticus", "Actinomyces_odontolyticus…
$ arm <chr> "test", "control", "control", "control", "test", "test"…
$ value <dbl> 0.47, 1.67, 3.42, 4.21, 0.89, 7.34, 1.67, 3.42, 2.29, 5…
Each row is an observation, each column is a unique variable.
No variables occur twice, and each cell contains one piece of information.
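If you want to confirm the "one row per observation" property rather than eyeball it, here is a minimal sketch of one way to check. It assumes an observation is identified by the combination of sample_id and clade; `toy` and `dupes` are illustrative names, and the rows are a small subset of the dataset.

```r
library(dplyr)
library(tibble)

# Small stand-in with the same columns as dframe
toy <- tribble(
  ~sample_id, ~clade, ~arm, ~value,
  1, "Bifidobacterium_longum", "test", 9.47,
  2, "Bifidobacterium_longum", "control", 2.67,
  1, "Bifidobacterium_bifidum", "test", 6.47,
  2, "Bifidobacterium_bifidum", "control", 1.67
)

# In tidy data, each sample/clade pair should appear exactly once
dupes <- toy %>%
  count(sample_id, clade) %>%
  filter(n > 1)

nrow(dupes)  # zero rows means no duplicated observations
```

The same check run on your real data will list any sample/clade pairs that appear more than once.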
Now we can get this done quickly.
# Summarize value for each clade within each arm
dframe %>%
  group_by(clade, arm) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            count = n())
gives you:
`summarise()` regrouping output by 'clade' (override with `.groups` argument)
A tibble: 8 x 5
Groups: clade [4]
clade arm mean sd count
<chr> <chr> <dbl> <dbl> <int>
1 Actinomyces_odontolyticus control 3.1 1.30 3
2 Actinomyces_odontolyticus test 0.68 0.297 2
3 Bifidobacterium_adolescentis control 2.46 0.887 3
4 Bifidobacterium_adolescentis test 6.32 1.45 2
5 Bifidobacterium_bifidum control 1.24 0.420 3
6 Bifidobacterium_bifidum test 5.65 1.16 2
7 Bifidobacterium_longum control 2.57 0.700 3
8 Bifidobacterium_longum test 8.65 1.16 2
This is quick and fabulous, and it makes it easy to compare test vs control for each clade.
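The `summarise()` message in the output above is informational, not an error. If you would rather get an ungrouped result and no message, one option is to set the `.groups` argument yourself. This is a minimal sketch, assuming dplyr 1.0.0 or later; `toy` and `summary_tbl` are illustrative names, and the values are the Bifidobacterium_longum rows from the dataset.

```r
library(dplyr)
library(tibble)

# Stand-in data: the five Bifidobacterium_longum rows from dframe
toy <- tribble(
  ~clade, ~arm, ~value,
  "Bifidobacterium_longum", "test", 9.47,
  "Bifidobacterium_longum", "control", 2.67,
  "Bifidobacterium_longum", "control", 3.22,
  "Bifidobacterium_longum", "control", 1.83,
  "Bifidobacterium_longum", "test", 7.83
)

summary_tbl <- toy %>%
  group_by(clade, arm) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            count = n(),
            .groups = "drop")  # drop all grouping, so no regrouping message

summary_tbl
```

With `.groups = "drop"` the result is a plain tibble, which avoids surprises if you go on to filter or mutate it.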
If you insist on the particular format you asked for, you would have to add a call to pivot_wider,
which is a bit more complicated, and looks like this:
dframe %>%
  group_by(clade, arm) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            count = n()) %>%
  # id_cols leaves out arm, because arm supplies the new column names
  pivot_wider(id_cols = clade,
              names_from = arm,
              names_sep = "_",
              values_from = mean:count)
and produces this output:
`summarise()` regrouping output by 'clade' (override with `.groups` argument)
A tibble: 4 x 7
Groups: clade [4]
clade mean_control mean_test sd_control sd_test count_control count_test
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Actino… 3.1 0.68 1.30 0.297 3 2
2 Bifido… 2.46 6.32 0.887 1.45 3 2
3 Bifido… 1.24 5.65 0.420 1.16 3 2
4 Bifido… 2.57 8.65 0.700 1.16 3 2
However you learned (or are learning) R, it is rarely emphasized enough how important it is to get your data into a tidy data structure before you start to do any analysis.
I hope this helps.
This may raise questions about how to restructure your data. That will probably require the tidyr package, and particularly the pivot_longer function.
Once you can restructure your data into tidy format, **everything** becomes easier.
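As a rough illustration of what that restructuring can look like, here is a minimal sketch with pivot_longer. The wide layout below (one row per clade, one column per sample) is a hypothetical starting point, not your actual data, and the column names `s001` and `s002` are made up for the example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide layout: one row per clade, one column per sample
wide <- tribble(
  ~clade, ~s001, ~s002,
  "Bifidobacterium_longum", 9.47, 2.67,
  "Bifidobacterium_bifidum", 6.47, 1.67
)

# Stack the sample columns into sample_id/value pairs
long <- wide %>%
  pivot_longer(cols = -clade,
               names_to = "sample_id",
               values_to = "value")

long
```

Every column except clade is gathered into two new columns, so each row of `long` holds one sample/clade observation. An arm column would still need to be added from your sample metadata.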
See separate post below on wrangling your data from the dput() output.