How to calculate mean, standard deviation and number of samples from a data frame

Hi, friends, I have a dataframe like this:

Bacteria test control control control control test
clade_name ERR275252_profile ERR260268_profile ERR260265_profile ERR260264_profile ERR260263_profile ERR260261_profile
Actinomyces_odontolyticus 0 0 0.00341 0 0.03155 0
Bifidobacterium_adolescentis 0.26989 0.21046 0 2.00023 1.24158 0.9991
Bifidobacterium_bifidum 0 0.07668 0 0.33464 0 0
Bifidobacterium_longum 2.46071 1.68152 0 1.76625 1.91239 0
Bifidobacterium_pseudocatenulatum 0 0 0 0 0 0
Collinsella_aerofaciens 3.91749 1.27748 0.68346 3.54635 3.45814 1.35682
Collinsella_intestinalis 0.01625 0 0.00148 0.01948 0.00618 0.00277
Collinsella_stercoris 0.06886 0.01677 0.01178 0.07679 0.08548 0.0159

The first column has the bacteria names and the other columns shows the type of sample: either control or obese. Now, first I want to transform the data into arcsine-square root and then to find out mean, standard deviation, P-value,and number of samples of each of the control and obese group for each bacteria. I am expecting a new dataframe like this:

Bacteria number_control number_test mean_control mean_test sd_control sd_obese P_value
Actinomyces_odontolyticus
Bifidobacterium_adolescentis
Bifidobacterium_bifidum
Bifidobacterium_longum
Bifidobacterium_pseudocatenulatum
Collinsella_aerofaciens
Collinsella_intestinalis
Collinsella_stercoris

Can anyone please help me?

Thanks in advance

1 Like
  1. Is it okay that 2 of your columns are called test and 4 of them are called control?
  2. Could you please explain what the first row of the data is?
  3. Also, if we assume that your dataset is my_data, it's better for you to post the result of dput(my_data), so that we can be able to easily use it to try to help you. The data, as posted, cannot be easily used by a potential helper.
1 Like

Agreeing with gueynono on dput.
But you also need to

  1. read about tidy data structures
    https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
  2. not encode data (test vs control) in headers - this should be a variable (column)
  3. not encode data (clade_name) in rows - make each row (other than headers/variable names row) a single observation on a single sample at one point in time. Clade_name probably needs to be a variable
  4. use pivot_longer to tidy up your dataset.
2 Likes
  1. Actually, they are not column names. But I want them to use as column names. When I used them as column names they arranged like: control, conltrol.1, control.2, test, test.1, test2, etc....
  2. The first row is nothing but the individual names for the individual control and test samples. You can completely ignore this row.
  3. please chexk it:
> dput(my_data)
structure(list(test = c("test", "0", "0.26989", "0", "2.46071", 
"0", "3.91749", "0.01625", "0.06886", "0.00453"), control = c("control", 
"0", "0.21046", "0.07668", "1.68152", "0", "1.27748", "0", "0.01677", 
"0"), control.1 = c("control", "0.00341", "0", "0", "0", "0", 
"0.68346", "0.00148", "0.01178", "0.00527"), control.2 = c("control", 
"0", "2.00023", "0.33464", "1.76625", "0", "3.54635", "0.01948", 
"0.07679", "0.01215"), control.3 = c("control", "0.03155", "1.24158", 
"0", "1.91239", "0", "3.45814", "0.00618", "0.08548", "0.01395"
), test.1 = c("test", "0", "0.9991", "0", "0", "0", "1.35682", 
"0.00277", "0.0159", "0.00097")), row.names = c("type", "Actinomyces_odontolyticus", 
"Bifidobacterium_adolescentis", "Bifidobacterium_bifidum", "Bifidobacterium_longum", 
"Bifidobacterium_pseudocatenulatum", "Collinsella_aerofaciens", 
"Collinsella_intestinalis", "Collinsella_stercoris", "Enorma_massiliensis"
), class = "data.frame")

Thanks for your response @gueyenono

The dput is character data. For the original layout, never use a double header—save that for reports as noted by @gueyenono

Here's a reprex FAQ minimal reproducible example which illustrates the need to rethink the data layout along tidy lines identified by @phiggins

mk_arcsq <- function(x) asin(sqrt(x[, 2:length(x)] / 100))

# OP's dput was character

DF <- structure(list(clade_name = c(
  "Actinomyces_odontolyticus", "Bifidobacterium_adolescentis",
  "Bifidobacterium_bifidum", "Bifidobacterium_longum", "Bifidobacterium_pseudocatenulatum",
  "Collinsella_aerofaciens", "Collinsella_intestinalis", "Collinsella_stercoris"
), ERR275252_profile = c(
  0, 0.26989, 0, 2.46071, 0, 3.91749,
  0.01625, 0.06886
), ERR260268_profile = c(
  0, 0.21046, 0.07668,
  1.68152, 0, 1.27748, 0, 0.01677
), ERR260265_profile = c(
  0.00341,
  0, 0, 0, 0, 0.68346, 0.00148, 0.01178
), ERR260264_profile = c(
  0,
  2.00023, 0.33464, 1.76625, 0, 3.54635, 0.01948, 0.07679
), ERR260263_profile = c(
  0.03155,
  1.24158, 0, 1.91239, 0, 3.45814, 0.00618, 0.08548
), ERR260261_profile = c(
  0,
  0.9991, 0, 0, 0, 1.35682, 0.00277, 0.0159
)), class = c(
  "spec_tbl_df",
  "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), spec = structure(list(
  cols = list(clade_name = structure(list(), class = c(
    "collector_character",
    "collector"
  )), ERR275252_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  )), ERR260268_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  )), ERR260265_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  )), ERR260264_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  )), ERR260263_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  )), ERR260261_profile = structure(list(), class = c(
    "collector_double",
    "collector"
  ))), default = structure(list(), class = c(
    "collector_guess",
    "collector"
  )), skip = 1L
), class = "col_spec"))

# sq
DF[, 2:length(DF)] <- mk_arcsq(DF)

o <- DF[1,]

pander::pander(o)
Table continues below
clade_name ERR275252_profile ERR260268_profile
Actinomyces_odontolyticus 0 0

Table continues below

ERR260265_profile ERR260264_profile ERR260263_profile ERR260261_profile
0.00584 0 0.01776 0

Created on 2021-01-01 by the reprex package (v0.3.0.9001)

The last line illustrates the need to pivot_longer as suggested by @phiggins — if the bacterium is the observation, the corresponding vector of ERR*profile values will not provide informative statistics.

1 Like

If you have control of how your data are arranged, you probably want to end up with 4 columns of data,
and a csv table that would look something like this:

sample_id, bacterium, arm, value,
001, "Actinomyces_odontolyticus", test, 0,
002, "Actinomyces_odontolyticus", control, 0.29689,
003, "Actinomyces_odontolyticus", test, 1.766,
004, "Actinomyces_odontolyticus", control, 2.567
001, "Bifidobacterium_adolescentis", test, 1.765,
002, "Bifidobacterium_adolescentis", control, 4.352,
003, "Bifidobacterium_adolescentis", test, 0.432,
004, "Bifidobacterium_adolescentis", control, 0.011

which would produce a very tall and tidy dataset.
This will help make your calculations much easier

3 Likes

Hi @technocrat @gueyenono @phiggins - I suspect whether I really understood your suggestion. I am sorry. I am still trying to understand to my best. Here I have arranged my dataframe probably in a better way. Please let me know if this is what you want:

data.frame(
                   stringsAsFactors = FALSE,
                          row.names = c("test","control","control.1",
                                        "control.2","control.3","test.1"),
                               type = c("test","control","control",
                                        "control","control","test"),
          Actinomyces_odontolyticus = c("0","0","0.00341","0","0.03155",
                                        "0"),
       Bifidobacterium_adolescentis = c("0.26989","0.21046","0","2.00023",
                                        "1.24158","0.9991"),
            Bifidobacterium_bifidum = c("0","0.07668","0","0.33464","0",
                                        "0"),
             Bifidobacterium_longum = c("2.46071","1.68152","0","1.76625",
                                        "1.91239","0"),
  Bifidobacterium_pseudocatenulatum = c("0", "0", "0", "0", "0", "0"),
            Collinsella_aerofaciens = c("3.91749","1.27748","0.68346",
                                        "3.54635","3.45814","1.35682"),
           Collinsella_intestinalis = c("0.01625","0","0.00148","0.01948",
                                        "0.00618","0.00277"),
              Collinsella_stercoris = c("0.06886","0.01677","0.01178",
                                        "0.07679","0.08548","0.0159"),
                Enorma_massiliensis = c("0.00453","0","0.00527","0.01215",
                                        "0.01395","0.00097")

)

I have generated this with datapasta::df_paste.

Thanks

Close, but the values need to be numeric (given below).

Are the values already arcsine rooted?

Are you looking for summary statistics by row with this data frame?

DF <- structure(list(type = c(
  "test", "control", "control", "control",
  "control", "test"
), Actinomyces_odontolyticus = c(
  0, 0, 0.00341,
  0, 0.03155, 0
), Bifidobacterium_adolescentis = c(
  0.26989, 0.21046,
  0, 2.00023, 1.24158, 0.9991
), Bifidobacterium_bifidum = c(
  0,
  0.07668, 0, 0.33464, 0, 0
), Bifidobacterium_longum = c(
  2.46071,
  1.68152, 0, 1.76625, 1.91239, 0
), Bifidobacterium_pseudocatenulatum = c(
  0,
  0, 0, 0, 0, 0
), Collinsella_aerofaciens = c(
  3.91749, 1.27748,
  0.68346, 3.54635, 3.45814, 1.35682
), Collinsella_intestinalis = c(
  0.01625,
  0, 0.00148, 0.01948, 0.00618, 0.00277
), Collinsella_stercoris = c(
  0.06886,
  0.01677, 0.01178, 0.07679, 0.08548, 0.0159
), Enorma_massiliensis = c(
  0.00453,
  0, 0.00527, 0.01215, 0.01395, 0.00097
)), class = "data.frame", row.names = c(
  "test",
  "control", "control.1", "control.2", "control.3", "test.1"
))
1 Like

Thanks @technocrat for your kind response.

No, the values are not arcsine square rooted.

I need exactly an output like this:

Bacteria	number_control	number_test	mean_control	mean_test	sd_control	sd_obese	P_value
Actinomyces_odontolyticus							
Bifidobacterium_adolescentis							
Bifidobacterium_bifidum							
Bifidobacterium_longum							
Bifidobacterium_pseudocatenulatum							
Collinsella_aerofaciens							
Collinsella_intestinalis							
Collinsella_stercoris

I need summary statistics of each of the column name (Bacteria) with respect to groups (control, test) specified in type column.

Thanks,
DC

I'm still unclear. For this bug, there are the six numeric values. Given those, you want N =6, mu = 0.005826667, sd = 0.01267541? Where does the other output come from? Are you looking for within bug or among bug measures?

             type Actinomyces_odontolyticus
test         test                   0.00000
control   control                   0.00000
control.1 control                   0.00341
control.2 control                   0.00000
control.3 control                   0.03155
test.1       test                   0.00000
1 Like

@technocrat you are calculating one mean , one SD and one N for each Bacteria. But, I need two means, two SDs, two Ns for each bacteria. One mean for control group and one mean for test group. Similarly one SD for control and one SD for test. And, also, one N for control and one N for test.
Finally, P-value will be only one that we can generate using any parametric test.

Bacteria	               N_control	N_test	mean_control	mean_test	sd_control	sd_test	P_value

Actinomyces_odontolyticus							
Bifidobacterium_adolescentis							
Bifidobacterium_bifidum							
Bifidobacterium_longum							
Bifidobacterium_pseudocatenulatum							
Collinsella_aerofaciens							
Collinsella_intestinalis							
Collinsella_stercoris

Please let me know if it is still not clear, I will try to elaborate.

Thanks,
DC7

Questions are harder than answers, especially when trying to frame them for people from outside one's own domain.

The way in which DF is now organized lends itself to questions like

What are the results of computations on vectors like this: 0.00000 0.00000 0.00341 0.00000 0.03155 0.00000, representing a single species?

and not

What are the results of computations on each element of that vector columnwise, aggregating observations from each species?

That's what @phiggins was getting at with pivot_longer.

2 Likes

OK, let's imagine a pilot experiment with 5 samples from 5 different people.
2 are in the test arm, and 3 in the control arm.

You need to structure your data to look like this
(this is absolutely essential):

library(tidyverse)

dframe <- tribble(
  ~sample_id, ~clade, ~arm, ~value,
  001, "Actinomyces_odontolyticus", "test", 0.47,
  002, "Actinomyces_odontolyticus", "control", 1.67,
  003, "Actinomyces_odontolyticus", "control", 3.42,
  004, "Actinomyces_odontolyticus", "control", 4.21,
  005, "Actinomyces_odontolyticus", "test", 0.89,
  001, "Bifidobacterium_adolescentis", "test", 7.34,
  002, "Bifidobacterium_adolescentis", "control", 1.67,
  003, "Bifidobacterium_adolescentis", "control", 3.42,
  004, "Bifidobacterium_adolescentis", "control", 2.29,
  005, "Bifidobacterium_adolescentis", "test", 5.29,
  001, "Bifidobacterium_bifidum", "test", 6.47,
  002, "Bifidobacterium_bifidum", "control", 1.67,
  003, "Bifidobacterium_bifidum", "control", 1.22,
  004, "Bifidobacterium_bifidum", "control", 0.83,
  005, "Bifidobacterium_bifidum", "test", 4.83,
  001, "Bifidobacterium_longum", "test", 9.47,
  002, "Bifidobacterium_longum", "control", 2.67,
  003, "Bifidobacterium_longum", "control", 3.22,
  004, "Bifidobacterium_longum", "control", 1.83,
  005, "Bifidobacterium_longum", "test", 7.83)

Now you have tidy data that you can work with.

Now this becomes easy.
You were trying to do something that is very hard.
It is an important concept - you can waste a lot of time trying to do things that are very hard, or you can restructure your data and do things easily (and get lots of help easily).
Having your data in a tidy structure makes your life much easier.

Now that your data are tidy, let's take a look at the dataset.

dframe
#> # A tibble: 20 x 4
#>    sample_id clade                        arm     value
#>        <dbl> <chr>                        <chr>   <dbl>
#>  1         1 Actinomyces_odontolyticus    test     0.47
#>  2         2 Actinomyces_odontolyticus    control  1.67
#>  3         3 Actinomyces_odontolyticus    control  3.42
#>  4         4 Actinomyces_odontolyticus    control  4.21
#>  5         5 Actinomyces_odontolyticus    test     0.89
#>  6         1 Bifidobacterium_adolescentis test     7.34
#>  7         2 Bifidobacterium_adolescentis control  1.67
#>  8         3 Bifidobacterium_adolescentis control  3.42
#>  9         4 Bifidobacterium_adolescentis control  2.29
#> 10         5 Bifidobacterium_adolescentis test     5.29
#> 11         1 Bifidobacterium_bifidum      test     6.47
#> 12         2 Bifidobacterium_bifidum      control  1.67
#> 13         3 Bifidobacterium_bifidum      control  1.22
#> 14         4 Bifidobacterium_bifidum      control  0.83
#> 15         5 Bifidobacterium_bifidum      test     4.83
#> 16         1 Bifidobacterium_longum       test     9.47
#> 17         2 Bifidobacterium_longum       control  2.67
#> 18         3 Bifidobacterium_longum       control  3.22
#> 19         4 Bifidobacterium_longum       control  1.83
#> 20         5 Bifidobacterium_longum       test     7.83

Or you can use glimpse

Rows: 20
Columns: 4
$ sample_id <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4…
$ clade     <chr> "Actinomyces_odontolyticus", "Actinomyces_odontolyticus…
$ arm       <chr> "test", "control", "control", "control", "test", "test"…
$ value     <dbl> 0.47, 1.67, 3.42, 4.21, 0.89, 7.34, 1.67, 3.42, 2.29, 5…

Each row is an observation, each column is a unique variable.
No variables occur twice, and each cell contains one piece of information.

Now we can get this done quickly.

  group_by(clade, arm) %>% 
  summarize(mean = mean(value),
            sd = sd(value),
            count = n()) 

gives you:

summarise() regrouping output by 'clade' (override with `.groups` argument)
 A tibble: 8 x 5
Groups:   clade [4]
  clade                        arm      mean    sd count
  <chr>                        <chr>   <dbl> <dbl> <int>
1 Actinomyces_odontolyticus    control  3.1  1.30      3
2 Actinomyces_odontolyticus    test     0.68 0.297     2
3 Bifidobacterium_adolescentis control  2.46 0.887     3
4 Bifidobacterium_adolescentis test     6.32 1.45      2
5 Bifidobacterium_bifidum      control  1.24 0.420     3
6 Bifidobacterium_bifidum      test     5.65 1.16      2
7 Bifidobacterium_longum       control  2.57 0.700     3
8 Bifidobacterium_longum       test     8.65 1.16      2

Which is quick and fabulous, and easy to compare test vs control for each clade.

If you insist on the particular format you asked for, you would have to add a pivot_wider function,
which is a bit more complicated, but looks like

dframe %>% 
  group_by(clade, arm) %>% 
  summarize(mean = mean(value),
            sd = sd(value),
            count = n()) %>% 
  pivot_wider(id_cols = c(clade, arm),
              names_from = arm, 
              names_sep = "_",
              values_from = mean:count)

and produces this output:

`summarise()` regrouping output by 'clade' (override with `.groups` argument)
 A tibble: 4 x 7
 Groups:   clade [4]

  clade   mean_control mean_test sd_control sd_test count_control count_test
  <chr>          <dbl>     <dbl>      <dbl>   <dbl>         <int>      <int>
1 Actino…         3.1       0.68      1.30    0.297             3          2
2 Bifido…         2.46      6.32      0.887   1.45              3          2
3 Bifido…         1.24      5.65      0.420   1.16              3          2
4 Bifido…         2.57      8.65      0.700   1.16              3          2

However you learned (or are learning) R, it is rarely emphasized enough how important it is to get your data into a tidy data structure before you start to do any analysis.

I hope this helps.

This may create questions about how to restructure your data - this will probably require use of the tidyr package and particularly the pivot_longer function.
Once you can restructure your data into tidy format, **everything ** becomes easier.

See separate post below on wrangling your data from the dput() output.

3 Likes

On the separate issue of how to clean up and restructure your data, as produced by the dput statement,
This approach should work.
But if you can better control your input data (avoiding columns with the same name, putting them in a tidy structure in the first place, you will be better off.

library(tidyverse)
# assign dput statement to the object df2
df2 <- structure(list(test = c("test", "0", "0.26989", "0", "2.46071", 
                        "0", "3.91749", "0.01625", "0.06886", "0.00453"), control = c("control", 
                                                                                      "0", "0.21046", "0.07668", "1.68152", "0", "1.27748", "0", "0.01677", 
                                                                                      "0"), control.1 = c("control", "0.00341", "0", "0", "0", "0", 
                                                                                                          "0.68346", "0.00148", "0.01178", "0.00527"), control.2 = c("control", 
                                                                                                                                                                     "0", "2.00023", "0.33464", "1.76625", "0", "3.54635", "0.01948", 
                                                                                                                                                                     "0.07679", "0.01215"), control.3 = c("control", "0.03155", "1.24158", 
                                                                                                                                                                                                          "0", "1.91239", "0", "3.45814", "0.00618", "0.08548", "0.01395"
                                                                                                                                                                     ), test.1 = c("test", "0", "0.9991", "0", "0", "0", "1.35682", 
                                                                                                                                                                                   "0.00277", "0.0159", "0.00097")), row.names = c("type", "Actinomyces_odontolyticus", 
                                                                                                                                                                                                                                   "Bifidobacterium_adolescentis", "Bifidobacterium_bifidum", "Bifidobacterium_longum", 
                                                                                                                                                                                                                                   "Bifidobacterium_pseudocatenulatum", "Collinsella_aerofaciens", 
                                                                                                                                                                                                                                   "Collinsella_intestinalis", "Collinsella_stercoris", "Enorma_massiliensis"
                                                                                                                                                                                   ), class = "data.frame")

# now restructure the data
df2 %>% 
  slice(2:10) %>% # cut out the extra header row in row 1
  rownames_to_column(var = "clade") %>% # make clade a variable instead of a rowname
  pivot_longer(cols = test:test.1, # pivot longer to clean up
               names_to = "arm", values_to = "value") %>% 
  mutate(sample_id = rep(1:6, 9)) %>% # add sample ids, assumes 6 samples, 9 clades
  mutate(arm = case_when(str_detect(arm, "test") ~ "test", # fix the test vs test.1
                         str_detect(arm, "control") ~ "control")) %>%  # fix the multiple names for control.2
  relocate(sample_id, clade) # rearrange variable names

which produces this result:

 A tibble: 54 x 4
   sample_id clade                        arm     value  
       <int> <chr>                        <chr>   <chr>  
 1         1 Actinomyces_odontolyticus    test    0      
 2         2 Actinomyces_odontolyticus    control 0      
 3         3 Actinomyces_odontolyticus    control 0.00341
 4         4 Actinomyces_odontolyticus    control 0      
 5         5 Actinomyces_odontolyticus    control 0.03155
 6         6 Actinomyces_odontolyticus    test    0      
 7         1 Bifidobacterium_adolescentis test    0.26989
 8         2 Bifidobacterium_adolescentis control 0.21046
 9         3 Bifidobacterium_adolescentis control 0      
10         4 Bifidobacterium_adolescentis control 2.00023
# … with 44 more rows
2 Likes

Note that this is still not quite ready for analysis (though it is tidy), because the value column was entered as character.
It needs to be numeric (try mutate and as.numeric) to do the analysis pipeline above.

And (apparently?) you may need to do an arcsin-root transformation on the values, but that can be done with a mutate step.

But if the data are coming from a NextGenSequencing machine, they will come out in the same format every time. And if you can figure out a data wrangling sequence of steps to make these data tidy, you can use the same code every time (and maybe turn it into a function). It will work great (until you switch to a new machine with a new format).

If microbiome analysis is your thing, take a look at the coding modules at Riffomonas.org
http://www.riffomonas.org/code_club/setup-instructions
They are very helpful, and have many microbiome-analysis-specific examples.

Now putting the whole thing together from datapasta.
The data wrangling is easier from the datapasta result. Just assign this to dframe as below.

library(tidyverse)

dframe <- data.frame(
  stringsAsFactors = FALSE,
  row.names = c("test","control","control.1",
                "control.2","control.3","test.1"),
  type = c("test","control","control",
           "control","control","test"),
  Actinomyces_odontolyticus = c("0","0","0.00341","0","0.03155",
                                "0"),
  Bifidobacterium_adolescentis = c("0.26989","0.21046","0","2.00023",
                                   "1.24158","0.9991"),
  Bifidobacterium_bifidum = c("0","0.07668","0","0.33464","0",
                              "0"),
  Bifidobacterium_longum = c("2.46071","1.68152","0","1.76625",
                             "1.91239","0"),
  Bifidobacterium_pseudocatenulatum = c("0", "0", "0", "0", "0", "0"),
  Collinsella_aerofaciens = c("3.91749","1.27748","0.68346",
                              "3.54635","3.45814","1.35682"),
  Collinsella_intestinalis = c("0.01625","0","0.00148","0.01948",
                               "0.00618","0.00277"),
  Collinsella_stercoris = c("0.06886","0.01677","0.01178",
                            "0.07679","0.08548","0.0159"),
  Enorma_massiliensis = c("0.00453","0","0.00527","0.01215",
                          "0.01395","0.00097"))
 
# now some data wrangling to make this tidy
dframe %>% 
  pivot_longer(cols = Actinomyces_odontolyticus:Enorma_massiliensis,
               names_to = "clade") %>% 
  mutate(sample_id = c(rep(1,9), rep(2,9), rep(3,9),  # add sample_id
                       rep(4,9), rep(5,9), rep(6,9))) %>% 
  rename(arm = type) %>%      # correct arm name
  mutate(value = as.numeric(value)) %>%   # make numeric
  relocate(sample_id, clade, arm, value) -> # reorganize variable order
dframe_long

# now summary
dframe_long %>% 
  group_by(clade, arm) %>% 
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value),
            count = n()) %>% 
  pivot_wider(id_cols = c(clade, arm),
              names_from = arm, 
              names_sep = "_",
              values_from = mean:count) 
#> `summarise()` regrouping output by 'clade' (override with `.groups` argument)
#> # A tibble: 9 x 7
#> # Groups:   clade [9]
#>   clade       mean_control mean_test sd_control sd_test count_control count_test
#>   <chr>              <dbl>     <dbl>      <dbl>   <dbl>         <int>      <int>
#> 1 Actinomyce…      0.00874   0          0.0153  0                   4          2
#> 2 Bifidobact…      0.863     0.634      0.932   0.516               4          2
#> 3 Bifidobact…      0.103     0          0.159   0                   4          2
#> 4 Bifidobact…      1.34      1.23       0.898   1.74                4          2
#> 5 Bifidobact…      0         0          0       0                   4          2
#> 6 Collinsell…      2.24      2.64       1.48    1.81                4          2
#> 7 Collinsell…      0.00678   0.00951    0.00886 0.00953             4          2
#> 8 Collinsell…      0.0477    0.0424     0.0388  0.0374              4          2
#> 9 Enorma_mas…      0.00784   0.00275    0.00643 0.00252             4          2

Created on 2021-01-02 by the reprex package (v0.3.0)

2 Likes

Thanks a lot @phiggins for your immensely helpful answers. I will rather call it a tutorial which is really useful for novice R-learner like me. I am really indebted to you. At the same time I will also like to thank @technocrat for his/her helpful posts.
Anyway, I have made very few changes to the code I have marked as solution as my table is big enough. Will you like to see my original table and check if my code is correct or still some mistakes exist?
This is my original input table that is to be imported in R:

please click this link

This is my final code:

library(dplyr)
library(stringr)
library(tidyverse)
library(tibble)

my_data <- read.table("/mnt/data/downloads/merge_abundance_species_5.tsv")
rownames(my_data) <- my_data[,1]
colnames(my_data) <- my_data[1,]
colnames(my_data) <- make.unique(as.character(my_data[1,]))
my_data <- my_data[-c(1,2), -c(1)]

mk_arcsq <- function(x) asin(sqrt(x[, 4] / 100))


Many many thanks,
DC
# now restructure the data
dframe <- my_data %>% 
  #slice(2:length(rownames(my_data))) %>% # cut out the extra header row in row 1
  rownames_to_column(var = "clade") %>% # make clade a variable instead of a rowname
  pivot_longer(cols = !clade, # pivot longer to clean up
               names_to = "arm", values_to = "value")%>% 
  mutate(sample_id = rep(1:length(my_data), length(rownames(my_data)))) %>% # add sample ids, assumes 52 samples, 218 clades
  mutate(value = as.numeric(value)) %>%   # make numeric
  mutate(arm = case_when(str_detect(arm, "test") ~ "test", # fix the test vs test.1
                         str_detect(arm, "control") ~ "control")) %>%  # fix the multiple names for control.2
  relocate(sample_id, clade) # rearrange variable names

dframe[,4] <- mk_arcsq(dframe)


output_2 <- dframe %>% 
  group_by(clade, arm) %>% 
  summarize(mean = mean(value),
            sd = sd(value),
            count = n()) %>% 
  pivot_wider(id_cols = c(clade, arm),
              names_from = arm, 
              names_sep = "_",
              values_from = mean:count)
1 Like

I think I understand now - there are 52 samples.
Is the 2nd row essentially sample_identifiers? (unique id for each sample)?

2 Likes

Yes @phiggins. You understood correctly.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.