Hi,
I have a question on adding counts of the same sample or column in the dataframe. I have a very large dataset and would like to add counts of the same sample. The dataframe has several duplicate samples (for instance sample_1, sample_3 etc) and unique sample (sample_2, sample_5 etc). For unique samples the data would be retained as it is without any calculations, however, I would like to take sum of each duplicate samples/columns in the dataframe and rename that sample to the original sample name. The samples are on columns and features or entities on rows.
One way is to just transpose the dataframe and calculate the sum of the duplicate samples, and then re-transpose back. But I am curious to know if there is a way to sum by duplicate column names instead of transposing.
I have included the some example inout and expected data below:
dput(Input)
structure(list(Geneid = c("ENSG00000223972", "ENSG00000227232",
"ENSG00000278267", "ENSG00000243485", "ENSG00000237613", "ENSG00000268020",
"ENSG00000240361", "ENSG00000186092"), Sample_1 = c(0L, 22L,
1L, 0L, 0L, 4L, 0L, 6L), Sample_1 = c(1L, 27L, 2L, 0L, 0L, 2L,
0L, 1L), Sample_2 = c(0L, 0L, 1L, 0L, 44L, 4L, 4L, 11L), Sample_3 = c(1L,
222L, 112L, 10L, 11L, 4L, 1L, 6L), Sample_3 = c(0L, 22L, 1L,
0L, 42L, 76L, 66L, 6L), Sample_4 = c(440L, 440L, 440L, 0L, 42L,
0L, 0L, 6L), Sample_4 = c(840L, 0L, 10L, 0L, 42L, 0L, 0L, 6L),
Sample_5 = c(0L, 0L, 10L, 0L, 342L, 0L, 40L, 6L)), class = "data.frame", row.names = c(NA,
-8L))
#> Geneid Sample_1 Sample_1 Sample_2 Sample_3 Sample_3 Sample_4
#> 1 ENSG00000223972 0 1 0 1 0 440
#> 2 ENSG00000227232 22 27 0 222 22 440
#> 3 ENSG00000278267 1 2 1 112 1 440
#> 4 ENSG00000243485 0 0 0 10 0 0
#> 5 ENSG00000237613 0 0 44 11 42 42
#> 6 ENSG00000268020 4 2 4 4 76 0
#> 7 ENSG00000240361 0 0 4 1 66 0
#> 8 ENSG00000186092 6 1 11 6 6 6
#> Sample_4 Sample_5
#> 1 840 0
#> 2 0 0
#> 3 10 10
#> 4 0 0
#> 5 42 342
#> 6 0 0
#> 7 0 40
#> 8 6 6
dput(Expected_Output)
structure(list(Geneid = c("ENSG00000223972", "ENSG00000227232",
"ENSG00000278267", "ENSG00000243485", "ENSG00000237613", "ENSG00000268020",
"ENSG00000240361", "ENSG00000186092"), Sample_1 = c(1L, 49L,
3L, 0L, 0L, 6L, 0L, 7L), Sample_2 = c(0L, 0L, 1L, 0L, 44L, 4L,
4L, 11L), Sample_3 = c(1L, 244L, 113L, 10L, 53L, 80L, 67L, 12L
), Sample_4 = c(1280L, 440L, 450L, 0L, 84L, 0L, 0L, 12L), Sample_5 = c(0L,
0L, 10L, 0L, 342L, 0L, 40L, 6L)), class = "data.frame", row.names = c(NA,
-8L))
#> Geneid Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1 ENSG00000223972 1 0 1 1280 0
#> 2 ENSG00000227232 49 0 244 440 0
#> 3 ENSG00000278267 3 1 113 450 10
#> 4 ENSG00000243485 0 0 10 0 0
#> 5 ENSG00000237613 0 44 53 84 342
#> 6 ENSG00000268020 6 4 80 0 0
#> 7 ENSG00000240361 0 4 67 0 40
#> 8 ENSG00000186092 7 11 12 12 6
Created on 2021-09-18 by the reprex package (v2.0.1)
Thank you,
Toufiq