How to take sum of duplicate columns in dataframe in R?

mtoufiq · September 18, 2021, 8:19pm

Hi,

I have a question on adding counts of the same sample or column in the dataframe. I have a very large dataset and would like to add counts of the same sample. The dataframe has several duplicate samples (for instance sample_1, sample_3 etc) and unique sample (sample_2, sample_5 etc). For unique samples the data would be retained as it is without any calculations, however, I would like to take sum of each duplicate samples/columns in the dataframe and rename that sample to the original sample name. The samples are on columns and features or entities on rows.

One way is to just transpose the dataframe and calculate the sum of the duplicate samples, and then re-transpose back. But I am curious to know if there is a way to sum by duplicate column names instead of transposing.

I have included the some example inout and expected data below:

dput(Input)

structure(list(Geneid = c("ENSG00000223972", "ENSG00000227232", 
                          "ENSG00000278267", "ENSG00000243485", "ENSG00000237613", "ENSG00000268020", 
                          "ENSG00000240361", "ENSG00000186092"), Sample_1 = c(0L, 22L, 
                                                                              1L, 0L, 0L, 4L, 0L, 6L), Sample_1 = c(1L, 27L, 2L, 0L, 0L, 2L, 
                                                                                                                    0L, 1L), Sample_2 = c(0L, 0L, 1L, 0L, 44L, 4L, 4L, 11L), Sample_3 = c(1L, 
                                                                                                                                                                                          222L, 112L, 10L, 11L, 4L, 1L, 6L), Sample_3 = c(0L, 22L, 1L, 
                                                                                                                                                                                                                                          0L, 42L, 76L, 66L, 6L), Sample_4 = c(440L, 440L, 440L, 0L, 42L, 
                                                                                                                                                                                                                                                                               0L, 0L, 6L), Sample_4 = c(840L, 0L, 10L, 0L, 42L, 0L, 0L, 6L), 
               Sample_5 = c(0L, 0L, 10L, 0L, 342L, 0L, 40L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                      -8L))
#>            Geneid Sample_1 Sample_1 Sample_2 Sample_3 Sample_3 Sample_4
#> 1 ENSG00000223972        0        1        0        1        0      440
#> 2 ENSG00000227232       22       27        0      222       22      440
#> 3 ENSG00000278267        1        2        1      112        1      440
#> 4 ENSG00000243485        0        0        0       10        0        0
#> 5 ENSG00000237613        0        0       44       11       42       42
#> 6 ENSG00000268020        4        2        4        4       76        0
#> 7 ENSG00000240361        0        0        4        1       66        0
#> 8 ENSG00000186092        6        1       11        6        6        6
#>   Sample_4 Sample_5
#> 1      840        0
#> 2        0        0
#> 3       10       10
#> 4        0        0
#> 5       42      342
#> 6        0        0
#> 7        0       40
#> 8        6        6


dput(Expected_Output)

structure(list(Geneid = c("ENSG00000223972", "ENSG00000227232", 
                          "ENSG00000278267", "ENSG00000243485", "ENSG00000237613", "ENSG00000268020", 
                          "ENSG00000240361", "ENSG00000186092"), Sample_1 = c(1L, 49L, 
                                                                              3L, 0L, 0L, 6L, 0L, 7L), Sample_2 = c(0L, 0L, 1L, 0L, 44L, 4L, 
                                                                                                                    4L, 11L), Sample_3 = c(1L, 244L, 113L, 10L, 53L, 80L, 67L, 12L
                                                                                                                    ), Sample_4 = c(1280L, 440L, 450L, 0L, 84L, 0L, 0L, 12L), Sample_5 = c(0L, 
                                                                                                                                                                                           0L, 10L, 0L, 342L, 0L, 40L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                 -8L))
#>            Geneid Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1 ENSG00000223972        1        0        1     1280        0
#> 2 ENSG00000227232       49        0      244      440        0
#> 3 ENSG00000278267        3        1      113      450       10
#> 4 ENSG00000243485        0        0       10        0        0
#> 5 ENSG00000237613        0       44       53       84      342
#> 6 ENSG00000268020        6        4       80        0        0
#> 7 ENSG00000240361        0        4       67        0       40
#> 8 ENSG00000186092        7       11       12       12        6

^{Created on 2021-09-18 by the reprex package (v2.0.1)}

Thank you,
Toufiq

FJCC · September 18, 2021, 8:41pm

DF <- structure(list(Geneid = c("ENSG00000223972", "ENSG00000227232", 
                          "ENSG00000278267", "ENSG00000243485", "ENSG00000237613", "ENSG00000268020", 
                          "ENSG00000240361", "ENSG00000186092"), Sample_1 = c(0L, 22L, 
                                                                              1L, 0L, 0L, 4L, 0L, 6L), Sample_1 = c(1L, 27L, 2L, 0L, 0L, 2L, 
                                                                                                                    0L, 1L), Sample_2 = c(0L, 0L, 1L, 0L, 44L, 4L, 4L, 11L), Sample_3 = c(1L, 
                                                                                                                                                                                          222L, 112L, 10L, 11L, 4L, 1L, 6L), Sample_3 = c(0L, 22L, 1L, 
                                                                                                                                                                                                                                          0L, 42L, 76L, 66L, 6L), Sample_4 = c(440L, 440L, 440L, 0L, 42L, 
                                                                                                                                                                                                                                                                               0L, 0L, 6L), Sample_4 = c(840L, 0L, 10L, 0L, 42L, 0L, 0L, 6L), 
               Sample_5 = c(0L, 0L, 10L, 0L, 342L, 0L, 40L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                      -8L))
library(dplyr)
library(tidyr)
DFlong <- DF %>% pivot_longer(cols = -Geneid,names_to="Sample",values_to="value")
DFsum <- DFlong %>% group_by(Geneid,Sample) %>% summarize(Total=sum(value))
#> `summarise()` regrouping output by 'Geneid' (override with `.groups` argument)
DFsum
#> # A tibble: 40 x 3
#> # Groups:   Geneid [8]
#>    Geneid          Sample   Total
#>    <chr>           <chr>    <int>
#>  1 ENSG00000186092 Sample_1     7
#>  2 ENSG00000186092 Sample_2    11
#>  3 ENSG00000186092 Sample_3    12
#>  4 ENSG00000186092 Sample_4    12
#>  5 ENSG00000186092 Sample_5     6
#>  6 ENSG00000223972 Sample_1     1
#>  7 ENSG00000223972 Sample_2     0
#>  8 ENSG00000223972 Sample_3     1
#>  9 ENSG00000223972 Sample_4  1280
#> 10 ENSG00000223972 Sample_5     0
#> # ... with 30 more rows
DFfinal <- DFsum %>% pivot_wider(names_from = "Sample",values_from="Total")
DFfinal
#> # A tibble: 8 x 6
#> # Groups:   Geneid [8]
#>   Geneid          Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#>   <chr>              <int>    <int>    <int>    <int>    <int>
#> 1 ENSG00000186092        7       11       12       12        6
#> 2 ENSG00000223972        1        0        1     1280        0
#> 3 ENSG00000227232       49        0      244      440        0
#> 4 ENSG00000237613        0       44       53       84      342
#> 5 ENSG00000240361        0        4       67        0       40
#> 6 ENSG00000243485        0        0       10        0        0
#> 7 ENSG00000268020        6        4       80        0        0
#> 8 ENSG00000278267        3        1      113      450       10

^{Created on 2021-09-18 by the reprex package (v0.3.0)}

mtoufiq · September 19, 2021, 6:11am

@FJCC , thank you very much. This is very helpful and resolved the issue.

system · September 26, 2021, 6:11am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.