two ways of calculating the average and two different results ...

Alex1193 · July 11, 2022, 3:10pm

Hello,

I have a methylation table with patients in rows and genes in columns.
I want to calculate the average methylation of each gene for each group:

tapply(dfX$gene1, dfX$status, mean)
group1 group2
0.06449247 0.06124757

In the second method, I use aggregate() to calculate the mean for all genes:

resume <- aggregate(data=dfX, .~status, mean)
resume$gene1
[1] 0.05448438 0.05161707

(for group 1 and group 2, respectively)

As you can see, the two methods give different averages!

I checked, and the first method gives the right result.
Could you explain why aggregate() doesn't give the correct result, or whether I made a mistake in how I used it?

Thanks in advance,
Alex

dvetsch75 · July 11, 2022, 4:17pm

I'm having a tough time reproducing your issue. Can you maybe post a sample of your data and we can try with that?

df <- data.frame(
    'a' = c(runif(500), runif(500, min = 1, max = 2)),
    'status' = c(rep(0, 500), rep(1, 500))
)

tapply(
    df$a,
    df$status,
    mean
)
#>        0        1 
#> 0.508250 1.496064

aggregate(
    data = df,
    . ~ status,
    mean
)
#>   status        a
#> 1      0 0.508250
#> 2      1 1.496064

Created on 2022-07-11 by the reprex package (v1.0.0)

EconProf · July 12, 2022, 1:45am

I remember something about aggregate() when there are NAs, like dropping an entire row if any of the values in the row are NA. Of course, at 66 my memory is not perfect. What happens when you do

aggregate(data = dfX, gene1 ~ status, mean)

instead of including all variables with . ~ status and then selecting gene1?
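To make the NA behaviour concrete, here is a minimal sketch on made-up data (the `toy` table and its values are hypothetical, not the poster's data). It assumes the cause is the default `na.action = na.omit` of the formula method of aggregate(), which removes any row with an NA in any variable used by the formula before the means are computed:

```r
# Hypothetical toy data: gene1 is complete, gene2 has two missing values
toy <- data.frame(
    status = rep(c("group1", "group2"), each = 4),
    gene1  = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80),
    gene2  = c(NA, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, NA)
)

# tapply() only looks at gene1, so all 8 rows are used
by_tapply <- tapply(toy$gene1, toy$status, mean)

# With . ~ status, rows 1 and 8 (NA in gene2) are removed first,
# so the gene1 means are computed on 6 rows instead of 8
by_all <- aggregate(. ~ status, data = toy, FUN = mean)

# With only gene1 in the formula, no row contains an NA and nothing is dropped
by_one <- aggregate(gene1 ~ status, data = toy, FUN = mean)

by_tapply      # 0.25 and 0.65
by_all$gene1   # 0.30 and 0.60
by_one$gene1   # 0.25 and 0.65
```

If the real table behaves like this, the single-gene formula should agree with tapply(), while the `. ~ status` version should not.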

Alex1193 · July 12, 2022, 7:06am

Hi EconProf,
Thanks for your answer.
This is the result:

     status      Gene1
1 KRT19high 0.06449247
2  KRT19low 0.06124757

Same as tapply().

Alex1193 · July 12, 2022, 7:17am

Hi dvetsch75,

Here is the file.
I reduced the table to 50 variables. Oddly, the aggregate() result on the reduced table differs from the result on the complete table, and it still differs from the tapply() result...

tapply(df2$gene1, df2$status, mean)
group1 group2
0.06449247 0.06124757
df3 <- aggregate(data=df2, .~status, mean)
df3$gene1
[1] 0.06200115 0.05979080
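One possible explanation for the result changing with the number of columns: if rows with an NA in any column are dropped, then fewer columns means fewer dropped rows. A quick way to check is to count missing values per column and complete rows. The sketch below uses a hypothetical `toy` table rather than the attached file:

```r
# Hypothetical stand-in for the methylation table; gene1 has no NAs
toy <- data.frame(
    status = rep(c("group1", "group2"), each = 3),
    gene1  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
    gene2  = c(NA, 0.2, 0.3, 0.4, 0.5, 0.6),
    gene3  = c(0.1, 0.2, 0.3, 0.4, NA, 0.6)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows without any NA when every column is considered
rows_all <- sum(complete.cases(toy))

# Rows without any NA once gene3 is left out of the table
rows_reduced <- sum(complete.cases(toy[, c("status", "gene1", "gene2")]))

na_per_column
rows_all       # 4 of 6 rows are complete
rows_reduced   # 5 of 6 rows are complete
```

Running the same three checks on df2 and on the complete table would show whether the two tables keep different sets of rows.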

Alex1193 · July 12, 2022, 9:05am

Update:

I passed

na.rm=TRUE, na.action=NULL

as arguments to aggregate() and got the same results as tapply().
Thanks for your replies!
Alex
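For anyone landing here later, this is roughly what the fix looks like on made-up data (the `toy` table is hypothetical). The sketch assumes the formula method of aggregate(): `na.action = NULL` keeps every row, and `na.rm = TRUE` is forwarded to mean(), so NAs are skipped column by column instead of row by row. `na.action = na.pass` should behave the same way here.

```r
# Hypothetical toy data: gene1 is complete, gene2 has two missing values
toy <- data.frame(
    status = rep(c("group1", "group2"), each = 4),
    gene1  = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80),
    gene2  = c(NA, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, NA)
)

# Keep all rows and let mean() drop NAs within each column
fixed <- aggregate(. ~ status, data = toy, FUN = mean,
                   na.rm = TRUE, na.action = NULL)

# Same idea with na.pass instead of NULL
fixed_pass <- aggregate(. ~ status, data = toy, FUN = mean,
                        na.rm = TRUE, na.action = na.pass)

# Reference values from tapply() on the single column
reference <- tapply(toy$gene1, toy$status, mean)

fixed$gene1   # 0.25 and 0.65, matching tapply()
fixed$gene2   # 0.30 and 0.60, computed from the non-missing values only
reference
```

Note that each gene's mean is then based on a different number of patients whenever the NAs are not in the same rows.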
