EdgeR analysis on already pubblished data

I am not a bioinformatician and I am trying to analyze some data that are already published about an RNAseq.
The file is found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149189, which we will call in R"file".
In this file you have 9 columns, the first is the name of the genes, and the other 8 are the samples.

I wanted to do the EdgeR analysis.
I have run my code: there are 8 groups and I have divided groups like that:
group<- factor(c(1,1,2,2,3,3,4,4)), deciding that the 8 groups are replicates 2 by 2 (to test)
So I did
DGEList(counts=file[,2:9], genes=file[1], group = factor(group))
When I continue the analysis for the calculation of the dispersion(d) and then I run the function "exactTest" to compare each of my samples to each other like this:
et12 <- exactTest(d, pair=c(1,2))
et13 <- exactTest(d, pair=c(1,3))
etc for all the 8 combinations, at the end all these tables are the same: et12 - et13 - et14 etc, as if for my R there was only 1 group and not 4.
Can someone tell me what I do wrong?
I thank you in advice

Which file is it? There are several on that page. If it's GSE149189_CPM_FetusAdult_16623Gene.txt.gz then you have a problem, as this file is in CPM, while edgeR expects raw counts.

Even if I use this file (which I think is probably incorrect, as the dispersion estimates would expect counts not CPMs), and run the standard edgeR pipeline, I do get different results for the different comparisons:

#> Loading required package: limma

file <- read.delim("edger/GSE149189_CPM_FetusAdult_16623Gene.txt.gz")
#>    refGene Adult_DP_ELW30 Adult_DP_ELW33 Fetus_DP_ELW24 Fetus_DP_ELW26
#> 1     A1BG       3.582839      0.7214397      9.9376498      19.706386
#> 2 A1BG-AS1       1.016029      0.3091884      0.3078033       1.140896
#> 3      A2M     243.312184    142.3297450      0.5276628       3.111535
#> 4  A2M-AS1       4.973194      1.9581935     11.3887225      55.852046
#> 5   A4GALT       2.406384      0.8245025      0.0000000       0.000000
#> 6    A4GNT       0.267376      7.7297110      0.0000000       0.000000
#>   Fetus_DSC_ELW25 Fetus_DSC_ELW27 Fetus_IFD_ELW28 Fetus_IFD_ELW29
#> 1       20.161231        2.237523      10.4269245       12.603857
#> 2        0.000000        0.000000       0.4787874        1.018004
#> 3      674.936597      409.466646     399.8406358      726.321502
#> 4        3.070648        1.398452       4.5750791        6.980598
#> 5        1.333308        1.678142      10.7993147        2.860106
#> 6        0.000000        9.901038       0.0000000        0.000000

group<- factor(c(1,1,2,2,3,3,4,4))

d <- DGEList(counts=file[,2:9], genes=file[1], group = factor(group))
d <- estimateDisp(d)
#> Using classic mode.

et12 <- exactTest(d, pair=c(1,2))
et13 <- exactTest(d, pair=c(1,3))

#> Comparison of groups:  2-1 
#>       refGene      logFC   logCPM       PValue          FDR
#> 9124    MXRA5   6.117027 9.910775 3.225688e-23 5.362062e-19
#> 5475     GPX3  -9.425870 7.383168 2.528134e-18 2.041859e-14
#> 4956      FST   5.533198 7.576499 4.878097e-18 2.041859e-14
#> 6446   ITGBL1  -8.144565 6.991279 5.176242e-18 2.041859e-14
#> 9021     MT1X  -7.208880 6.847090 6.484258e-18 2.041859e-14
#> 11243  PRUNE2 -10.105939 7.067758 7.370003e-18 2.041859e-14
#> 8586      MGP -11.073128 9.126894 9.023969e-18 2.142935e-14
#> 4840     FLT1  -6.610540 7.131759 5.337540e-17 1.109074e-13
#> 3045    CSMD2   5.428040 7.642013 7.124004e-17 1.315804e-13
#> 9587     NNAT   5.950003 7.196916 1.290302e-16 2.144870e-13
#> Comparison of groups:  3-1 
#>       refGene     logFC    logCPM       PValue          FDR
#> 8536    MFAP5  9.993961  8.748015 1.572534e-32 2.614023e-28
#> 9124    MXRA5  6.739589  9.910775 1.085525e-26 9.022339e-23
#> 12276   RSPO4 -7.422496  8.736637 4.890468e-25 2.709809e-21
#> 2815   COL1A1  5.711712 15.092326 3.848710e-24 1.599428e-20
#> 3504     DIO3 -9.453647  7.274561 9.289976e-22 3.088545e-18
#> 13156   SLIT3  5.162058  9.778351 2.265243e-21 6.275856e-18
#> 5633      H19  7.567419  7.234105 3.136139e-21 7.447433e-18
#> 11091   PRELP  7.058215  8.144082 7.485624e-21 1.555419e-17
#> 5688     HBG2 10.221019  8.503723 3.053392e-20 5.193196e-17
#> 91     ABI3BP  7.385684  8.256799 3.124103e-20 5.193196e-17

xx <- as.matrix(file[,2:9])
rownames(xx) <- file[,1]


Created on 2022-11-03 by the reprex package (v2.0.1)

So it is unclear to me what exactly you did, could you provide the exact code that you're running, ideally as a reprex?

I thank you for you answer. Yeah, ok thanks. Then, yes I am using the TPM file (so, I cannot really do edgeR on it? in that case, how can I have comparisons between the conditions and obtaining pvalues for each gene between different conditions?), but let's imagine I could do it.
This is what I did:
charge = read_excel("mypath/GSE149189_CPM_FetusAdult_16623Gene.txt.gz", row.names("refGene"))
group<- factor(c(1,1,2,2,3,3,4,4))
#faire l'analyse
analys <- DGEList(counts=charge[,2:9], genes=charge[1], group = factor(group))
d <- analys
apply(d$counts, 2, sum) # total gene counts per sample
#keep only very expressed genes
keep <- rowSums(cpm(d)>100) >= 2
d <- d[keep,]
dim(d) #3308
#we reset the sample size
d$samples$lib.size <- colSums(d$counts)
d1 <- calcNormFactors(d)
#we explore the datas
plotMDS(d, method="bcv", col=as.numeric(d$samples$group))
legend("bottomleft", as.character(unique(d$samples$group)), col=1:3, pch=20)

we Estimating the Dispersion

d2 <- estimateCommonDisp(d1, verbose=T)
#Disp = 0.14035 , BCV = 0.3746
d3 <- estimateTagwiseDisp(d2)
#GLM estimates of dispersion
design.mat <- model.matrix(~ 0 + d$samples$group)
colnames(design.mat) <- levels(d$samples$group)
d4 <- estimateGLMCommonDisp(d,design.mat)
d5 <- estimateGLMTrendedDisp(d4,design.mat, method="power")

You can change method to "auto", "bin.spline", "power", "spline", "bin.loess".

The default is "auto" which chooses "bin.spline" when > 200 tags and "power" otherwise.

d6 <- estimateGLMTagwiseDisp(d5,design.mat)

#The differential expression

et12 <- exactTest(d3, pair=c(1,2)) # compare groups Adult and DP
e12 <- as.data.frame(et12)

et13 <- exactTest(d3, pair=c(1,3)) # compare groups Adult and DSC
e13 <- as.data.frame(et13)

et14<- exactTest(d3, pair=c(1,4)) # compare groups Adult and IFD
e14 <- as.data.frame(et14)

et23 <- exactTest(d3, pair=c(2,3)) # compare groups DP and DSC
e23 <- as.data.frame(et23)

et24 <- exactTest(d3, pair=c(2,4)) # compare groups DP and IFD
e24 <- as.data.frame(et24)

et34 <- exactTest(d3, pair=c(3,4)) # compare groups DSC and IFD
e34 <- as.data.frame(et34)

###all of these files, e34, e12, e24 etc are the same results

topTags(et12, n=10)
#check the total number of differentially expressed genes
de12 <- decideTestsDGE(et12, adjust.method="BH", p.value=0.05)
de13 <- decideTestsDGE(et13, adjust.method="BH", p.value=0.05)
de14 <- decideTestsDGE(et14, adjust.method="BH", p.value=0.05)
de23 <- decideTestsDGE(et23, adjust.method="BH", p.value=0.05)
de24 <- decideTestsDGE(et24, adjust.method="BH", p.value=0.05)
de34 <- decideTestsDGE(et23, adjust.method="BH", p.value=0.05)
#plot it #you can plot for all of them
de1tags12 <- rownames(d1)[as.logical(de1)]
plotSmear(et12, de.tags=de1tags12)
abline(h = c(-2, 2), col = "purple")

I re-ran your code almost exactly (2 minor changes:

  1. I replaced read_excel() by read.delim(), as read_excel() gives the error message "Error: Can't establish that the input is either xls or xlsx.", which makes sense as the file .txt.gz file is indeed not an Excel file,
  2. I replaced write_xlsx() by write.csv() because I don't have the right package installed, but that shouldn't change the result

), and I do get results that are different:

> summary(de12)
Down    269
NotSig 2727
Up      312

> summary(de13)
Down    224
NotSig 2664
Up      420

and similarly, the csv/Excel files are different etc.

My advice here would be to make sure you restart your R session, and re-run everything at once.

I can't promise you that everything you're doing is correct (because I would have to re-read the edgeR manual, haven't used it in a while), but overall the code looks good, and should give you different DE genes for different groups. So I suspect something weird happened while you ran it, maybe you had objects left in memory from a previous session, maybe you didn't rerun everything in the same order, ...

I'm 90% sure you can't: edgeR makes assumptions about read distributions, so if you give CPM, TPM, or other transformed reads as input, these assumptions are wrong and the result of edgeR's analysis are not correct (I would expect that they still go in the right general direction, but when for example you select a p-value at 0.05 it will be more than 5% false positives, all in all you can't publish that)

I was going to say you have to realign everything yourself, but you're in luck! The authors were nice, and also uploaded the raw counts on GEO! If you click on a sample, for example the first one, there is a supplementary file "GSM4491915_Fetus_DP_ELW24.txt.gz" which contains the raw counts. In the samples description they do explain:

Supplementary_files_format_and_content: The tab-delimited text files: (1) 8 files with count values of 26669 genes for each sample produced by htseq-count (2) 1 file with count-per-million (CPM) values of 16623 genes produced by edgeR

So the single file with all samples is in CPMs, but the 8 files uploaded with each of the sample are in raw counts. You just need to click to each sample and download each file, then combine them.

Thanks a lot for your answer, I have run the code again and it sounded fine you have been so kind !!! Thanks :blush:

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.