readDGE on online subsets

MartaLagg · January 16, 2023, 12:40pm

Hello, nice community!
I am very bad at bioinformatics, and I need to solve problems that are very complex for me, but maybe very easy for you.
Topic: RNAseq analysis with EdgeR
Problem: I need to compare some results between them, coming from two different experiments and labs
I need to have a single dataset by merging two different datasets, I am exercising with these two datasets from GEO:

The file called: GSE193269_Raw_counts_BMP_NMKs.txt.gz

GSE109448_rnaseq_gene_counts.tsv.gz

The files then do not have the same number of raws, which is normal, because the experiments are different.

Me, I need to compare them, this means that I need to obtain a single file with both of them!

I know that we can use the function readDGE.

Now, I have tried everything, but I always have error messages. Can someone please tell me how I am supposed to create a DGlist with these two files merged correctly?

I don't know how to do it. I have searched everywhere with zero explanation...
Please..... Help me
Thanks in advance

AlexisW · January 16, 2023, 5:46pm

That's a typical case of bioinformatics: the commands work fine, but you are coming after other people and it takes a lot of work to understand and clean up what they did.

If the input files have the right format, the command can be used like:

library(edgeR)

file_paths <- c("GSE_data/GSE109448_rnaseq_gene_counts.tsv.gz",
                "GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")

dge <- readDGE(file_paths)

If you run it, you'll see that the first file (GSE109448) is loaded fine; you can check with:

dge <- readDGE(file_paths[1])

But now, if you look at the content of that object, you'll notice that the values are not integer:

head(dge$counts)
                 Samples
Tags              GSE_data/GSE109448_rnaseq_gene_counts.tsv
  ENSG00000000003                                  60.42262
  ENSG00000000005                                   0.00000
  ENSG00000000419                                  69.20553
  ENSG00000000457                                  42.27528
  ENSG00000000460                                  15.16422
  ENSG00000000938                                  17.00000

So, that means this file does NOT contain raw reads, and is not appropriate for reanalysis. To understand what exactly is in these files, it might take some investigation, maybe the methods of the paper explain it.

Now let's go to the other dataset: GSE193269.

dge <- readDGE(file_paths[2])
Error in readDGE(file_paths[2]) : 
  There are 2057 repeated row names in GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz. Row names must be unique.

The error message is clear, we need to investigate why this happens. So let's load the table manually:

dat <- read.delim("GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")

head(dat)
head(dat)
  gene_name            gene_id  B1  B2  B3  B4  B5  B6  C1  C2
1    TSPAN6 ENSG00000000003.10 146 156 175 157 142 112  94 147
2      TNMD  ENSG00000000005.5   0   0   0   0   0   0   0   0
3      DPM1  ENSG00000000419.8 207 220 326 254 286 208 200 198
4     SCYL3  ENSG00000000457.9 118 144 178 109 141 109  77  84
5  C1orf112 ENSG00000000460.12  16  34  29  32  21  21  15  28
6       FGR  ENSG00000000938.8   0   0   1   0   0   0   1   0
   C3  C4  C5  C6
1 156 178 169 199
2   0   0   0   0
3 226 251 283 284
4 130 132 112 136
5  23  34  32  23
6   1   0   0   0

So, the table looks good, but you notice it has two columns with gene identifiers, so you'd need to change the columns argument of readDGE(), or to preprocess the file (we'll come back to that in a minute).

Let's check this duplicate thing:

> anyDuplicated(dat$gene_name)
[1] 4793
> anyDuplicated(dat$gene_id)
[1] 0

so as edgeR noticed, some gene names are duplicated, but gene IDs are not. So if we simply use gene IDs rather than gene names we'll get rid of the error. But first let's look at what those duplicates are:

> position_duplicates <- which(duplicated(dat$gene_name))
> names_duplicates <- dat$gene_name[position_duplicates]
> 
> head(names_duplicates)
[1] "2.Mar"    "TMEM236"  "1.Mar"    "Y_RNA"    "GOLGA6L9"
[6] "Y_RNA"

So you have problems.

> dat[dat$gene_name == "2.Mar", 1:2]
     gene_name            gene_id
2129     2.Mar  ENSG00000099785.6
4793     2.Mar ENSG00000117791.11
> dat[dat$gene_name == "1.Mar",1:2]
      gene_name           gene_id
8769      1.Mar ENSG00000145416.9
16458     1.Mar ENSG00000186205.8

So some problems are due to the fact the data was in Excel at some point, which automatically changes gene names like MTARC2 or MARCHF2 to dates. Bad. Known. See this, that, and that for details.

Others might be due to changes in the databases:

> dat[dat$gene_name == "TMEM236", 1:2]
      gene_name           gene_id
9177    TMEM236 ENSG00000148483.7
15876   TMEM236 ENSG00000184040.7

ENSG00000184040 is deprecated, not sure why it has the same gene name as ENSG00000148483 (I don't know human genetics).

While you should probably look more at the duplicates to understand the dataset, we can "solve" all these problems by simply removing the gene_name column, and only working with gene IDs.

dat <- read.delim("GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")

write.table(dat[,-1],
            "GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz",
            sep = "\t", row.names = FALSE)

dat2 <- read.delim("GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz")
waldo::compare(dat2, dat[,-1])
✔ No differences

Yay, now you can directly read it with edgeR!

dge <- readDGE("GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz")

OK, so we solved the problem of the second file, but note that at this point, the first file has gene IDs like ENSG00000000003, while the second has the version number ENSG00000000003.10. So to make them compatible, you'll also have to modify the second to remove the version number.

Code like this should do the trick:

strsplit(dat[,2], split = ".", fixed = TRUE) |>
  sapply(\(x) x[[1]])

And of course, all that is useless if you don't get proper unnormalized data for the first file.

system · February 27, 2023, 5:46pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.