That's a typical case of bioinformatics: the commands work fine, but you are coming after other people and it takes a lot of work to understand and clean up what they did.
If the input files have the right format, the command can be used like:
library(edgeR)
file_paths <- c("GSE_data/GSE109448_rnaseq_gene_counts.tsv.gz",
"GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")
dge <- readDGE(file_paths)
If you run it, you'll see that the first file (GSE109448) is loaded fine; you can check with:
dge <- readDGE(file_paths[1])
But now, if you look at the content of that object, you'll notice that the values are not integer:
head(dge$counts)
Samples
Tags GSE_data/GSE109448_rnaseq_gene_counts.tsv
ENSG00000000003 60.42262
ENSG00000000005 0.00000
ENSG00000000419 69.20553
ENSG00000000457 42.27528
ENSG00000000460 15.16422
ENSG00000000938 17.00000
So, that means this file does NOT contain raw reads, and is not appropriate for reanalysis. To understand what exactly is in these files, it might take some investigation, maybe the methods of the paper explain it.
Now let's go to the other dataset: GSE193269.
dge <- readDGE(file_paths[2])
Error in readDGE(file_paths[2]) :
There are 2057 repeated row names in GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz. Row names must be unique.
The error message is clear, we need to investigate why this happens. So let's load the table manually:
dat <- read.delim("GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")
head(dat)
head(dat)
gene_name gene_id B1 B2 B3 B4 B5 B6 C1 C2
1 TSPAN6 ENSG00000000003.10 146 156 175 157 142 112 94 147
2 TNMD ENSG00000000005.5 0 0 0 0 0 0 0 0
3 DPM1 ENSG00000000419.8 207 220 326 254 286 208 200 198
4 SCYL3 ENSG00000000457.9 118 144 178 109 141 109 77 84
5 C1orf112 ENSG00000000460.12 16 34 29 32 21 21 15 28
6 FGR ENSG00000000938.8 0 0 1 0 0 0 1 0
C3 C4 C5 C6
1 156 178 169 199
2 0 0 0 0
3 226 251 283 284
4 130 132 112 136
5 23 34 32 23
6 1 0 0 0
So, the table looks good, but you notice it has two columns with gene identifiers, so you'd need to change the columns
argument of readDGE()
, or to preprocess the file (we'll come back to that in a minute).
Let's check this duplicate thing:
> anyDuplicated(dat$gene_name)
[1] 4793
> anyDuplicated(dat$gene_id)
[1] 0
so as edgeR noticed, some gene names are duplicated, but gene IDs are not. So if we simply use gene IDs rather than gene names we'll get rid of the error. But first let's look at what those duplicates are:
> position_duplicates <- which(duplicated(dat$gene_name))
> names_duplicates <- dat$gene_name[position_duplicates]
>
> head(names_duplicates)
[1] "2.Mar" "TMEM236" "1.Mar" "Y_RNA" "GOLGA6L9"
[6] "Y_RNA"
So you have problems.
> dat[dat$gene_name == "2.Mar", 1:2]
gene_name gene_id
2129 2.Mar ENSG00000099785.6
4793 2.Mar ENSG00000117791.11
> dat[dat$gene_name == "1.Mar",1:2]
gene_name gene_id
8769 1.Mar ENSG00000145416.9
16458 1.Mar ENSG00000186205.8
So some problems are due to the fact the data was in Excel at some point, which automatically changes gene names like MTARC2 or MARCHF2 to dates. Bad. Known. See this, that, and that for details.
Others might be due to changes in the databases:
> dat[dat$gene_name == "TMEM236", 1:2]
gene_name gene_id
9177 TMEM236 ENSG00000148483.7
15876 TMEM236 ENSG00000184040.7
ENSG00000184040 is deprecated, not sure why it has the same gene name as ENSG00000148483 (I don't know human genetics).
While you should probably look more at the duplicates to understand the dataset, we can "solve" all these problems by simply removing the gene_name column, and only working with gene IDs.
dat <- read.delim("GSE_data/GSE193269_Raw_counts_BMP_NMKs.txt.gz")
write.table(dat[,-1],
"GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz",
sep = "\t", row.names = FALSE)
dat2 <- read.delim("GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz")
waldo::compare(dat2, dat[,-1])
✔ No differences
Yay, now you can directly read it with edgeR!
dge <- readDGE("GSE_data/edited_GSE193269_Raw_counts_BMP_NMKs.txt.gz")
OK, so we solved the problem of the second file, but note that at this point, the first file has gene IDs like ENSG00000000003
, while the second has the version number ENSG00000000003.10
. So to make them compatible, you'll also have to modify the second to remove the version number.
Code like this should do the trick:
strsplit(dat[,2], split = ".", fixed = TRUE) |>
sapply(\(x) x[[1]])
And of course, all that is useless if you don't get proper unnormalized data for the first file.