Can we use UMAP for RNA-seq data?

Hi all,
I want to use UMAP to cluster RNA-seq data. I have an expression matrix file and want to see the clustering pattern of the replicates, because the DEGs show a large number of differences between replicates. I am not familiar with UMAP. My question: is it possible to see a clustering pattern of the replicates using a CPM (counts per million) matrix?

My matrix file looks like

Gene 1.rep 2.rep 3.rep 4.rep 5.rep
MSTRG.10603.1 0.353527679 0.863557219 0.154658336 0.302840468 1.68378386
MSTRG.12772.1 12.66807516 12.70662765 10.28477935 12.77986775 14.5417697
MSTRG.8334.1 13.78757948 13.0767236 13.37794608 14.65747865 10.86805946
MSTRG.11583.1 35.94198069 37.44137372 24.51334628 33.67586004 34.13489099
MSTRG.4366.1 41.18597459 42.56103437 30.77700889 36.52256043 24.64447286
MSTRG.4203.1 82.07734278 85.4921647 113.1325729 84.85589912 54.95258235
MSTRG.6397.1 4.890466225 5.304708632 5.026395925 4.663743206 4.285995281
MSTRG.785.1 54.08973487 72.72385439 55.36768434 54.45071614 68.26978197
MSTRG.6825.1 534.4160079 471.440559 563.2656603 515.7373169 505.1351581
MSTRG.1448.1 58.86235854 63.16304232 49.49066757 57.96366556 41.17616895
... and so on, up to 10500 genes.

Kindly suggest how I can see the clustering pattern of the 5 replicates. I am sorry for the basic question; I am new to R.

Thank you in advance

Technically, yes, it's perfectly possible. Whether it's the right thing to do... I would say no.

For (bulk) RNA-Seq, the typical packages to use are {DESeq2} (vignette) and {edgeR} (user guide). If you read the linked vignettes, you'll see that both have a step where they plot a PCA or MDS. This is what you want to do here: it treats each replicate as a point in a 10500-dimensional space and reduces the dimensionality so the samples can be plotted in 2D on a screen. If your experiment worked as expected, the dimensions that are kept correspond to the experimental parameters (treatment, batch, ...), so the samples cluster accordingly.
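To make the "each replicate is a point in gene space" idea concrete, here is a minimal sketch of a PCA on the ten example rows from the question. It is written in plain Python with no dependencies, purely for illustration. The log2(x + 1) transform, the power-iteration solver and all function names are my own choices, not what DESeq2 or edgeR do internally. In R, `prcomp()` on the transposed log matrix does the same job in one call.

```python
import math

# The ten example rows from the question (CPM values; columns are replicates 1-5).
cpm = [
    [0.353527679, 0.863557219, 0.154658336, 0.302840468, 1.68378386],
    [12.66807516, 12.70662765, 10.28477935, 12.77986775, 14.5417697],
    [13.78757948, 13.0767236, 13.37794608, 14.65747865, 10.86805946],
    [35.94198069, 37.44137372, 24.51334628, 33.67586004, 34.13489099],
    [41.18597459, 42.56103437, 30.77700889, 36.52256043, 24.64447286],
    [82.07734278, 85.4921647, 113.1325729, 84.85589912, 54.95258235],
    [4.890466225, 5.304708632, 5.026395925, 4.663743206, 4.285995281],
    [54.08973487, 72.72385439, 55.36768434, 54.45071614, 68.26978197],
    [534.4160079, 471.440559, 563.2656603, 515.7373169, 505.1351581],
    [58.86235854, 63.16304232, 49.49066757, 57.96366556, 41.17616895],
]

def log_center(matrix):
    """log2(x + 1) transform, then subtract each gene's mean across samples."""
    out = []
    for row in matrix:
        logged = [math.log2(x + 1.0) for x in row]
        mean = sum(logged) / len(logged)
        out.append([v - mean for v in logged])
    return out

def gram(matrix):
    """Sample-by-sample matrix X^T X for a genes x samples matrix X."""
    n = len(matrix[0])
    return [[sum(r[i] * r[j] for r in matrix) for j in range(n)] for i in range(n)]

def matvec(sym, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in sym]

def top_eigenpairs(sym, k, iters=5000):
    """Leading k eigenpairs of a symmetric matrix by power iteration,
    re-orthogonalising against the eigenvectors already found."""
    found = []
    for _component in range(k):
        v = [float(i + 1) for i in range(len(sym))]  # arbitrary fixed start
        for _step in range(iters):
            w = matvec(sym, v)
            for _value, u in found:
                proj = sum(a * b for a, b in zip(w, u))
                w = [a - proj * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        value = sum(a * b for a, b in zip(v, matvec(sym, v)))
        found.append((value, v))
    return found

def pca_scores(cpm_matrix, k=2):
    """Return per-sample scores on the first k PCs and the variance explained."""
    g = gram(log_center(cpm_matrix))
    total = sum(g[i][i] for i in range(len(g)))
    pairs = top_eigenpairs(g, k)
    scores = [[math.sqrt(val) * vec[s] for val, vec in pairs]
              for s in range(len(g))]
    explained = [val / total for val, _vec in pairs]
    return scores, explained

scores, explained = pca_scores(cpm)
for rep, (pc1, pc2) in enumerate(scores, start=1):
    print(f"rep {rep}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
print("fraction of variance explained:", [round(e, 3) for e in explained])
```

Replicates that end up close together on PC1/PC2 have similar expression profiles, and a replicate sitting far from the others is the one to look at first. With only ten genes this is a toy; on the full 10500-gene matrix the idea is the same.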

So, why does scRNA-Seq use UMAP? Basically because there are too many samples. PCA puts a strong constraint on the reduced dimensions it finds: they have to be orthogonal. So, if the information is spread over many dimensions, PCA will fail to show all of it in the first 2 principal components, and you'd need to look at many more. On the other hand, UMAP can "torture" its axes until all the information fits in 2D, but in the process the axes become meaningless, so the distances between clusters and their positions are hard to interpret. In summary, if you have thousands of samples (single cells) in hundreds of conditions (clusters), PCA can't show it properly. If you have barely a dozen samples in a couple of conditions (like here), then PCA will capture anything important, and the representation can be interpreted.

Final note: from the format of your matrix, I think you used StringTie to discover and quantify transcripts, and took the "TPM" output. I've never done de novo genome annotation, so I won't comment beyond a warning: novel transcript discovery is a hard problem, so make sure you read up on the available methods. If it's a species that already has an annotation, you probably want to consider using it directly (see the user guides above for recommendations, e.g. Salmon). And if you want to load that data into edgeR or DESeq2, make sure you use the count output, not TPM or FPKM.
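To illustrate why the count output matters, here is a small sketch of what a per-million scaling does. The counts are made up and `counts_to_cpm` is a hypothetical helper, not a function from either package. The point is only that once each sample is divided by its own library size, the sequencing depth those packages rely on for their own normalisation is gone.

```python
def counts_to_cpm(counts):
    """Convert a genes x samples matrix of raw counts to counts per million."""
    n_samples = len(counts[0])
    # Library size = total count in each sample (column).
    lib_sizes = [sum(row[s] for row in counts) for s in range(n_samples)]
    return [[row[s] / lib_sizes[s] * 1e6 for s in range(n_samples)]
            for row in counts]

# Made-up counts for three genes in two samples.
# Sample 2 was sequenced twice as deeply as sample 1.
raw_counts = [
    [10, 20],
    [30, 60],
    [60, 120],
]

cpm_values = counts_to_cpm(raw_counts)
for gene, row in enumerate(cpm_values, start=1):
    print(f"gene {gene}: " + "  ".join(f"{v:.0f}" for v in row))
```

After scaling, the two samples look identical even though one had twice the reads, so nothing downstream can tell how much evidence was behind each value.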

All that being said, if you really want to do a UMAP, see the {uwot} and {umap} packages. With 5 replicates a PCA would be much better suited though.
