Selecting for cells that express example: gene A+, gene B+ & gene C- in scRNA seq data (Seurat) in R?

Does any one have an Idea?

It's not hard, but may be a bad idea because of sparsity (see below).

First, how to do it: let's take some example data from Seurat:

library(Seurat)
library(SeuratData)

# if needed
# InstallData("pbmc3k")


pbmc3k <- LoadData("pbmc3k")

features <- c("LYZ", "IL32", "PTPRCAP")

RidgePlot(pbmc3k, features = features, ncol = 2)

So we can ask for the cells that express LYZ and IL32 but not PTPRCAP, here using the most basic threshold possible:

pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)

Note, you could also do it more manually:


pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)


gene_expression <- FetchData(pbmc3k, vars = features)

head(gene_expression)
#>                LYZ IL32 PTPRCAP
#> AAACATACAACCAC   1    0       1
#> AAACATTGAGCTAC   3    0       9
#> AAACATTGATCAGC   2   14       1
#> AAACCGTGCTTCCG  24    0       0
#> AAACCGTGTATGCG   0    0       0
#> AAACGCACTGGTAC   1    7       3

is_right_expression <- (gene_expression$LYZ > 0 &
                          gene_expression$IL32 > 0 &
                          gene_expression$PTPRCAP == 0)

table(is_right_expression)
#> 
#> FALSE  TRUE 
#>  2513   187

cells_to_keep <- rownames(gene_expression)[is_right_expression]

head(cells_to_keep)
#> [1] "AAACGCTGACCAGT" "AAACGCTGGTTCTT" "AAAGAGACGGCATT" "AAAGCAGAAGCCAT"
#> [5] "AAATCAACGGAAGC" "AAATCAACTCGCAA"
pbmc_subset2 <- subset(pbmc3k, cells = cells_to_keep)

all.equal(pbmc_subset, pbmc_subset2)
#> [1] TRUE

Created on 2023-12-18 with reprex v2.0.2

Here, I used the raw counts because that's what was available in the object, you could use some normalized count and adapt the thresholds. See DefaultAssay().

The sparsity problem

There is a problem with this approach: scRNA-Seq data is sparse, in most individual cells a gene is not measured even if expressed. In my example above, it's not obvious because these genes seem highly expressed, but in general it is common that in a given cluster, only a small fraction of the cells have non-zero reads. So you'll rarely find co-expression in a single cell, even if there is co-expression at the level of the cell type.

This is still an active area of research (see co-expression network papers), so if you don't want to go into heavy mathematics, there are two "easy" approaches:

  • work at the level of clusters, using AggregateExpression()to make a "pseudo-bulk", and then compare the number of reads per same cluster.
  • use some imputation method, which will essentially "smooth" the distribution of read, and make each cell more similar to its neighbors. As suggested here, you might want to try out a few methods.
1 Like

Thank you AlexisW

That was simpler than I thought, I was trying a more difficult way (you saved my time).

Btw: still the thing is that I wanted to plot the subset together with the original Seurat object (in the way like to overlay the subset on top of the original object, in order to have more insight of these cell distribution in the different clusters).

Thank you

Do you mean, plotting gene expression on the UMAP? The simplest is to give several genes in FeaturePlot() which will plot the UMAPs for all of them using faceting.

I would also recommend DoHeatmap() that doesn't keep the UMAP coordinates, but allows you to visualize the individual cells per cluster and compare easily.

More complex, you could extract the genes of interest and color the UMAP based on their combination. I don't think you could use pre-built Seurat functions, so you'd want to extract the UMAP coordinates for the object and use them in {ggplot} with geom_point() to rebuild the UMAP, and separately extract the genes of interest and build a color scale.

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.