Selecting for cells that express example: gene A+, gene B+ & gene C- in scRNA seq data (Seurat) in R?

AlexisW · December 18, 2023, 6:44pm

It's not hard, but may be a bad idea because of sparsity (see below).

First, how to do it: let's take some example data from Seurat:

library(Seurat)
library(SeuratData)

# if needed
# InstallData("pbmc3k")


pbmc3k <- LoadData("pbmc3k")

features <- c("LYZ", "IL32", "PTPRCAP")

RidgePlot(pbmc3k, features = features, ncol = 2)

So we can ask for the cells that express LYZ and IL32 but not PTPRCAP, here using the most basic threshold possible:

pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)

Note, you could also do it more manually:


pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)


gene_expression <- FetchData(pbmc3k, vars = features)

head(gene_expression)
#>                LYZ IL32 PTPRCAP
#> AAACATACAACCAC   1    0       1
#> AAACATTGAGCTAC   3    0       9
#> AAACATTGATCAGC   2   14       1
#> AAACCGTGCTTCCG  24    0       0
#> AAACCGTGTATGCG   0    0       0
#> AAACGCACTGGTAC   1    7       3

is_right_expression <- (gene_expression$LYZ > 0 &
                          gene_expression$IL32 > 0 &
                          gene_expression$PTPRCAP == 0)

table(is_right_expression)
#> 
#> FALSE  TRUE 
#>  2513   187

cells_to_keep <- rownames(gene_expression)[is_right_expression]

head(cells_to_keep)
#> [1] "AAACGCTGACCAGT" "AAACGCTGGTTCTT" "AAAGAGACGGCATT" "AAAGCAGAAGCCAT"
#> [5] "AAATCAACGGAAGC" "AAATCAACTCGCAA"
pbmc_subset2 <- subset(pbmc3k, cells = cells_to_keep)

all.equal(pbmc_subset, pbmc_subset2)
#> [1] TRUE

^{Created on 2023-12-18 with reprex v2.0.2}

Here, I used the raw counts because that's what was available in the object, you could use some normalized count and adapt the thresholds. See DefaultAssay().

The sparsity problem

There is a problem with this approach: scRNA-Seq data is sparse, in most individual cells a gene is not measured even if expressed. In my example above, it's not obvious because these genes seem highly expressed, but in general it is common that in a given cluster, only a small fraction of the cells have non-zero reads. So you'll rarely find co-expression in a single cell, even if there is co-expression at the level of the cell type.

This is still an active area of research (see co-expression network papers), so if you don't want to go into heavy mathematics, there are two "easy" approaches:

work at the level of clusters, using AggregateExpression()to make a "pseudo-bulk", and then compare the number of reads per same cluster.
use some imputation method, which will essentially "smooth" the distribution of read, and make each cell more similar to its neighbors. As suggested here, you might want to try out a few methods.