Selecting for cells that express example: gene A+, gene B+ & gene C- in scRNA seq data (Seurat) in R?

It's not hard, but may be a bad idea because of sparsity (see below).

First, how to do it: let's take some example data from Seurat:

library(Seurat)
library(SeuratData)

# if needed
# InstallData("pbmc3k")


pbmc3k <- LoadData("pbmc3k")

features <- c("LYZ", "IL32", "PTPRCAP")

RidgePlot(pbmc3k, features = features, ncol = 2)

So we can ask for the cells that express LYZ and IL32 but not PTPRCAP, here using the most basic threshold possible:

pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)

Note, you could also do it more manually:


pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)


gene_expression <- FetchData(pbmc3k, vars = features)

head(gene_expression)
#>                LYZ IL32 PTPRCAP
#> AAACATACAACCAC   1    0       1
#> AAACATTGAGCTAC   3    0       9
#> AAACATTGATCAGC   2   14       1
#> AAACCGTGCTTCCG  24    0       0
#> AAACCGTGTATGCG   0    0       0
#> AAACGCACTGGTAC   1    7       3

is_right_expression <- (gene_expression$LYZ > 0 &
                          gene_expression$IL32 > 0 &
                          gene_expression$PTPRCAP == 0)

table(is_right_expression)
#> 
#> FALSE  TRUE 
#>  2513   187

cells_to_keep <- rownames(gene_expression)[is_right_expression]

head(cells_to_keep)
#> [1] "AAACGCTGACCAGT" "AAACGCTGGTTCTT" "AAAGAGACGGCATT" "AAAGCAGAAGCCAT"
#> [5] "AAATCAACGGAAGC" "AAATCAACTCGCAA"
pbmc_subset2 <- subset(pbmc3k, cells = cells_to_keep)

all.equal(pbmc_subset, pbmc_subset2)
#> [1] TRUE

Created on 2023-12-18 with reprex v2.0.2

Here, I used the raw counts because that's what was available in the object, you could use some normalized count and adapt the thresholds. See DefaultAssay().

The sparsity problem

There is a problem with this approach: scRNA-Seq data is sparse, in most individual cells a gene is not measured even if expressed. In my example above, it's not obvious because these genes seem highly expressed, but in general it is common that in a given cluster, only a small fraction of the cells have non-zero reads. So you'll rarely find co-expression in a single cell, even if there is co-expression at the level of the cell type.

This is still an active area of research (see co-expression network papers), so if you don't want to go into heavy mathematics, there are two "easy" approaches:

  • work at the level of clusters, using AggregateExpression()to make a "pseudo-bulk", and then compare the number of reads per same cluster.
  • use some imputation method, which will essentially "smooth" the distribution of read, and make each cell more similar to its neighbors. As suggested here, you might want to try out a few methods.
1 Like