It's not hard, but may be a bad idea because of sparsity (see below).
First, how to do it: let's take some example data from Seurat:
library(Seurat)
library(SeuratData)
# if needed
# InstallData("pbmc3k")
pbmc3k <- LoadData("pbmc3k")
features <- c("LYZ", "IL32", "PTPRCAP")
RidgePlot(pbmc3k, features = features, ncol = 2)
So we can ask for the cells that express LYZ and IL32 but not PTPRCAP, here using the most basic threshold possible:
pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)
Note, you could also do it more manually:
pbmc_subset <- subset(pbmc3k, LYZ >0 & IL32 > 0 & PTPRCAP == 0)
gene_expression <- FetchData(pbmc3k, vars = features)
head(gene_expression)
#> LYZ IL32 PTPRCAP
#> AAACATACAACCAC 1 0 1
#> AAACATTGAGCTAC 3 0 9
#> AAACATTGATCAGC 2 14 1
#> AAACCGTGCTTCCG 24 0 0
#> AAACCGTGTATGCG 0 0 0
#> AAACGCACTGGTAC 1 7 3
is_right_expression <- (gene_expression$LYZ > 0 &
gene_expression$IL32 > 0 &
gene_expression$PTPRCAP == 0)
table(is_right_expression)
#>
#> FALSE TRUE
#> 2513 187
cells_to_keep <- rownames(gene_expression)[is_right_expression]
head(cells_to_keep)
#> [1] "AAACGCTGACCAGT" "AAACGCTGGTTCTT" "AAAGAGACGGCATT" "AAAGCAGAAGCCAT"
#> [5] "AAATCAACGGAAGC" "AAATCAACTCGCAA"
pbmc_subset2 <- subset(pbmc3k, cells = cells_to_keep)
all.equal(pbmc_subset, pbmc_subset2)
#> [1] TRUE
Created on 2023-12-18 with reprex v2.0.2
Here, I used the raw counts because that's what was available in the object, you could use some normalized count and adapt the thresholds. See DefaultAssay()
.
The sparsity problem
There is a problem with this approach: scRNA-Seq data is sparse, in most individual cells a gene is not measured even if expressed. In my example above, it's not obvious because these genes seem highly expressed, but in general it is common that in a given cluster, only a small fraction of the cells have non-zero reads. So you'll rarely find co-expression in a single cell, even if there is co-expression at the level of the cell type.
This is still an active area of research (see co-expression network papers), so if you don't want to go into heavy mathematics, there are two "easy" approaches:
- work at the level of clusters, using
AggregateExpression()
to make a "pseudo-bulk", and then compare the number of reads per same cluster. - use some imputation method, which will essentially "smooth" the distribution of read, and make each cell more similar to its neighbors. As suggested here, you might want to try out a few methods.