I have a data frame with a lot of RNA-seq counts (sample names as column names and genes as row names), and a file of metadata, i.e. sex, tissue type, disease status etc. (sample names as row names, and sex etc. as column names). I would like to make a subset of the RNA-seq counts data that contains just 2 of the tissue types, so that I can look at differential gene expression (DGE). Could someone suggest the best way to do this? I'm very new to working with RNA-seq data, so this may be obvious!
This is the beginning of the data frame (it is very big, so I can't post it all):
dput(head(tpm.df[1:2]))
structure(list(Description = c("DDX11L1", "WASH7P", "MIR6859-1",
"MIR1302-2HG", "FAM138A", "OR4G4P"), `GTEX-1117F-0226-SM-5GZZ7` = c(0L,
187L, 0L, 1L, 0L, 0L)), row.names = c("ENSG00000223972.5",
"ENSG00000227232.5",
"ENSG00000278267.1", "ENSG00000243485.5", "ENSG00000237613.2",
"ENSG00000268020.3"), class = "data.frame")
And this is the metadata (also just the beginning)
structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), SMCENTER = c("B1",
"B1", "B1", "B1, A1", "B1, A1", "B1"), SMPTHNTS = c("", "", "",
"", "2 pieces, ~15% vessel stroma, rep delineated")), row.names = c("GTEX-1117F-0003-SM-58Q7G",
"GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", "GTEX-1117F-0011-R10a-SM-AHZ7F",
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class = "data.frame")
This excerpt is missing the tissue type column, which is called SMTSD and contains values such as "Heart - Left Ventricle".
I tried to subset out the tissues, e.g.:
subset_lv_samples <- metadata[metadata$SMTSD%in% c("Heart - Left Ventricle"),]
subset_adipose_samples <- metadata[metadata$SMTSD%in% c("Adipose"),]
lv_samples <- rownames(subset_lv_samples)
adipose_samples <- rownames(subset_adipose_samples)
subset_tpm.df <- tpm.df[c(adipose_samples, lv_samples)]
This returns the error:
Error in `[.data.frame`(tpm.df, c(adipose_samples, lv_samples)) :
undefined columns selected
Could anyone suggest how else to tackle this? Or is there an error in my working?
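A possible way to diagnose and work around this, sketched under assumptions: `[.data.frame` raises "undefined columns selected" when any requested column name is absent, so the error suggests that some sample IDs in the metadata have no matching column in `tpm.df`. Restricting the selection to IDs found in both tables avoids that. The toy data, sample IDs, and the helper names `tissues`, `wanted`, `missing_ids`, `keep`, and `subset_meta` below are hypothetical. The label "Adipose - Subcutaneous" is also an assumption: GTEx tissue labels appear to be more specific than plain "Adipose", so it is worth checking `unique(metadata$SMTSD)` for the exact strings.

```r
# Toy stand-ins for tpm.df and metadata (hypothetical values, not real GTEx data)
tpm.df <- data.frame(
  Description = c("DDX11L1", "WASH7P"),
  `GTEX-A-LV` = c(0L, 187L),
  `GTEX-B-AD` = c(3L, 12L),
  `GTEX-C-LU` = c(5L, 9L),
  row.names = c("ENSG00000223972.5", "ENSG00000227232.5"),
  check.names = FALSE  # keep the hyphenated sample IDs as column names
)
metadata <- data.frame(
  SMTSD = c("Heart - Left Ventricle", "Adipose - Subcutaneous",
            "Lung", "Heart - Left Ventricle"),
  # The last sample has metadata but no expression column
  row.names = c("GTEX-A-LV", "GTEX-B-AD", "GTEX-C-LU", "GTEX-D-LV")
)

# Exact tissue labels to keep (assumed; compare with unique(metadata$SMTSD))
tissues <- c("Heart - Left Ventricle", "Adipose - Subcutaneous")
wanted <- rownames(metadata)[metadata$SMTSD %in% tissues]

# Sample IDs with no matching column would trigger "undefined columns selected"
missing_ids <- setdiff(wanted, colnames(tpm.df))

# Keep only the IDs present in both tables, then subset both consistently
keep <- intersect(wanted, colnames(tpm.df))
subset_tpm.df <- tpm.df[, keep, drop = FALSE]
subset_meta <- metadata[keep, , drop = FALSE]
```

Subsetting the metadata with the same `keep` vector keeps its rows in the same order as the columns of `subset_tpm.df`, which most DGE tools expect. Inspecting `missing_ids` shows which samples were dropped and why the original selection failed.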