subsetting data based on values in one column from another data frame

mscr · July 21, 2022, 4:09pm

I have a data frame with a lot of RNA-seq counts (sample names as column names and genes as row names), and a file of metadata, i.e. sex, tissue type, disease status etc. (sample names as row names, and sex etc. as column names). I would like to create a subset of the RNA-seq counts data that contains just 2 of the tissue types, so that I can look at DGE. Could someone suggest the best way to do this? I'm very new to working with RNA-seq data so this may be obvious!

This is the beginning of the data frame (it is very big so I can't post it all):

dput(head(tpm.df[1:2])) 
structure(list(Description = c("DDX11L1", "WASH7P", "MIR6859-1", 
"MIR1302-2HG", "FAM138A", "OR4G4P"), `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 
187L, 0L, 1L, 0L, 0L)), row.names = c("ENSG00000223972.5", 
"ENSG00000227232.5", 
"ENSG00000278267.1", "ENSG00000243485.5", "ENSG00000237613.2", 
"ENSG00000268020.3"), class = "data.frame")

And this is the metadata (also just the beginning)

structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), SMCENTER = c("B1", 
"B1", "B1", "B1, A1", "B1, A1", "B1"), SMPTHNTS = c("", "", "", 
"", "", "2 pieces, ~15% vessel stroma, rep delineated")), row.names = c("GTEX-1117F-0003-SM-58Q7G", 
"GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", "GTEX-1117F-0011-R10a-SM-AHZ7F", 
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class = "data.frame")

This is missing the tissue type column, but it is called SMTSD and contains values such as "Heart - Left Ventricle".

I tried to subset out the tissues e.g.

subset_lv_samples <- metadata[metadata$SMTSD%in% c("Heart - Left Ventricle"),]
subset_adipose_samples <- metadata[metadata$SMTSD%in% c("Adipose"),]
lv_samples <- rownames(subset_lv_samples)
adipose_samples <- rownames(subset_adipose_samples)
subset_tpm.df <- tpm.df[c(adipose_samples, lv_samples)]

this returns the error:

Error in `[.data.frame`(tpm.df, c(adipose_samples, lv_samples)) :
  undefined columns selected

Could anyone suggest how else to tackle this? Or is there an error in my working?

jrkrideau · July 21, 2022, 4:31pm

I do not see SMTSD in metadata.

How are the data arranged in the two data.frames? For example, do the data in row one of metadata and tpm.df correspond? If so, and the numbers of rows are equal, you could just do a cbind().
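
Going by the description in the question, the samples are rows of metadata but columns of tpm.df, so a direct cbind() would probably need a transpose first. A possible first step is to check which sample IDs actually appear in both tables. This is only a sketch: the toy data and the sample IDs S1 to S4 are made up for illustration, and only the object names tpm.df, metadata and SMTSD come from the post.

```r
# Toy stand-ins for the real objects (all values are hypothetical).
tpm.df <- data.frame(
  Description = c("geneA", "geneB"),
  S1 = c(0L, 187L),
  S2 = c(4L, 12L),
  S3 = c(9L, 1L),
  row.names = c("ENSG_A", "ENSG_B")
)
metadata <- data.frame(
  SMTSD = c("Heart - Left Ventricle", "Adipose - Subcutaneous",
            "Heart - Left Ventricle", "Lung"),
  row.names = c("S1", "S2", "S4", "S3")
)

# Sample IDs that are both a metadata row name and a count column name.
shared <- intersect(rownames(metadata), colnames(tpm.df))

# Metadata samples that have no column in the count table.
missing_from_counts <- setdiff(rownames(metadata), colnames(tpm.df))

# Put both tables in the same sample order before any downstream analysis.
meta_aligned   <- metadata[shared, , drop = FALSE]
counts_aligned <- tpm.df[, shared, drop = FALSE]
stopifnot(identical(rownames(meta_aligned), colnames(counts_aligned)))
```

If missing_from_counts is not empty, indexing tpm.df with the full set of metadata row names will fail, which is worth ruling out before anything else.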

nirgrahamuk · July 21, 2022, 4:44pm

It's hard to support you because your example doesn't reproduce your error message. Rather, your example runs without any formal error; it simply produces an empty data frame (with 0 columns and 6 rows).
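
For what it's worth, here is a small sketch of how both outcomes can arise. It assumes toy data with hypothetical sample IDs S1 and S2; the metadata here deliberately has no SMTSD column, like the posted excerpt.

```r
# Toy data (hypothetical values); metadata has no SMTSD column.
tpm.df <- data.frame(
  Description = c("geneA", "geneB"),
  S1 = c(0L, 187L),
  row.names = c("ENSG_A", "ENSG_B")
)
metadata <- data.frame(SMCENTER = c("B1", "B1"), row.names = c("S1", "S2"))

# A missing column is NULL, so %in% returns a zero-length logical vector.
is.null(metadata$SMTSD)  # TRUE
keep <- metadata$SMTSD %in% c("Heart - Left Ventricle")

# Zero rows are selected, so there are no sample names to index with.
lv_samples <- rownames(metadata[keep, , drop = FALSE])

# Indexing with character(0) is not an error; it returns zero columns.
empty <- tpm.df[lv_samples]
dim(empty)  # 2 rows, 0 columns

# Indexing with a name that is not a column of tpm.df is an error
# ("S2" has no column here); the message is captured instead of stopping.
err <- tryCatch(tpm.df[c("S1", "S2")], error = function(e) conditionMessage(e))
```

So an empty result points at the filter matching nothing, while "undefined columns selected" points at sample names in the metadata that are not columns of the count table.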

I do think I can still give you useful advice. All these sorts of manipulations of tabular data are greatly simplified for the programmer by the more convenient syntax and powerful tooling of dplyr and its sister packages that form the tidyverse. You can read about using such syntax in this easy-to-read book: https://r4ds.had.co.nz/
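
As a rough sketch of what that could look like with dplyr and tibble: the toy data, the sample IDs S1 to S4, the label "Adipose - Subcutaneous" and the new column names sample_id and gene_id are all assumptions for illustration, not taken from your files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the real objects (all values are hypothetical).
tpm.df <- data.frame(
  Description = c("geneA", "geneB"),
  S1 = c(0L, 187L), S2 = c(4L, 12L), S3 = c(9L, 1L),
  row.names = c("ENSG_A", "ENSG_B")
)
metadata <- data.frame(
  SMTSD = c("Heart - Left Ventricle", "Adipose - Subcutaneous",
            "Lung", "Heart - Left Ventricle"),
  row.names = c("S1", "S2", "S3", "S4")
)

# %in% matches whole strings only, so these must be the exact labels.
tissues <- c("Heart - Left Ventricle", "Adipose - Subcutaneous")

# Move row names into a real column, then keep samples from those tissues.
wanted_ids <- metadata %>%
  rownames_to_column("sample_id") %>%
  filter(SMTSD %in% tissues) %>%
  pull(sample_id)

# any_of() silently skips IDs that are not columns of the count table
# (S4 here), instead of failing with "undefined columns selected".
subset_tpm <- tpm.df %>%
  rownames_to_column("gene_id") %>%
  select(gene_id, Description, any_of(wanted_ids))
```

Because %in% compares whole strings, a value like "Adipose" will not match a longer label; running unique(metadata$SMTSD) on the real metadata shows the exact labels to use.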
