How can I find common features between multiple tables?

DC7 · December 14, 2020, 5:55am

I have multiple tables with tab separated format (four attached) each with five columns. The first column is the features , the third is the sample_type (although not mentioned in the tables).

TABLE: a.csv

Streptococcaceae	4.17705641901	control	3.47827812404	0.337904019103
Firmicutes_unclassified	4.43447010448	control	3.9217759446	0.0348055639329
Actinomycetaceae	2.51369314622	control	2.16004135478	0.0243610387108
Eubacteriaceae	4.80377020941	control	4.27387393831	0.0845334692571
Tannerellaceae	4.14816223344	control	3.275816707	0.949059680886
Oxalobacteraceae	2.38299415571	test	2.0825523478	0.65669077294
Staphylococcaceae	1.7758703522	test	3.29562599424	0.317310507863
Propionibacteriaceae	1.491161512	test	2.37370336438	0.916511907864
Lactobacillaceae	3.24263012709	test	2.65310263606	0.654720846019
Clostridiales_unclassified	3.33143883197	test	2.68073144255	0.22479909925

TABLE b.csv

Streptococcaceae	4.14574536955	control	3.75857890864	0.0412500165939
Firmicutes_unclassified	4.54285192263	control	3.6456600168	0.545349668011
Eubacteriaceae	4.72053565932	control	4.06108447754	0.449691797969
Pseudomonadaceae	2.47280035866	control	2.20503368624	0.146793087309
Erysipelotrichaceae	3.71662926551	control	3.05668482543	0.939742989577
Victivallaceae	0.527629857442	control	2.74539202586	0.317310507863
Tannerellaceae	4.50079062345	test	3.58357798914	0.820595839755
Veillonellaceae	3.87333100809	test	3.19818181055	0.289192119229
Rikenellaceae	4.9250780149	test	4.35390701567	0.545349668011
Lactobacillaceae	3.0789422455	test	2.71335701105	0.412191111952
Helicobacteraceae	2.96591152623	test	2.78958989566	0.317310507863

TABLE: c.csv

Firmicutes_unclassified	4.3949137452	control	3.80320556213	0.202318749895
Eubacteriaceae	4.67973931478	control	3.77752868209	0.914742321683
Tannerellaceae	4.24480153834	control	3.57432876491	0.0575043321308
Veillonellaceae	4.31037823286	control	3.77586730047	0.488677469577
Burkholderiaceae	0	control	2.38754970657	0.356028985238
Streptococcaceae	3.70445898562	test	2.99074124872	0.976675416504
Proteobacteria_unclassified	3.54646811621	test	3.11297394996	0.0769750871296
Erysipelotrichaceae	4.0459816119	test	3.44639785866	0.526971732416
Sutterellaceae	3.52589117576	test	2.94530632209	0.173128322568
Acidaminococcaceae	4.43174037412	test	3.88725876468	0.174392135504

TABLE: d.csv

Firmicutes_unclassified	4.48268161942	control	4.08546779118	0.0378086594704
Sutterellaceae	3.72334956471	control	3.17665465678	0.369158808941
Proteobacteria_unclassified	3.41146437028	control	3.03389982079	0.195865532789
Tannerellaceae	4.51120731157	control	3.82196492147	0.143867728482
Streptococcaceae	3.14327432538	test	2.56433651799	0.89796929457
Eubacteriaceae	4.80334041206	test	4.25388289196	0.488744120387
Veillonellaceae	4.16830871806	test	3.64943977262	0.411122991069
Pseudomonadaceae	0.591920446836	test	2.86633204137	0.317310507863
Caulobacteraceae	0.848425939934	test	2.67734038571	0.317310507863

Now, I want some script for the following tasks:

I want to find out the features with the same sample_type across all the four files and write them in an output file with their name and sample_type.
I will also like to get the features with the same sample_type among three and two out of four files and write them separately in the same file.
It would be great if the corresponding file name of the common features are mentioned.

Can anyone please help me?

Thanks

AlexisW · December 14, 2020, 6:24am

Why bash? In R, it should be possible by reading the table, then using intersect() and table() to find the features of interest, and subsetting the tables accordingly. In bash, you might be able to get that with some awk magic, but that seems exceedingly complicated, R is just better suited for this. Also, this is an R forum...

DC7 · December 14, 2020, 6:27am

H sorry @AlexisW, I mistakenly wrote about bash. I just need some R solution only.

AlexisW · December 14, 2020, 6:32am

Then, just extract the feature columns of each data frame, make entries unique() if needed, c()oncatenate them together, and make a table(). You can very simply filter the table to keep entries that appear 4 times, or 3 or 2 times, and get the corresponding names() of the table: these are the features of interest. So finally you just have to select the corresponding rows in the original data frames.

Is there a step that doesn't work?

DC7 · December 14, 2020, 6:34am

Please give me some time and let you know. I am absolutely beginner in R.

DC7 · December 14, 2020, 6:58am

Hi @AlexisW, I have used this:

data_frame_1 <- data.frame(read.table("~/Documents/A.csv"))
data_frame_2 <- data.frame(read.table("~/Documents/B.csv"))
data_frame_3 <- data.frame(read.table("~/Documents/C.csv"))
data_frame_4 <- data.frame(read.table("~/Documents/D.csv"))
intersect <- Reduce(intersect, list(data_frame_1$V1, data_frame_2$V1, data_frame_3$V1, data_frame_4$V1))

And this reverts:

> Reduce(intersect, list(data_frame_1$V1, data_frame_2$V1, data_frame_3$V1, data_frame_4$V1))
 [1] "Streptococcaceae"           "Firmicutes_unclassified"    "Eubacteriaceae"             "Tannerellaceae"            
 [5] "Veillonellaceae"            "Erysipelotrichaceae"        "Clostridiaceae"             "Rikenellaceae"             
 [9] "Ruminococcaceae"            "Akkermansiaceae"            "Eggerthellaceae"            "Enterobacteriaceae"        
[13] "Oscillospiraceae"           "Desulfovibrionaceae"        "Lactobacillaceae"           "Clostridiales_unclassified"
[17] "Odoribacteraceae"           "Barnesiellaceae"            "Lachnospiraceae"            "Sutterellaceae"            
[21] "Acidaminococcaceae"         "Prevotellaceae"             "Selenomonadaceae"           "Coriobacteriaceae"         
[25] "Bacteroidaceae"             "Bifidobacteriaceae"

it means, it reverts only the common features in the first column. But I need the common features across the four tables those match with the third column as well.

AlexisW · December 14, 2020, 2:54pm

Oh, missed that. You can simply paste(V1, V3) together, do the filtering on these vectors, and at the end strsplit() it to recover columns V1 and V3.

DC7 · December 14, 2020, 3:13pm

Hi @AlexisW- I am trying a different approach to address the problem. Probably you may want to help me. I have extracted out the rows with "control" element from each dataframe. Then extracted out only the first column from it. Resulting into only a single column from each dataframe like:

> a_control
                           V1
23           Oxalobacteraceae
24          Staphylococcaceae
25       Propionibacteriaceae
26           Lactobacillaceae
27 Clostridiales_unclassified
28        Methanobacteriaceae
29           Odoribacteraceae
30           Peptoniphilaceae
31            Barnesiellaceae
32            Lachnospiraceae
33           Leuconostocaceae
34             Sutterellaceae
35         Acidaminococcaceae
36             Prevotellaceae
37           Selenomonadaceae
38          Coriobacteriaceae
39             Bacteroidaceae
40         Bifidobacteriaceae

I have 3 more dataframes like the one above. They are b_control, c_control, d_control. Now I want to make a single table from these three. That means, a table with four columns will be formed. Now, from this table I will like to get the elements which are duplicated 4 times, 3 times and 2 times in the dataframe. I think this will make the problem much easier.
So, can you tell me how can I proceed from this step i.e., how to make a single table from these 4 dataframes and find the elements with those frequencies?

Thanks a lot for your replies,
DC7

system · January 4, 2021, 3:13pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.