I have multiple tables with tab separated format (four attached) each with five columns. The first column is the features , the third is the sample_type (although not mentioned in the tables).
TABLE: a.csv
Streptococcaceae
4.17705641901
control
3.47827812404
0.337904019103
Firmicutes_unclassified
4.43447010448
control
3.9217759446
0.0348055639329
Actinomycetaceae
2.51369314622
control
2.16004135478
0.0243610387108
Eubacteriaceae
4.80377020941
control
4.27387393831
0.0845334692571
Tannerellaceae
4.14816223344
control
3.275816707
0.949059680886
Oxalobacteraceae
2.38299415571
test
2.0825523478
0.65669077294
Staphylococcaceae
1.7758703522
test
3.29562599424
0.317310507863
Propionibacteriaceae
1.491161512
test
2.37370336438
0.916511907864
Lactobacillaceae
3.24263012709
test
2.65310263606
0.654720846019
Clostridiales_unclassified
3.33143883197
test
2.68073144255
0.22479909925
TABLE b.csv
Streptococcaceae
4.14574536955
control
3.75857890864
0.0412500165939
Firmicutes_unclassified
4.54285192263
control
3.6456600168
0.545349668011
Eubacteriaceae
4.72053565932
control
4.06108447754
0.449691797969
Pseudomonadaceae
2.47280035866
control
2.20503368624
0.146793087309
Erysipelotrichaceae
3.71662926551
control
3.05668482543
0.939742989577
Victivallaceae
0.527629857442
control
2.74539202586
0.317310507863
Tannerellaceae
4.50079062345
test
3.58357798914
0.820595839755
Veillonellaceae
3.87333100809
test
3.19818181055
0.289192119229
Rikenellaceae
4.9250780149
test
4.35390701567
0.545349668011
Lactobacillaceae
3.0789422455
test
2.71335701105
0.412191111952
Helicobacteraceae
2.96591152623
test
2.78958989566
0.317310507863
TABLE: c.csv
Firmicutes_unclassified
4.3949137452
control
3.80320556213
0.202318749895
Eubacteriaceae
4.67973931478
control
3.77752868209
0.914742321683
Tannerellaceae
4.24480153834
control
3.57432876491
0.0575043321308
Veillonellaceae
4.31037823286
control
3.77586730047
0.488677469577
Burkholderiaceae
0
control
2.38754970657
0.356028985238
Streptococcaceae
3.70445898562
test
2.99074124872
0.976675416504
Proteobacteria_unclassified
3.54646811621
test
3.11297394996
0.0769750871296
Erysipelotrichaceae
4.0459816119
test
3.44639785866
0.526971732416
Sutterellaceae
3.52589117576
test
2.94530632209
0.173128322568
Acidaminococcaceae
4.43174037412
test
3.88725876468
0.174392135504
TABLE: d.csv
Firmicutes_unclassified
4.48268161942
control
4.08546779118
0.0378086594704
Sutterellaceae
3.72334956471
control
3.17665465678
0.369158808941
Proteobacteria_unclassified
3.41146437028
control
3.03389982079
0.195865532789
Tannerellaceae
4.51120731157
control
3.82196492147
0.143867728482
Streptococcaceae
3.14327432538
test
2.56433651799
0.89796929457
Eubacteriaceae
4.80334041206
test
4.25388289196
0.488744120387
Veillonellaceae
4.16830871806
test
3.64943977262
0.411122991069
Pseudomonadaceae
0.591920446836
test
2.86633204137
0.317310507863
Caulobacteraceae
0.848425939934
test
2.67734038571
0.317310507863
Now, I want some script for the following tasks:
I want to find out the features with the same sample_type across all the four files and write them in an output file with their name and sample_type.
I will also like to get the features with the same sample_type among three and two out of four files and write them separately in the same file.
It would be great if the corresponding file name of the common features are mentioned.
Why bash? In R, it should be possible by reading the table, then using intersect() and table() to find the features of interest, and subsetting the tables accordingly. In bash, you might be able to get that with some awk magic, but that seems exceedingly complicated, R is just better suited for this. Also, this is an R forum...
Then, just extract the feature columns of each data frame, make entries unique() if needed, c()oncatenate them together, and make a table(). You can very simply filter the table to keep entries that appear 4 times, or 3 or 2 times, and get the corresponding names() of the table: these are the features of interest. So finally you just have to select the corresponding rows in the original data frames.
it means, it reverts only the common features in the first column. But I need the common features across the four tables those match with the third column as well.
Hi @AlexisW- I am trying a different approach to address the problem. Probably you may want to help me. I have extracted out the rows with "control" element from each dataframe. Then extracted out only the first column from it. Resulting into only a single column from each dataframe like:
I have 3 more dataframes like the one above. They are b_control, c_control, d_control. Now I want to make a single table from these three. That means, a table with four columns will be formed. Now, from this table I will like to get the elements which are duplicated 4 times, 3 times and 2 times in the dataframe. I think this will make the problem much easier.
So, can you tell me how can I proceed from this step i.e., how to make a single table from these 4 dataframes and find the elements with those frequencies?