I am currently working with several hundred files that I have stored in a single folder. These files, which are in ".tsv" format, all containing the same metabolic pathway information. Each of these files has the same number of columns (a total of 11).
The main idea is that these files contain the molecules involved in each pathway, which I need to encode for their representation. This encoding would be the combination of: the name of the file (containing the name of the metabolic pathway) + a sequential number for each unique term in each pathway.
An example of how the content of the files would look like would be the following:
|filenames |
|:-----------------------------------------------------|
|PA145011109-Atorvastatin_Pathway_Pharmacokinetics.tsv |
|PA145011110-Pravastatin_Pathway_Pharmacokinetics.tsv |
|PA145011111-Fluvastatin_Pathway_Pharmacokinetics.tsv |
|... |
That means that from the above example, in the case of "Atorvastatin" I would be left with PA145011109 + "-" + sequential numbers (e.g. 01, 02, 03....).
My intention would be to: 1) to be able to automate the process of reading all the the files into different data frames, 2) while adding to each data frame a new columns the corresponding file name, 3) to extract unique values to be able to create a dictionary of terms for the encoding.
In my mind the process would be as follows:
-
Loop to import all the files:
##### Import multiple files all at once #Read files named PA___.tsv filenames <- list.files(path = table_pgkb, pattern="PA+.*tsv") #Create list of data frame names without the ".tsv" part fnames <-str_remove(filenames, pattern = "\\.tsv$") #Create a list of the codes for every path using rebus pattern = "PA" %R% one_or_more(DGT) pathnames <- str_extract(fnames, pattern) #Load all files and unify format for(i in fnames){ filepath <- file.path(table_pgkb,paste(i,".tsv",sep="")) assign(i, read.delim(filepath, colClasses=c(rep("character",4), rep("NULL", 7)), sep = "\t")) }
After this, every data frame I get look like this (they all have the same exact format):
From | To | Genes |
---|---|---|
atorvastatin lactone | 2-hydroxyatorvastatin lactone | CYP3A4, CYP3A5 |
atorvastatin lactone | 4-hydroxyatorvastatin lactone | CYP3A4. |
atorvastatin. | 2-hydroxyatorvastatin. | CYP2C8. |
... |
- The idea is to include the file name to all the data frames, so each one looks like this after the loop for read.delim:
filename | From | To | Genes |
---|---|---|---|
PA145011109 | atorvastatin lactone | 2-hydroxyatorvastatin lactone | CYP3A4, CYP3A5 |
PA145011109 | atorvastatin lactone | 4-hydroxyatorvastatin lactone | CYP3A4. |
PA145011109 | atorvastatin | 2-hydroxyatorvastatin | CYP2C8, CYP3A4 |
... |
-
Then I extract unique values:
##### Extract unique values from each data frame PA145011109 <- unique(c(PA145011109$From, PA145011109$To)) %>% sort() %>% as.data.frame()
My final aim is to get a dictionary of terms, as follows:
coded.mol | molecules. |
---|---|
PA145011109-01 | atorvastatin lactone |
PA145011109-02 | 2-hydroxyatorvastatin lactone |
PA145011109-03 | 4-hydroxyatorvastatin lactone |
PA145011109-04 | atorvastatin |
PA145011109-05 | 2-hydroxyatorvastatin |
... |
I have never done anything like this with multiple files, so I may be making the analysis much more complicated than it should be.
I have tried the alternatives proposed here (r - Insert a column with file name - Stack Overflow), here (R - Adding filepath to each row of a dataframe - Stack Overflow), and here (R - Adding filepath to each row of a dataframe - Stack Overflow) combined with (r - Read multiple csv data and create new columns at one time - Stack Overflow), (read.table - Read multiple files, create a data frame and add a new column containing the name of each file in R - Stack Overflow), and (read.table - Read multiple files, create a data frame and add a new column containing the name of each file in R - Stack Overflow), but don´t get to make it work inside the loop.
Also have tried (r - Create a new data frame that will act as a dictionary with key and value pairs - Stack Overflow), (R: Transform dataframe column using dictionary/list? - Stack Overflow), and (How to use a dictionary for a large data frame in R? - Stack Overflow) to create the dictionary.
I think it might have more to do with my approach to the problem. I don't know what your thoughts of the whole process are.
Any suggestion would be appreciate. Thank you all for your time.