Loop around several data frame

Lina1 · September 29, 2022, 9:23am

Hello Community!

I have a problem with the analysis of multiple data frames. First of all, I have to say that I am quite new in the data analysis so my questions might be naiv or sometimes I do not know the right terms (sorry for that in advance).

So, I have a set of data frames. I want to analyse them and first made a script to transform one of them:

diffDataA <- read.delim("/da/dmp/cb/..._diff.txt")

diff1 <- diffDataA %>%
  dplyr::filter( included == "yes" ) %>%
  dplyr::mutate( compound = "145" )

I am able to perform my analysis with this data set (diff1) but now I want to compare it to other data sets. However, I have almost 100 data sets and now started using the script above for each of them:

diffDataB <- read.delim("/da/dmp/cb/..._diff.txt")
diff2 <- diffDataB %>%
  dplyr::filter( included == "yes" ) %>%
  dplyr::mutate( compound = "195" )

diffDataC <- read.delim("/da/dmp/cb/..._diff.txt")
diff3 <- diffDataC%>%
  dplyr::filter( included == "yes" ) %>%
  dplyr::mutate( compound = "76" )

diffDataD <- read.delim("/da/dmp/cb/..._diff.txt")
diff4 <- diffDataD %>%
  dplyr::filter( included == "yes" ) %>%
  dplyr::mutate( compound = "89" )

diffDataE <- read.delim("/da/dmp/cb/..._diff.txt")
diff5<- diffDataE %>%
  dplyr::filter( included == "yes" ) %>%
  dplyr::mutate( compound = "121" )

analysis <- rbind(diff1, diff2, diff3, diff4, diff5, ...)

But its a lot of typing and in future there will be even more data sets. I know that it somehow must be possible to loop around this data sets but I have no idea how to start. Can someone maybe give me a hind? I would be absolutely happy to learn.

Thanks for any help

nirgrahamuk · September 29, 2022, 10:00am

its not clear if you are obfuscating your names with ... and that they are in fact different files every time; or if you repeatedly readin the same file and take different filtered parts from it . You say that you have 100's of datasets; so I assume that the path inside read.delim would be actually different each time.

We must assume that the filter and mutate you perform on a given data set, are not arbitrary but are chosen due to some logic (that you have not explicated in your post), this would be necessary information to succesfully complete this task...

So the question is, how do you know given a file, what filter and mutate do you want to do ?
is it based off of information in the filename itself; or do you have the relevant info (metadata) in another file ?

FactOREO · September 29, 2022, 10:02am

Hello,

your questions are not naive in the slightest.

However, it would be benefical to know, why (and by what logic) you add a column compound to every data.frame, but with a different number, encoded as chr.

I think it would be better if you add a glimpse of your data to see the structure and explain what the logic behind your mutate() call is. You can add data as a reprex if you call dput(head(Data, 10)) in your R console (where Data is your data.frame, e.g. diffDataA) and then edit your post. The dput(...) result can be added inside a code chunk in the forum by writing

```r
your code here
```

If I understand the logic behind the mutate() call, I think I can help you. For now I could just give you a partial solution:

# create dummy data
test1 <- data.frame(A = seq.default(1,1000,1),
                    B = sample(1:10,1000,replace = TRUE),
                    included = sample(c('yes','no'), 1000,TRUE))
test2 <- data.frame(A = seq.default(1,1000,1),
                    B = sample(1:10,1000,replace = TRUE),
                    included = sample(c('yes','no'), 1000,TRUE,c(.7,.3)))
write.table(test1, file = 'test1.txt',sep = '\t')
write.table(test2, file = 'test2.txt',sep = '\t')
# list all files in directory
files <- list.files(pattern = 'txt$')
# write a function for the task
my_function <- function(dataset_name, filter_condition = 'yes'){
  read.delim(dataset_name) |>
    dplyr::filter(included == filter_condition)
  # no mutate since I do not understand your logic here
}
# apply the function to each element in files
result <- purrr::map(files, my_function)
str(result)
#> List of 2
#>  $ :'data.frame':    503 obs. of  3 variables:
#>   ..$ A       : int [1:503] 2 3 4 6 7 8 10 15 19 22 ...
#>   ..$ B       : int [1:503] 10 6 4 7 4 4 4 6 3 8 ...
#>   ..$ included: chr [1:503] "yes" "yes" "yes" "yes" ...
#>  $ :'data.frame':    703 obs. of  3 variables:
#>   ..$ A       : int [1:703] 1 2 3 5 7 8 9 10 11 12 ...
#>   ..$ B       : int [1:703] 7 5 8 6 1 6 6 2 7 9 ...
#>   ..$ included: chr [1:703] "yes" "yes" "yes" "yes" ...

^{Created on 2022-09-29 with reprex v2.0.2}

The result is a list of all results, which can be combined e.g. with data.table::rbindlist().

Kind regards

Lina1 · September 29, 2022, 10:52am

I am sorry I didn't add more information. I thought that is boring but of course it makes sense to add them, sorry.

Yes, it is always different data frames. They all have the same structure but are results from different measurements. The data frames contain genetic information and look like that:

 coorsID                     gene_id        gene_name    altsplice_pvalue altsplice_padj included
1      1:11981810-11981853:+ ENSG00000116688      MFN2     4.629304e-44   6.896232e-40      yes
2      1:11981854-11981969:+ ENSG00000116688      MFN2     9.342263e-21   3.827198e-17      yes
3      1:45341710-45341743:+ ENSG00000132773      TOE1     7.916213e-15   1.601479e-11      no
4      1:45341744-45341948:+ ENSG00000132773      TOE1     1.434851e-04   5.002624e-02      yes
5      1:46357576-46357659:+ ENSG00000117481     NSUN4     1.555532e-06   8.820026e-04      yes
6      1:53947862-53947981:+ ENSG00000116212    LRRC42     4.717038e-17   1.288270e-13      no
7      1:66585672-66585674:+ ENSG00000118473     SGIP1     6.906462e-07   4.319597e-04      yes
8      1:66585675-66585725:+ ENSG00000118473     SGIP1     4.016151e-09   3.576688e-06      yes

What I want to do: I want to have one data frame in the end that contain all the genetic information of all measurements. I only need the infos of the genes that has been included, so I did the filter(included=="yes") command. In the end I still want to know for each row where that data came from, so I created a new column that contains that info: mutate(compound="xy"). The compound info cannot be found in the metadata, only in the file name.
And then I just bound the resulting data frames together to one big one. I hope that makes sense.

FactOREO · September 29, 2022, 11:08am

This was an important extra bit of information. Please provide the structure of the filenames, because you can add a bit in the function which extracts the relevant number of your file name and adds it into the data.frame as a column. But since the structure could be anything, I could only guess

Lina1 · September 29, 2022, 11:30am

So sorry for again not providing all infos. I am so grateful for any help.

Ok, that's one example of filenames: AAC890_DMSO_vs_EJZ908-655nM_diff.txt
"AAC890" is the experiment number and "EJZ908" is the compound, 655mM the concentration. In my dreams I would have an extra column containing the compound info (compound and concentration).

I am really trying my best to make a loop function but I only know how to do that with a vector.

Thank you

FactOREO · September 29, 2022, 12:27pm

I decided to go with the small data sets I created beforehand, but this should not effect the solution for your real data. Just combine both strings with paste() if you wish to, but I thought it would be benefical if you keep them separated.

# get all .txt files from directory
files <- list.files(pattern = 'txt$')
files
#> [1] "AAC890_DMSO_vs_EJZ908-655nM_diff.txt"
#> [2] "AAC990_DMSO_vs_EJZ908-655nM_diff.txt"

my_function <- function(dataset_name, filter_condition = 'yes'){
  # extract compound (everything before the first underscore)
  compound <- stringr::str_extract(string = dataset_name, pattern = '^[^_]+')
  # extract concentration (everything in between a minus and an underscore)
  concentration <- gsub(".*-(.+)_.*", "\\1", dataset_name)
  
  read.delim(dataset_name) |>
    dplyr::filter(included == filter_condition) |> 
    dplyr::mutate(
      # as.character() to avoid conflicts of different formats
      compound = as.character(compound),
      concentration = concentration
    )
}

result <- purrr::map(files, my_function) |>
  data.table::rbindlist()
head(result)
#>    A  B included compound concentration
#> 1: 2 10      yes   AAC890         655nM
#> 2: 3  6      yes   AAC890         655nM
#> 3: 4  4      yes   AAC890         655nM
#> 4: 6  7      yes   AAC890         655nM
#> 5: 7  4      yes   AAC890         655nM
#> 6: 8  4      yes   AAC890         655nM

^{Created on 2022-09-29 with reprex v2.0.2}

Kind regards

Edit: If not all tables have the same column order, use data.table::rbindlist(use.names = TRUE). This makes sure that you only combine columns with the same name and the order is irrelevant. If there are columns that don't match for some reason, there is the option fill = TRUE to fill nonmatching columns with NA.

Lina1 · September 30, 2022, 8:09am

Wow. That makes so much sense. Thank you so much.

I have only one question left. What happens if I want to add a data set that is not found under the same direction?

FactOREO · September 30, 2022, 11:06am

Then you would need to split the workflow (do it seperate for all directories) and combine all the results afterwards. Or you copy the files into one directory.

If you wish another (automated) solution, you should open a new request. But I think just putting all files in one directory is the easiest solution.

Kind regards

system · October 7, 2022, 11:06am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.