Help with understanding functions

ctheisen · July 12, 2023, 9:47pm

I'm trying to figure out what the two following lines of code are doing. I'm going to guess at what they are doing, but please correct me, as I am getting NA values in my data where I don't expect to.

This function takes a cell in the dataframe (x), then finds the median of a given sequence (seq)?

e.function <- function(x, seq) tapply(x, seq, median)

The following takes the dataframe sdat[,1] (so removes the first column of dataframe sdat), and it goes down the column and applies e.function (I'm not sure what the x and seq are for this function), then returns a value of sdat from the first column? Sorry, this one is really confusing me.

  temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])

If anyone can explain to me what is going on exactly for these two functions, that would be really helpful. I'm not great at writing and using functions.

AlexisW · July 12, 2023, 11:02pm

It's harder to answer that question in general than knowing what your data looks like. So I'll cheat as I've read your other question.

Let's start with 2.

apply() takes a data frame (or similar), and applies an operation on its rows or columns. Here we use apply(..., 2, ...) so we apply the function on its columns. For example:

X <- data.frame(x1 = 1:3,
                x2 = 4:6)
X
#>   x1 x2
#> 1  1  4
#> 2  2  5
#> 3  3  6

apply(X, 2, min)
#> x1 x2 
#>  1  4
apply(X, 2, max)
#> x1 x2 
#>  3  6

^{Created on 2023-07-12 with reprex v2.0.2}

So an apply() is a way to make a loop. In other words, apply(X, 2, max) means "take X, and for each column of X take the max".

Here we have:

apply(sdat[,-1], 2, e.function, seq=sdat[, 1])

That can be translated in "Take sdat[,-1], and for each column of sdat[,-1] take the function e.function()". But, as we'll see in a second, e.function() requires two parameters, x and seq. So, x will be each column of sdat[,-1], but we also need to provide seq. We can give it as the 4th argument: seq = sdat[,1], that means the first column of sdat, which is Sequence.

So, what this does is, for each column of sdat except the first, pass that column as x and the first column as seq and apply e.function().

Now let's go to 1. and the definition of e.function(). I should say tapply() can be used in many ways, and can be very confusing. Here, we have a single case where both of its inputs are a vector (a single column of sdat).

tapply() takes argument X, a data vector, and INDEX, a grouping factor. It uses the grouping factor to "split" the data, and applies a function to each of the groups:

x <- 1:7
fac <- list(c("a","a","a","a","b","b","b"))

tapply(x, fac, min)
#> a b 
#> 1 5
tapply(x, fac, max)
#> a b 
#> 4 7

Finally, let's put it back together:

e.function <- function(x, seq) tapply(x, seq, median)
temp <- apply(sdat[,-1], 2, e.function, seq=sdat[, 1])

What this does is take sdat, and separate the first column which has protein sequences from the other columns which contain data. Then, for each data column, it takes the median by peptide.

ctheisen · July 13, 2023, 2:18pm

So it is taking the median down the column of all values that have the same sdat[,1] value? So like it selects the protein group, then finds the median for each column in that group and puts that into the new dataframe temp?

Also, I appreaciate your help on both of these questions. I'm still a bit of a novice at coding, but am trying to get better. I appreciate your patience.

AlexisW · July 13, 2023, 2:43pm

No problem, we've all been there! To be honest, this is not the clearest code there is, it feels that was written by someone used to R who only wanted to get things done quickly, not to make them easy to read for novices.

Yes, but note that they never use the values in temp, so it actually doesn't matter. Here is a commented version:

dat <- data.frame(Protein.Group.Accessions = c("A","A","A","A",
                                               "B","B","B","B"),
                  Sequence = c("AAAESMIPTTTGAAK","AAAESMIPTTTGAAK",
                               "AAAVSNIIR","AAAVSNIIR",
                               "AAAESMIPTTTGAAK","AAAESMIPTTTGAAK",
                               "AAAVSNIIR","AAAVSNIIR"),
                  col1 = 1:8,
                  col2 = 20:13)
cha <- c("col1", "col2")
e.function <- function(x, seq) tapply(x, seq, median)

dat$Sequence <- toupper(dat$Sequence)
accessions <- as.character(unique(dat$Protein.Group.Accessions))
n.proteins <- length(accessions)
n.cha <- length(cha)

output <- NULL

# for each Accession
for(k in 1:n.proteins){
  id <- accessions[k]
  # keep only the rows for the first Accession, keep only the Sequence and data columns
  sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
  # log2 of all data
  sdat[cha] <- log2(sdat[cha])
  # subtract the median of each row
  sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
  # only the data columns
  pdat <- sdat[, -1]
  # nb spectra is the number of rows
  n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
  # take median by Sequence, note, this is saved in `temp`, but temp is NOT used below
  temp <- apply(sdat[,-1], 2, e.function, seq=sdat[, 1])  
  # n.peptides is the number of different Sequences
  n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)    
  # take the median of each column
  if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
  # keep the data columns, add columns for nb peptides and spectra
  pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
  # for the first protein, `output` is NULL, then each subsequent protein gets added as a new row
  output <- rbind(output, pdat)
}

All the ifelse(is.integer(dim(...)) are to count the number of rows handling the fact it could be 1.

system · July 20, 2023, 2:43pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.