Help with understanding functions

AlexisW · July 12, 2023, 11:02pm

That question is harder to answer in general than when you know what the data looks like. So I'll cheat, since I've read your other question.

Let's start with 2.

apply() takes a matrix, or something that can be converted to one such as a data frame, and applies an operation to its rows or columns. Here we use apply(..., 2, ...), so the function is applied to the columns (1 would mean rows). For example:

X <- data.frame(x1 = 1:3,
                x2 = 4:6)
X
#>   x1 x2
#> 1  1  4
#> 2  2  5
#> 3  3  6

apply(X, 2, min)
#> x1 x2 
#>  1  4
apply(X, 2, max)
#> x1 x2 
#>  3  6

Created on 2023-07-12 with reprex v2.0.2

So apply() is a way to write a loop. In other words, apply(X, 2, max) means "take X, and for each column of X take the max".
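To make the "loop" reading concrete, here is a minimal sketch that writes the same computation as an explicit for loop, reusing the X from above. The name col_max is only illustrative.

```r
X <- data.frame(x1 = 1:3,
                x2 = 4:6)

# Explicit loop: visit each column in turn and store its maximum
col_max <- numeric(ncol(X))
names(col_max) <- colnames(X)
for (j in seq_len(ncol(X))) {
  col_max[j] <- max(X[[j]])
}

col_max           # same values as the apply() call below
apply(X, 2, max)
```

Both give 3 for x1 and 6 for x2; apply() just saves you the bookkeeping.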

Here we have:

apply(sdat[,-1], 2, e.function, seq=sdat[, 1])

That translates to "Take sdat[,-1], and for each column of sdat[,-1] call the function e.function()". But, as we'll see in a second, e.function() requires two parameters, x and seq. So x will be each column of sdat[,-1] in turn, but we also need to provide seq. We can pass it as the 4th argument of apply(): seq = sdat[,1], which means the first column of sdat, the one named Sequence. apply() forwards extra arguments like this one to the function unchanged.

So, what this does is, for each column of sdat except the first, pass that column as x and the first column as seq and apply e.function().
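A small sketch of that argument passing, using a made-up two-argument function (add_offset and its offset parameter are hypothetical, chosen only to mirror the x / seq pattern):

```r
X <- data.frame(x1 = 1:3,
                x2 = 4:6)

# Hypothetical helper: apply() fills in x (one column at a time),
# and we supply offset ourselves as an extra named argument
add_offset <- function(x, offset) x + offset

res <- apply(X, 2, add_offset, offset = c(10, 20, 30))
res   # column x1 becomes 11, 22, 33 and column x2 becomes 14, 25, 36
```

The same offset vector is reused for every column, just as the same seq is reused for every data column of sdat.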

Now let's go to 1. and the definition of e.function(). I should say that tapply() can be used in many ways, and can be very confusing. Here we have the simple case where both of its inputs are vectors (each one a single column of sdat).

tapply() takes argument X, a data vector, and INDEX, a grouping factor. It uses the grouping factor to "split" the data, and applies a function to each of the groups:

x <- 1:7
fac <- list(c("a","a","a","a","b","b","b"))

tapply(x, fac, min)
#> a b 
#> 1 5
tapply(x, fac, max)
#> a b 
#> 4 7
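If the "split, then apply" description is easier to follow as two separate steps, this sketch does the same thing with split() and sapply(). Here fac is written as a plain character vector rather than a list, and groups is just an illustrative name.

```r
x <- 1:7
fac <- c("a", "a", "a", "a", "b", "b", "b")

# Step 1: split the data into one vector per group
groups <- split(x, fac)   # a list with elements a = 1:4 and b = 5:7

# Step 2: apply the function to each group
by_hand <- sapply(groups, max)
by_hand                   # a = 4, b = 7, the same values as tapply(x, fac, max)
```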

Finally, let's put it back together:

e.function <- function(x, seq) tapply(x, seq, median)
temp <- apply(sdat[,-1], 2, e.function, seq=sdat[, 1])

What this does is take sdat and separate the first column, which has the sequences, from the other columns, which contain the data. Then, for each data column, it computes the median within each group of rows sharing the same sequence, that is, the median by peptide.
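Here is a runnable sketch on a toy data set. I'm assuming a layout with the Sequence column first; the values and the sample1 / sample2 column names are made up for illustration and are not your real data.

```r
# Made-up stand-in for sdat: first column is the sequence, the rest are data
sdat <- data.frame(
  Sequence = c("PEPA", "PEPA", "PEPA", "PEPB", "PEPB"),
  sample1  = c(1, 2, 9, 4, 6),
  sample2  = c(10, 30, 20, 5, 7)
)

e.function <- function(x, seq) tapply(x, seq, median)
temp <- apply(sdat[, -1], 2, e.function, seq = sdat[, 1])

temp
# One row per sequence, one column per data column:
#   PEPA: sample1 = 2, sample2 = 20
#   PEPB: sample1 = 5, sample2 = 6
```

So three PEPA rows and two PEPB rows collapse to one row each, holding the per-column medians.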