Hi all,
I'm trying to learn how to do this in tidyverse. I have a function (TPM) which I want to apply to each row of a tibble. I also want to make sure that the sorting order of column 1 (gene) does not effect the result. I'm assuming across() may be involved, but I'm not seeing it. I also have tried a join first, but I'm just getting stuck. What is tripping me up is how to deal with the "gene" column i.e. apply the function to all columns of gene_counts except the gene column, while also ensuring that the correct gene length is applied to the correct gene count (can't always assume the sorting order of both tibbles will be the same)
The correct values in the results are:33
gene sample1 sample2
A 281690.14 486.618
B 70422.54 973.2363
C 84507.04 973.236
D 211267.61 24330.900
E 352112.68 973236.010
Thanks all.
Kenneth
# load dplyr
library(dplyr)
# Create TPM function
tpm <- function(counts, lengths) {
rate <- counts / lengths
rate / sum(rate) * 1e6
}
# gene lengths
gene_lengths <- tibble(
gene = c("A","B","C","D","E"),
length = c(100, 50, 25, 5, 1)
)
gene_lengths
# gene counts
gene_counts <- tibble(
gene = c("A","B","C","D","E"),
sample1 = c(80, 10, 6, 3, 1),
sample2 = c(20, 20, 10, 50, 400)
)
gene_counts
# join
left_join(gene_lengths, gene_counts)
EDIT #1:
Rather than deleting the above code, I have added improvements below. I am closer to what I want but it feels long-winded. I should have also pointed out that I am looking to save the results as a new tibble, not to edit the original gene_counts tibble. In the code below I use base-R apply, I would prefer to stay within tidyverse so it can still be improved.
Load dplyr and create TPM function as above and then....
# gene lengths
gene_lengths <- tibble(
gene = c("A", "C", "D", "E", "B"),
length = c(100, 25, 5, 1, 50)
) |> arrange(gene) # sort by gene
gene_lengths
# gene counts
gene_counts <- tibble(
gene = c("A", "B", "D", "E", "C"),
sample1 = c(80, 10, 3, 1, 6),
sample2 = c(20, 20, 50, 400, 10)
) |> arrange(gene) # sort by gene
gene_counts
# create tpm tibble
gene_tpm <- gene_counts |>
select(!gene) |>
apply(2, function(x) tpm(x, gene_lengths$length)) |>
as.data.frame() |>
tibble() |>
round() |>
mutate(gene = gene_lengths$gene) |>
relocate(gene)
gene_tpm
EDIT #2
Thanks everyone for your help. Apologies, for the waste of time, another thing I should have highlighted. Specifically naming the columns to select is not feasible as there are nearly 100 in the real tibble, and they have names like ileum_1, duodenum_234. Pattern matching the name isn't likely to work, so "not selecting gene column" is preferred to "naming all columns that should be selected".
Getting closer: Same as above except for last block:
# create tpm tibble
gene_counts |>
summarise(
across(.cols = !starts_with("g"), .fns = tpm(x, gene_lengths$length ))
)
x is a place holder, this is what is catching me out now.
SOLVED:
# create tpm tibble
gene_tpm <- gene_counts |>
mutate(
across(.cols = !starts_with("gene"),
.fns = function(col){tpm(col, gene_lengths$length)}
)
)
gene_tpm # the decimal places are not lost, just not printed
gene_tpm$sample1
You know, I only come here when I'm absolutely stuck, and you legends always get me out.
Thank you @dromano and @prubin for the correct function call, and switch to mutate.
Thanks all for your help
Kenneth