Applying a function to a tibble

bioinfguru · April 19, 2024, 2:10pm

Hi all,

I'm trying to learn how to do this in tidyverse. I have a function (TPM) which I want to apply to each row of a tibble. I also want to make sure that the sorting order of column 1 (gene) does not effect the result. I'm assuming across() may be involved, but I'm not seeing it. I also have tried a join first, but I'm just getting stuck. What is tripping me up is how to deal with the "gene" column i.e. apply the function to all columns of gene_counts except the gene column, while also ensuring that the correct gene length is applied to the correct gene count (can't always assume the sorting order of both tibbles will be the same)

The correct values in the results are:33

gene sample1 sample2
A 281690.14 486.618
B 70422.54 973.2363
C 84507.04 973.236
D 211267.61 24330.900
E 352112.68 973236.010

Thanks all.
Kenneth

# load dplyr
library(dplyr)

# Create TPM function
tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

# gene lengths
gene_lengths <- tibble(
  gene = c("A","B","C","D","E"),
  length = c(100, 50, 25, 5, 1)
)
gene_lengths

# gene counts
gene_counts <- tibble(
  gene = c("A","B","C","D","E"),
  sample1 = c(80, 10, 6, 3, 1),
  sample2 = c(20, 20, 10, 50, 400)
)
gene_counts

# join
left_join(gene_lengths, gene_counts)

EDIT #1:
Rather than deleting the above code, I have added improvements below. I am closer to what I want but it feels long-winded. I should have also pointed out that I am looking to save the results as a new tibble, not to edit the original gene_counts tibble. In the code below I use base-R apply, I would prefer to stay within tidyverse so it can still be improved.

Load dplyr and create TPM function as above and then....

# gene lengths
gene_lengths <- tibble(
  gene = c("A", "C", "D", "E", "B"),
  length = c(100, 25, 5, 1, 50)
) |> arrange(gene) # sort by gene
gene_lengths 

# gene counts
gene_counts <- tibble(
  gene = c("A", "B", "D", "E", "C"),
  sample1 = c(80, 10, 3, 1, 6),
  sample2 = c(20, 20, 50, 400, 10)
) |> arrange(gene) # sort by gene
gene_counts 

# create tpm tibble
gene_tpm <- gene_counts |>
  select(!gene) |>
  apply(2, function(x) tpm(x, gene_lengths$length)) |>
  as.data.frame() |>
  tibble() |>
  round() |>
  mutate(gene = gene_lengths$gene) |>
  relocate(gene)
gene_tpm

EDIT #2
Thanks everyone for your help. Apologies, for the waste of time, another thing I should have highlighted. Specifically naming the columns to select is not feasible as there are nearly 100 in the real tibble, and they have names like ileum_1, duodenum_234. Pattern matching the name isn't likely to work, so "not selecting gene column" is preferred to "naming all columns that should be selected".

Getting closer: Same as above except for last block:

# create tpm tibble
gene_counts |>
  summarise(
  across(.cols = !starts_with("g"), .fns = tpm(x, gene_lengths$length ))
)

x is a place holder, this is what is catching me out now.

SOLVED:

# create tpm tibble
gene_tpm <- gene_counts |>
  mutate(
    across(.cols = !starts_with("gene"), 
           .fns = function(col){tpm(col, gene_lengths$length)}
      )
    )
gene_tpm # the decimal places are not lost, just not printed
gene_tpm$sample1

You know, I only come here when I'm absolutely stuck, and you legends always get me out.
Thank you @dromano and @prubin for the correct function call, and switch to mutate.

Thanks all for your help
Kenneth

prubin · April 19, 2024, 4:39pm

This should work. The join operation will ensure that the counts and lengths are matched by gene, regardless of the order in the second tibble. Your target values are in columns labeled "s1" and "s2".

merged <- left_join(gene_lengths, gene_counts) |>
  mutate(s1 = tpm(sample1, length), s2 = tpm(sample2, length))

bioinfguru · April 19, 2024, 5:19pm

Thank you. I was about to edit the original question when I just saw your post. I have been able to get close to the result I am looking for but it feels a bit long winded and I have used base-r apply when I would prefere to use tidyverse. I should have added also, that I want the result to be in a new tibble. See Edit #1.

dromano · April 19, 2024, 5:37pm

Here is an alternative tidyverse translation of @prubin's solution:

create function 'tpm' and tables 'gene_length' and 'gene_counts' (click to access)

# load dplyr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Create TPM function
tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

# gene lengths
gene_lengths <- tibble(
  gene = c("A","B","C","D","E"),
  length = c(100, 50, 25, 5, 1)
)
gene_lengths
#> # A tibble: 5 × 2
#>   gene  length
#>   <chr>  <dbl>
#> 1 A        100
#> 2 B         50
#> 3 C         25
#> 4 D          5
#> 5 E          1

# gene counts
gene_counts <- tibble(
  gene = c("A","B","C","D","E"),
  sample1 = c(80, 10, 6, 3, 1),
  sample2 = c(20, 20, 10, 50, 400)
)
gene_counts
#> # A tibble: 5 × 3
#>   gene  sample1 sample2
#>   <chr>   <dbl>   <dbl>
#> 1 A          80      20
#> 2 B          10      20
#> 3 C           6      10
#> 4 D           3      50
#> 5 E           1     400

gene_lengths |> 
  left_join(gene_counts) |> 
  mutate(
    across(
      contains('sample'),
      function(col){
        tpm(col, length)  
      }
    )
  )
#> Joining with `by = join_by(gene)`
#> # A tibble: 5 × 4
#>   gene  length sample1 sample2
#>   <chr>  <dbl>   <dbl>   <dbl>
#> 1 A        100 281690.    487.
#> 2 B         50  70423.    973.
#> 3 C         25  84507.    973.
#> 4 D          5 211268.  24331.
#> 5 E          1 352113. 973236.

^{Created on 2024-04-19 with reprex v2.0.2}

prubin · April 19, 2024, 5:53pm

Given the edit, you might prefer the following.

gene_tpm <- left_join(gene_lengths, gene_counts) |>
               mutate(sample1 = tpm(sample1, length) |> round(), 
                               sample2 = tpm(sample2, length) |> round()) |>
               select(gene, sample1, sample2)

dromano · April 19, 2024, 6:11pm

The function tpm() produces an output vector of the same length as its input vectors, so you would want to use mutate() instead of summarise(), and the .fns argument should be a function, rather that the evaluation of a function. You could try this, instead:

.fns =  function(col) {tpm(col, gene_lengths$length)}

system · April 26, 2024, 6:11pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.