Help with BiocManager

ctheisen · July 12, 2023, 7:52pm

So I'm trying to use the limma and qvalue libraries as explained at the following url: https://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/R_guide.html

I've already figured out I need to install the packages differently from running:

source("http://bioconductor.org/biocLite.R")
biocLite()

and instead ran:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("limma")
BiocManager::install("qvalue")

And this seem to work fine. But now I am getting to the step:

source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")

# I ran the following and received the output

BiocManager::install(c("read.peptides","quantify.proteins","eb.fit"))
'getOption("repos")' replaces Bioconductor
standard repositories, see 'help("repositories",
package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://ftp.osuosl.org/pub/cran/
Bioconductor version 3.17 (BiocManager 1.30.21),
  R 4.3.1 (2023-06-16 ucrt)
Installing package(s) 'read.peptides',
  'quantify.proteins', 'eb.fit'
Warning message:
packages ‘read.peptides’, ‘quantify.proteins’, ‘eb.fit’ are not available for Bioconductor version '3.17'

Versions of these packages for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

Any help on how to get these functions loaded?

AlexisW · July 12, 2023, 8:06pm

ctheisen:

source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")

These are direct links to scripts hosted on their website. The function source() can directly read these scripts. So you don't need BiocManager for that, just run these three source() commands, and the functions should be available for your use, e.g. with

dat <- read.peptides(dat, cha)

ctheisen · July 12, 2023, 8:09pm

I did figure that out right after posting this, but now I'm having issues running the read.peptides function, as shown below:

dat<-read.peptides(dat,cha)
Error in `[.data.frame`(dat, cha) : undefined columns selected
In addition: Warning message:
In read.peptides(dat, cha) : NAs introduced by coercion

I double checked that all my previous columns are identical to that used in their example, and that they match the column names in the function:

read.peptides <- function(dat, cha){
  output <- NULL
  
  dat$Sequence <- as.character(dat$Sequence)
  dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
  dat$Quan.Usage <- as.character(dat$Quan.Usage)
  dat$Quan.Info <- as.character(dat$Quan.Info)
  dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
  
  dat <- subset(dat, Isolation.Interference<=30)  
  dat <- subset(dat, Quan.Usage=="Used")
  dat <- subset(dat, Protein.Group.Accessions!="")
  dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))  
}

So I am confused what it means when it says there are undefined columns

AlexisW · July 12, 2023, 8:23pm

Is it with their data or your own? I don't encounter this error with their example data

ctheisen · July 12, 2023, 8:35pm

I didn't try with theirs, but I did download their data and made sure I had the same column names in the correct order. Except in place of their sample names ("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")

I replaced these with my own sample names:

> dput(head(dat))
structure(list(Protein.Group.Accessions = c("A0A7U9GJV6", "A0A7U9GJV6", 
"A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6"), Protein.Descriptions = c("", 
"", "", "", "", ""), Sequence = c("AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", 
"AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK"
), Quan.Usage = c("Used", "Used", "Used", "Used", "Used", "Used"
), Quan.Info = c("Unique", "Unique", "Unique", "Unique", "Unique", 
"Unique"), Isolation.Interference = c("0.202341629", "0.418291109", 
"0.01009192", "0.007835206", "0.098607589", "0.044367289"), pH8_2000_13_a = c(44019, 
0, 136990, 316930, 77580, 60480), pH8_2000_13hr_b = c(17831, 
0, 71621, 87466, 43244, 22568), pH8_2000_13_c = c(13292, 0, 48171, 
49619, 28530, 14063), pH8_6_13_a = c(39319, 0, 121430, 152730, 
72029, 70634), pH8_6_13hr_b = c(38170, 0, 128590, 220140, 70582, 
63551), pH8_6_13_c = c(38196, 0, 126190, 280720, 77592, 57076
), pH8_0_13_a = c(36996, 0, 111680, 266300, 66938, 65370), pH8_0_13hr_b = c(21855, 
0, 67175, 120250, 40979, 32068), pH8_0_13_c = c(17172, 0, 63511, 
117330, 49581, 21080), pH7_2000_13_a = c(39033, 0, 131990, 172970, 
80607, 82387), pH7_2000_13hr_b = c(27043, 0, 85080, 162510, 49483, 
49123), pH7_2000_13_c = c(34940, 0, 115260, 244700, 65388, 51025
), pH7_6_13_a = c(20498, 0, 70290, 103430, 39587, 26608), pH7_6_13hr_b = c(14491, 
0, 55730, 76646, 32676, 14832), pH7_6_13_c = c(18158, 0, 59825, 
89469, 36735, 27595), pH7_0_13_a = c(36396, 0, 116410, 152450, 
67804, 118380), pH7_0_13hr_b = c(33008, 0, 115500, 207020, 72604, 
50188), pH7_0_13_c = c(39070, 0, 124100, 241630, 69292, 57035
)), row.names = c(NA, 6L), class = "data.frame")

ctheisen · July 12, 2023, 8:44pm

Nevermind, I was able to fix it. Thanks

ctheisen · July 12, 2023, 9:12pm

Actually, I take that back. Now as I am working through the functions, I run the quantify.peptides(dat,cha) function and it turns all my values into NA values, so then when I run the boxplot command there are no values.

> dat <- read.csv("E:\\2023LDRD proteomics\\LDRD2023_proteomics_set1\\txt_setT1\\peptide_format_for_R3.csv") # load data
> View(dat)
> dim(dat) # dimension of dataframe
[1] 15339    24
> str(dat) # structure of dataframe
'data.frame':	15339 obs. of  24 variables:
 $ Protein.Group.Accessions: chr  "A0A7U9GJV6" "A0A7U9GJV6" "A0A7U9GJV6" "A0A7U9GJV6" ...
 $ Protein.Description     : chr  "" "" "" "" ...
 $ Sequence                : chr  "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" ...
 $ Quan.Usage              : chr  "Used" "Used" "Used" "Used" ...
 $ Quan.Info               : chr  "Unique" "Unique" "Unique" "Unique" ...
 $ Isolation.Interference  : chr  "0.202341629" "0.418291109" "0.01009192" "0.007835206" ...
 $ pH8_2000_13_a           : num  44019 0 136990 316930 77580 ...
 $ pH8_2000_13hr_b         : num  17831 0 71621 87466 43244 ...
 $ pH8_2000_13_c           : num  13292 0 48171 49619 28530 ...
 $ pH8_6_13_a              : num  39319 0 121430 152730 72029 ...
 $ pH8_6_13hr_b            : num  38170 0 128590 220140 70582 ...
 $ pH8_6_13_c              : num  38196 0 126190 280720 77592 ...
 $ pH8_0_13_a              : num  36996 0 111680 266300 66938 ...
 $ pH8_0_13hr_b            : num  21855 0 67175 120250 40979 ...
 $ pH8_0_13_c              : num  17172 0 63511 117330 49581 ...
 $ pH7_2000_13_a           : num  39033 0 131990 172970 80607 ...
 $ pH7_2000_13hr_b         : num  27043 0 85080 162510 49483 ...
 $ pH7_2000_13_c           : num  34940 0 115260 244700 65388 ...
 $ pH7_6_13_a              : num  20498 0 70290 103430 39587 ...
 $ pH7_6_13hr_b            : num  14491 0 55730 76646 32676 ...
 $ pH7_6_13_c              : num  18158 0 59825 89469 36735 ...
 $ pH7_0_13_a              : num  36396 0 116410 152450 67804 ...
 $ pH7_0_13hr_b            : num  33008 0 115500 207020 72604 ...
 $ pH7_0_13_c              : num  39070 0 124100 241630 69292 ...
> cha<-c('pH8_2000_13_a','pH8_2000_13hr_b','pH8_2000_13_c','pH8_6_13_a',
+        'pH8_6_13hr_b','pH8_6_13_c','pH8_0_13_a','pH8_0_13hr_b',
+        'pH8_0_13_c','pH7_2000_13_a',
+        'pH7_2000_13hr_b','pH7_2000_13_c','pH7_6_13_a','pH7_6_13hr_b',
+        'pH7_6_13_c','pH7_0_13_a','pH7_0_13hr_b','pH7_0_13_c') # defines vector of headlines of all channels as they are store in data set
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r
> read.peptides <- function(dat, cha){ # excludes rows (spectra) with missing values, higher isolation interference than 30% and that are redundant or not used
+   output <- NULL
+   
+   dat$Sequence <- as.character(dat$Sequence)
+   dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
+   dat$Quan.Usage <- as.character(dat$Quan.Usage)
+   dat$Quan.Info <- as.character(dat$Quan.Info)
+   dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
+   
+   dat <- subset(dat, Isolation.Interference<=30)  
+   dat <- subset(dat, Quan.Usage=="Used")
+   dat <- subset(dat, Protein.Group.Accessions!="")
+   dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))  
+ }
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r
> quantify.proteins <- function(dat, cha){ # quantifies all proteins that are represented by the peptide spectra. each protein in the channel are log2 transformed from the median value of all peptides belonging to each protein
+   e.function <- function(x, seq) tapply(x, seq, median)
+   output <- NULL
+   
+   dat$Sequence <- toupper(dat$Sequence)
+   accessions <- as.character(unique(dat$Protein.Group.Accessions))
+   n.proteins <- length(accessions)
+   n.cha <- length(cha)
+   
+   for(k in 1:n.proteins){
+     id <- accessions[k]
+     sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
+     sdat[cha] <- log2(sdat[cha])
+     sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
+     pdat <- sdat[, -1]
+     n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
+     temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])          
+     n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)    
+     if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
+     pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
+     output <- rbind(output, pdat)
+   }
+   output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
+   output[,1:n.cha] <- sweep(output[,1:n.cha],2,apply(output[,1:n.cha],2,median))
+   output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
+   output[,1:n.cha] <- round(output[,1:n.cha],3)
+   row.names(output) <- accessions
+   output <- as.data.frame(output)
+   
+   return(output)
+ }
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r
> eb.fit <- function(dat, design){ # eb = empirical Bayes; performs stats analysis of interest - 2 sample t-tests. Out put is a dataframe that contains log2 for each protein and its ordinary as well as moderated p- and q-values. Results are sorted by p-values in increasing order
+   n <- dim(dat)[1]
+   fit <- lmFit(dat, design)
+   fit.eb <- eBayes(fit)
+   log2FC <- fit.eb$coefficients[, 2]
+   df.r <- fit.eb$df.residual
+   df.0 <- rep(fit.eb$df.prior, n)
+   s2.0 <- rep(fit.eb$s2.prior, n)
+   s2 <- (fit.eb$sigma)^2
+   s2.post <- fit.eb$s2.post
+   t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
+   t.mod <- fit.eb$t[, 2]
+   p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
+   p.mod <- fit.eb$p.value[, 2]
+   q.ord <- qvalue(p.ord)$q
+   q.mod <- qvalue(p.mod)$q
+   results.eb <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
+   results.eb <- results.eb[order(results.eb$p.mod), ]
+   return(results.eb)
+ }
> dat<-read.peptides(dat,cha) # reduces peptide number
Warning message:
In read.peptides(dat, cha) : NAs introduced by coercion
> dim(dat)
[1] 12817    24
> str(dat)
'data.frame':	12817 obs. of  24 variables:
 $ Protein.Group.Accessions: chr  "A0A7U9GJV6" "A0A7U9GJV6" "A0A7U9GJV6" "A0A7U9GJV6" ...
 $ Protein.Description     : chr  "" "" "" "" ...
 $ Sequence                : chr  "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" "AAAESMIPTTTGAAK" ...
 $ Quan.Usage              : chr  "Used" "Used" "Used" "Used" ...
 $ Quan.Info               : chr  "Unique" "Unique" "Unique" "Unique" ...
 $ Isolation.Interference  : num  0.20234 0.41829 0.01009 0.00784 0.09861 ...
 $ pH8_2000_13_a           : num  44019 0 136990 316930 77580 ...
 $ pH8_2000_13hr_b         : num  17831 0 71621 87466 43244 ...
 $ pH8_2000_13_c           : num  13292 0 48171 49619 28530 ...
 $ pH8_6_13_a              : num  39319 0 121430 152730 72029 ...
 $ pH8_6_13hr_b            : num  38170 0 128590 220140 70582 ...
 $ pH8_6_13_c              : num  38196 0 126190 280720 77592 ...
 $ pH8_0_13_a              : num  36996 0 111680 266300 66938 ...
 $ pH8_0_13hr_b            : num  21855 0 67175 120250 40979 ...
 $ pH8_0_13_c              : num  17172 0 63511 117330 49581 ...
 $ pH7_2000_13_a           : num  39033 0 131990 172970 80607 ...
 $ pH7_2000_13hr_b         : num  27043 0 85080 162510 49483 ...
 $ pH7_2000_13_c           : num  34940 0 115260 244700 65388 ...
 $ pH7_6_13_a              : num  20498 0 70290 103430 39587 ...
 $ pH7_6_13hr_b            : num  14491 0 55730 76646 32676 ...
 $ pH7_6_13_c              : num  18158 0 59825 89469 36735 ...
 $ pH7_0_13_a              : num  36396 0 116410 152450 67804 ...
 $ pH7_0_13hr_b            : num  33008 0 115500 207020 72604 ...
 $ pH7_0_13_c              : num  39070 0 124100 241630 69292 ...
> dat<-quantify.proteins(dat,cha) # identifies proteins from peptides, can take a few minutes, all rows with missing values will create a warning message, you can ignore this
> dim(dat)
[1] 1207   20
> str(dat)
'data.frame':	1207 obs. of  20 variables:
 $ pH8_2000_13_a  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_2000_13hr_b: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_2000_13_c  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_6_13_a     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_6_13hr_b   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_6_13_c     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_0_13_a     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_0_13hr_b   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH8_0_13_c     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_2000_13_a  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_2000_13hr_b: num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_2000_13_c  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_6_13_a     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_6_13hr_b   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_6_13_c     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_0_13_a     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_0_13hr_b   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ pH7_0_13_c     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ n.peptides     : num  27 4 9 6 7 5 13 6 7 16 ...
 $ n.spectra      : num  119 6 10 14 13 9 33 8 37 56 ...

AlexisW · July 12, 2023, 9:58pm

With your partial file, I can't run quantify.proteins() (I think because there is a single accession number and the function was written assuming you have more than 1 protein). One potential explanation: the second row has only 0, so when taking the log2() it becomes all -Inf, when subtracting the median it's all NaN, and that might propagate.

So, after read.peptides() but before quantify.proteins(), you'd want to filter the rows:

dat <- dat[rowSums(dat[,cha]) > 0, ]

Also, in your full dataset (I don't get that with the partial version), you have this warning:

Warning message:
In read.peptides(dat, cha) : NAs introduced by coercion

So that means after read.peptides() you will also have a few NAs sprinkled in your data, I'm not sure why (you need to look at the original data point), and I'm not sure it's a problem.

ctheisen · July 12, 2023, 10:28pm

Sorry about that, I had fixed the data and forgot to put in the new version, the ones below have at least a couple accession numbers.

dput(head(dat,14))
structure(list(Protein.Group.Accessions = c("A0A7U9GJV6", "A0A7U9GJV6", 
"A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", 
"A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", "A0A7U9GJV6", 
"A0A7U9C895", "A0A7U9C746"), Protein.Description = c("", "", 
"", "", "", "", "", "", "", "", "", "", "", ""), Sequence = c("AAAESMIPTTTGAAK", 
"AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", 
"AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", 
"AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAESMIPTTTGAAK", "AAAVSNIIR", 
"AADKLEAIVLDR"), Quan.Usage = c("Used", "Used", "Used", "Used", 
"Used", "Used", "Used", "Used", "Used", "Used", "Used", "Used", 
"Used", "Used"), Quan.Info = c("Unique", "Unique", "Unique", 
"Unique", "Unique", "Unique", "Unique", "Unique", "Unique", "Unique", 
"Unique", "Unique", "Unique", "Unique"), Isolation.Interference = c("0.202341629", 
"0.418291109", "0.01009192", "0.007835206", "0.098607589", "0.044367289", 
"0.079941481", "0.016930911", "#VALUE!", "0.011285234", "0.008204162", 
"0.407497747", "0.029316243", "0.337603647"), pH8_2000_13_a = c(44019, 
0, 136990, 316930, 77580, 60480, 146280, 195440, 0, 83974, 278770, 
0, 24815, 0), pH8_2000_13hr_b = c(17831, 0, 71621, 87466, 43244, 
22568, 61466, 66298, 0, 32016, 82810, 0, 23746, 0), pH8_2000_13_c = c(13292, 
0, 48171, 49619, 28530, 14063, 36551, 46387, 0, 18603, 45788, 
0, 19081, 0), pH8_6_13_a = c(39319, 0, 121430, 152730, 72029, 
70634, 123350, 130580, 0, 47671, 132710, 0, 19773, 0), pH8_6_13hr_b = c(38170, 
0, 128590, 220140, 70582, 63551, 153940, 175950, 0, 64651, 189620, 
0, 19884, 0), pH8_6_13_c = c(38196, 0, 126190, 280720, 77592, 
57076, 128280, 172690, 0, 82057, 249410, 0, 20117, 0), pH8_0_13_a = c(36996, 
0, 111680, 266300, 66938, 65370, 147400, 209180, 0, 68167, 216400, 
0, 22236, 0), pH8_0_13hr_b = c(21855, 0, 67175, 120250, 40979, 
32068, 70617, 95884, 0, 38423, 106260, 0, 22106, 0), pH8_0_13_c = c(17172, 
0, 63511, 117330, 49581, 21080, 52537, 64880, 0, 40493, 109110, 
0, 21669, 0), pH7_2000_13_a = c(39033, 0, 131990, 172970, 80607, 
82387, 129900, 132070, 0, 56556, 149220, 0, 24066, 0), pH7_2000_13hr_b = c(27043, 
0, 85080, 162510, 49483, 49123, 120020, 140200, 0, 44287, 141830, 
0, 16682, 0), pH7_2000_13_c = c(34940, 0, 115260, 244700, 65388, 
51025, 115330, 151230, 0, 61133, 197510, 0, 20501, 0), pH7_6_13_a = c(20498, 
0, 70290, 103430, 39587, 26608, 60005, 67229, 0, 31730, 89734, 
0, 18984, 0), pH7_6_13hr_b = c(14491, 0, 55730, 76646, 32676, 
14832, 38949, 48179, 0, 31304, 79386, 0, 21520, 0), pH7_6_13_c = c(18158, 
0, 59825, 89469, 36735, 27595, 63923, 77255, 0, 31871, 83442, 
0, 21737, 0), pH7_0_13_a = c(36396, 0, 116410, 152450, 67804, 
118380, 121830, 132710, 0, 49137, 135210, 0, 22417, 0), pH7_0_13hr_b = c(33008, 
0, 115500, 207020, 72604, 50188, 116870, 148760, 0, 60357, 174450, 
0, 22961, 0), pH7_0_13_c = c(39070, 0, 124100, 241630, 69292, 
57035, 127700, 162590, 0, 60541, 188830, 0, 23115, 0)), row.names = c(NA, 
14L), class = "data.frame")

I tried running the lines of code within quantify.proteins() individually and didn't get a problem until I got to the

temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])

That's when a lot of the data was replaced with NA values. Up until that point it all seems to work fine. There may be a few NA's in there, but there are very few. I incorporated your line to filter the rows and it still had this problem. The read.peptides() does add a few NAs, but not many.

AlexisW · July 12, 2023, 10:36pm

When I filter, I don't get any NA after quantify.protein():

source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")

dat <- [...]

cha<-c('pH8_2000_13_a','pH8_2000_13hr_b','pH8_2000_13_c','pH8_6_13_a',
       'pH8_6_13hr_b','pH8_6_13_c','pH8_0_13_a','pH8_0_13hr_b',
       'pH8_0_13_c','pH7_2000_13_a',
       'pH7_2000_13hr_b','pH7_2000_13_c','pH7_6_13_a','pH7_6_13hr_b',
       'pH7_6_13_c','pH7_0_13_a','pH7_0_13hr_b','pH7_0_13_c') 




dat_read <- read.peptides(dat, cha) 
#> Warning in read.peptides(dat, cha): NAs introduced by coercion

table(is.na(dat_read))
#> 
#> FALSE 
#>   312

dat_filter <- dat_read[rowSums(dat_read[,cha]) > 0, ]

dat_quant_unfilt <- quantify.proteins(dat_read, cha) 
dat_quant_unfilt
#>            pH8_2000_13_a pH8_2000_13hr_b pH8_2000_13_c pH8_6_13_a pH8_6_13hr_b
#> A0A7U9GJV6            NA              NA            NA         NA           NA
#> A0A7U9C895            NA              NA            NA         NA           NA
#> A0A7U9C746            NA              NA            NA         NA           NA
#>            pH8_6_13_c pH8_0_13_a pH8_0_13hr_b pH8_0_13_c pH7_2000_13_a
#> A0A7U9GJV6         NA         NA           NA         NA            NA
#> A0A7U9C895         NA         NA           NA         NA            NA
#> A0A7U9C746         NA         NA           NA         NA            NA
#>            pH7_2000_13hr_b pH7_2000_13_c pH7_6_13_a pH7_6_13hr_b pH7_6_13_c
#> A0A7U9GJV6              NA            NA         NA           NA         NA
#> A0A7U9C895              NA            NA         NA           NA         NA
#> A0A7U9C746              NA            NA         NA           NA         NA
#>            pH7_0_13_a pH7_0_13hr_b pH7_0_13_c n.peptides n.spectra
#> A0A7U9GJV6         NA           NA         NA          1        11
#> A0A7U9C895         NA           NA         NA          1         1
#> A0A7U9C746         NA           NA         NA          1         1

dat_quant_filter <- quantify.proteins(dat_filter, cha) 
dat_quant_filter
#>            pH8_2000_13_a pH8_2000_13hr_b pH8_2000_13_c pH8_6_13_a pH8_6_13hr_b
#> A0A7U9GJV6         0.064          -0.516         -0.69       0.07        0.226
#> A0A7U9C895        -0.064           0.516          0.69      -0.07       -0.226
#>            pH8_6_13_c pH8_0_13_a pH8_0_13hr_b pH8_0_13_c pH7_2000_13_a
#> A0A7U9GJV6      0.144      0.141       -0.357     -0.444             0
#> A0A7U9C895     -0.144     -0.141        0.357      0.444             0
#>            pH7_2000_13hr_b pH7_2000_13_c pH7_6_13_a pH7_6_13hr_b pH7_6_13_c
#> A0A7U9GJV6           0.142         0.035     -0.294        -0.54     -0.452
#> A0A7U9C895          -0.142        -0.035      0.294         0.54      0.452
#>            pH7_0_13_a pH7_0_13hr_b pH7_0_13_c n.peptides n.spectra
#> A0A7U9GJV6     -0.032            0      0.029          1         9
#> A0A7U9C895      0.032            0     -0.029          1         1

^{Created on 2023-07-12 with reprex v2.0.2}

ctheisen · July 13, 2023, 2:13pm

So I think I have it set up the same way you do, and it doesn't matter whether I filter or not, I am still getting this issue.

R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> if (!requireNamespace("BiocManager", quietly = TRUE))
+   install.packages("BiocManager")
> BiocManager::install("limma")
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://ftp.osuosl.org/pub/cran/
Bioconductor version 3.17 (BiocManager 1.30.21), R 4.3.1 (2023-06-16 ucrt)
Warning message:
package(s) not installed when version(s) same as or greater than current; use `force =
  TRUE` to re-install: 'limma' 
> BiocManager::install("qvalue")
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://ftp.osuosl.org/pub/cran/
Bioconductor version 3.17 (BiocManager 1.30.21), R 4.3.1 (2023-06-16 ucrt)
Warning message:
package(s) not installed when version(s) same as or greater than current; use `force =
  TRUE` to re-install: 'qvalue' 
> 
> library(limma)
> library(qvalue)
> ## Load in data ##################
> dat <- read.csv("E:\\2023LDRD proteomics\\LDRD2023_proteomics_set1\\txt_setT1\\peptide_format_for_R3.csv") # load data
> head(dat)
  Protein.Group.Accessions Protein.Description        Sequence Quan.Usage Quan.Info
1               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
2               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
3               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
4               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
5               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
6               A0A7U9GJV6                     AAAESMIPTTTGAAK       Used    Unique
  Isolation.Interference pH8_2000_13_a pH8_2000_13hr_b pH8_2000_13_c pH8_6_13_a
1            0.202341629         44019           17831         13292      39319
2            0.418291109             0               0             0          0
3             0.01009192        136990           71621         48171     121430
4            0.007835206        316930           87466         49619     152730
5            0.098607589         77580           43244         28530      72029
6            0.044367289         60480           22568         14063      70634
  pH8_6_13hr_b pH8_6_13_c pH8_0_13_a pH8_0_13hr_b pH8_0_13_c pH7_2000_13_a pH7_2000_13hr_b
1        38170      38196      36996        21855      17172         39033           27043
2            0          0          0            0          0             0               0
3       128590     126190     111680        67175      63511        131990           85080
4       220140     280720     266300       120250     117330        172970          162510
5        70582      77592      66938        40979      49581         80607           49483
6        63551      57076      65370        32068      21080         82387           49123
  pH7_2000_13_c pH7_6_13_a pH7_6_13hr_b pH7_6_13_c pH7_0_13_a pH7_0_13hr_b pH7_0_13_c
1         34940      20498        14491      18158      36396        33008      39070
2             0          0            0          0          0            0          0
3        115260      70290        55730      59825     116410       115500     124100
4        244700     103430        76646      89469     152450       207020     241630
5         65388      39587        32676      36735      67804        72604      69292
6         51025      26608        14832      27595     118380        50188      57035
> cha<-c('pH8_2000_13_a','pH8_2000_13hr_b','pH8_2000_13_c','pH8_6_13_a',
+        'pH8_6_13hr_b','pH8_6_13_c','pH8_0_13_a','pH8_0_13hr_b',
+        'pH8_0_13_c','pH7_2000_13_a',
+        'pH7_2000_13hr_b','pH7_2000_13_c','pH7_6_13_a','pH7_6_13hr_b',
+        'pH7_6_13_c','pH7_0_13_a','pH7_0_13hr_b','pH7_0_13_c') # defines vector of headlines of all channels as they are store in data set
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r
> read.peptides <- function(dat, cha){ # excludes rows (spectra) with missing values, higher isolation interference than 30% and that are redundant or not used
+   output <- NULL
+   
+   dat$Sequence <- as.character(dat$Sequence)
+   dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
+   dat$Quan.Usage <- as.character(dat$Quan.Usage)
+   dat$Quan.Info <- as.character(dat$Quan.Info)
+   dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
+   
+   dat <- subset(dat, Isolation.Interference<=30)  
+   dat <- subset(dat, Quan.Usage=="Used")
+   dat <- subset(dat, Protein.Group.Accessions!="")
+   dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))  
+ }
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r
> quantify.proteins <- function(dat, cha){ # quantifies all proteins that are represented by the peptide spectra. each protein in the channel are log2 transformed from the median value of all peptides belonging to each protein
+   e.function <- function(x, seq) tapply(x, seq, median)
+   output <- NULL
+   
+   dat$Sequence <- toupper(dat$Sequence)
+   accessions <- as.character(unique(dat$Protein.Group.Accessions))
+   n.proteins <- length(accessions)
+   n.cha <- length(cha)
+   
+   for(k in 1:n.proteins){
+     id <- accessions[k]
+     sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
+     sdat[cha] <- log2(sdat[cha])
+     sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
+     pdat <- sdat[, -1]
+     n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
+     temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])          
+     n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)    
+     if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
+     pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
+     output <- rbind(output, pdat)
+   }
+   output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
+   output[,1:n.cha] <- sweep(output[,1:n.cha],2,apply(output[,1:n.cha],2,median))
+   output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
+   output[,1:n.cha] <- round(output[,1:n.cha],3)
+   row.names(output) <- accessions
+   output <- as.data.frame(output)
+   
+   return(output)
+ }
> # from : http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r
> eb.fit <- function(dat, design){ # eb = empirical Bayes; performs stats analysis of interest - 2 sample t-tests. Out put is a dataframe that contains log2 for each protein and its ordinary as well as moderated p- and q-values. Results are sorted by p-values in increasing order
+   n <- dim(dat)[1]
+   fit <- lmFit(dat, design)
+   fit.eb <- eBayes(fit)
+   log2FC <- fit.eb$coefficients[, 2]
+   df.r <- fit.eb$df.residual
+   df.0 <- rep(fit.eb$df.prior, n)
+   s2.0 <- rep(fit.eb$s2.prior, n)
+   s2 <- (fit.eb$sigma)^2
+   s2.post <- fit.eb$s2.post
+   t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
+   t.mod <- fit.eb$t[, 2]
+   p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
+   p.mod <- fit.eb$p.value[, 2]
+   q.ord <- qvalue(p.ord)$q
+   q.mod <- qvalue(p.mod)$q
+   results.eb <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
+   results.eb <- results.eb[order(results.eb$p.mod), ]
+   return(results.eb)
+ }
> dat_read<-read.peptides(dat,cha) # reduces peptide number
Warning message:
In read.peptides(dat, cha) : NAs introduced by coercion
> table(is.na(dat_read))

 FALSE 
307608 
> dat_filter <- dat_read[rowSums(dat_read[,cha]) > 0, ]
> dim(dat)
[1] 15339    24
> dim(dat)
[1] 15339    24
> dat_quant_unfilt<-quantify.proteins(dat_read,cha) # identifies proteins from peptides, can take a few minutes, all rows with missing values will create a warning message, you can ignore this
> head(dat_quant_unfilt)
           pH8_2000_13_a pH8_2000_13hr_b pH8_2000_13_c pH8_6_13_a pH8_6_13hr_b pH8_6_13_c
A0A7U9GJV6            NA              NA            NA         NA           NA         NA
A0A7U9C895            NA              NA            NA         NA           NA         NA
A0A7U9C746            NA              NA            NA         NA           NA         NA
A0A7U9CBB9            NA              NA            NA         NA           NA         NA
A0A7U9GMP7            NA              NA            NA         NA           NA         NA
A0A7U9GIZ4            NA              NA            NA         NA           NA         NA
           pH8_0_13_a pH8_0_13hr_b pH8_0_13_c pH7_2000_13_a pH7_2000_13hr_b pH7_2000_13_c
A0A7U9GJV6         NA           NA         NA            NA              NA            NA
A0A7U9C895         NA           NA         NA            NA              NA            NA
A0A7U9C746         NA           NA         NA            NA              NA            NA
A0A7U9CBB9         NA           NA         NA            NA              NA            NA
A0A7U9GMP7         NA           NA         NA            NA              NA            NA
A0A7U9GIZ4         NA           NA         NA            NA              NA            NA
           pH7_6_13_a pH7_6_13hr_b pH7_6_13_c pH7_0_13_a pH7_0_13hr_b pH7_0_13_c
A0A7U9GJV6         NA           NA         NA         NA           NA         NA
A0A7U9C895         NA           NA         NA         NA           NA         NA
A0A7U9C746         NA           NA         NA         NA           NA         NA
A0A7U9CBB9         NA           NA         NA         NA           NA         NA
A0A7U9GMP7         NA           NA         NA         NA           NA         NA
A0A7U9GIZ4         NA           NA         NA         NA           NA         NA
           n.peptides n.spectra
A0A7U9GJV6         27       119
A0A7U9C895          4         6
A0A7U9C746          9        10
A0A7U9CBB9          6        14
A0A7U9GMP7          7        13
A0A7U9GIZ4          5         9
> dat_quant_filter<-quantify.proteins(dat_filter,cha)
> head(dat_quant_filter)
           pH8_2000_13_a pH8_2000_13hr_b pH8_2000_13_c pH8_6_13_a pH8_6_13hr_b pH8_6_13_c
A0A7U9GJV6            NA              NA            NA         NA           NA         NA
A0A7U9C895            NA              NA            NA         NA           NA         NA
A0A7U9CBB9            NA              NA            NA         NA           NA         NA
A0A7U9GMP7            NA              NA            NA         NA           NA         NA
A0A7U9C8C5            NA              NA            NA         NA           NA         NA
A0A7U9GMW1            NA              NA            NA         NA           NA         NA
           pH8_0_13_a pH8_0_13hr_b pH8_0_13_c pH7_2000_13_a pH7_2000_13hr_b pH7_2000_13_c
A0A7U9GJV6         NA           NA         NA            NA              NA            NA
A0A7U9C895         NA           NA         NA            NA              NA            NA
A0A7U9CBB9         NA           NA         NA            NA              NA            NA
A0A7U9GMP7         NA           NA         NA            NA              NA            NA
A0A7U9C8C5         NA           NA         NA            NA              NA            NA
A0A7U9GMW1         NA           NA         NA            NA              NA            NA
           pH7_6_13_a pH7_6_13hr_b pH7_6_13_c pH7_0_13_a pH7_0_13hr_b pH7_0_13_c
A0A7U9GJV6         NA           NA         NA         NA           NA         NA
A0A7U9C895         NA           NA         NA         NA           NA         NA
A0A7U9CBB9         NA           NA         NA         NA           NA         NA
A0A7U9GMP7         NA           NA         NA         NA           NA         NA
A0A7U9C8C5         NA           NA         NA         NA           NA         NA
A0A7U9GMW1         NA           NA         NA         NA           NA         NA
           n.peptides n.spectra
A0A7U9GJV6         27        91
A0A7U9C895          4         5
A0A7U9CBB9          6         9
A0A7U9GMP7          7        10
A0A7U9C8C5         11        22
A0A7U9GMW1          7        26

There are over 15000 rows, so I think making a full dput() might be too long. I added in the dimensions from before and after the filtering step and it doesn't look like anything was removed

ctheisen · July 13, 2023, 2:22pm

And also, I tried counting the number of NA values and it doesn't appear there are any after read.peptides(), even though it says they were introduced

> dat <- read.csv("E:\\2023LDRD proteomics\\LDRD2023_proteomics_set1\\txt_setT1\\peptide_format_for_R3.csv") # load data
> dat<-read.peptides(dat,cha) # reduces peptide number
Warning message:
In read.peptides(dat, cha) : NAs introduced by coercion
> sum(is.na(dat))
[1] 0

AlexisW · July 13, 2023, 2:54pm

You forgot to change the variable name, what are:

dim(dat_read)
dim(dat_filter)

I'm not sure, but based on our previous experimentations I expect if there is a single line that is NA or constant, then the whole column will end up NA (because the median of NA is NA, and they have a subtract the median step).

You can try to detect constant rows with:

which(rowVars(dat[, cha]) == 0)

if there are any you'd want to filter them.

If it's still NA, you should try running it with only the first 6 rows, based on our previous experiementations it shouldn't be all NA. Then try running with the first 100 rows, the first 1,000 rows etc, to try to find which row is the one that causes NAs when present in the input.

That's due to a #VALUE! in row 9 of dat$Isolation.Interference, which probably comes from Excel. I don't think this column is really used later on, so probably doesn't matter.

ctheisen · July 13, 2023, 5:51pm

So I narrowed the first error down to 633, and the last 20 rows of that are shown, with 633 being the last row. The only think I can think of is it is the only one that has a lot of zeros, but not all zeros, in the sample names. So I would like to try to find a way to rewrite the function to account for these. Maybe making the Log2 into an if statement, as in if it errors it will replace it with zero. Do you know how I could do that?


dput(tail(dat_633_read_filter,20))
structure(list(Protein.Group.Accessions = c("A0A7U9GI36", "A0A7U9CA74", 
"A0A7U9C4Y2", "A0A7U9GKQ1", "CON__P02533;CON__P08779", "CON__P02533;CON__P08779", 
"A0A7U9GKD8", "A0A7U9GKD8", "CON__P13645", "CON__P13645", "CON__P13645", 
"CON__P13645", "CON__P13645", "A0A7U9GKA0", "A0A7U9GKA0", "A0A7U9GHS1", 
"A0A7U9C4E6", "A0A7U9CFZ9", "A0A7U9CFZ9", "A0A7U9CFZ9"), Protein.Description = c("", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", ""), Sequence = c("ALDMSVSK", "ALDQIIAGK", "ALDTIDK", 
"ALEDCVK", "ALEEANADLEVK", "ALEEANADLEVK", "ALEEPVR", "ALEEPVR", 
"ALEESNYELEGK", "ALEESNYELEGK", "ALEESNYELEGK", "ALEESNYELEGK", 
"ALEESNYELEGK", "ALELNLKPIVVINK", "ALELNLKPIVVINK", "ALGEAGLDEIR", 
"ALGKNEENISDEIEELIK", "ALGYGTDAEIIEFFGEDER", "ALGYGTDAEIIEFFGEDER", 
"ALGYGTDAEIIEFFGEDER"), Quan.Usage = c("Used", "Used", "Used", 
"Used", "Used", "Used", "Used", "Used", "Used", "Used", "Used", 
"Used", "Used", "Used", "Used", "Used", "Used", "Used", "Used", 
"Used"), Quan.Info = c("Unique", "Unique", "Unique", "Unique", 
"Unique", "Unique", "Unique", "Unique", "Unique", "Unique", "Unique", 
"Unique", "Unique", "Unique", "Unique", "Unique", "Unique", "Unique", 
"Unique", "Unique"), Isolation.Interference = c(0.121812939, 
0.152421013, 0.056531879, 0.164104334, 0.09729899, 0.083573197, 
0.025195125, 0.012130391, 0.059432489, 0.208410884, 0.02243324, 
0.132142199, 0.100807901, 0.424468934, 0.176699674, 0.160085462, 
0.209427118, 0.007959622, 0.020562568, 0.122213884), pH8_2000_13_a = c(72853, 
34546, 74807, 140720, 7309.9, 31725, 22834, 37400, 13515, 54315, 
58018, 16663, 36854, 6023.5, 23951, 63488, 5181.2, 9336.8, 12941, 
0), pH8_2000_13hr_b = c(87450, 38854, 62004, 133310, 5346.8, 
32758, 20210, 41142, 12987, 53091, 58464, 15008, 36435, 8024.4, 
27715, 64356, 6987.2, 10002, 12387, 0), pH8_2000_13_c = c(80360, 
23501, 50560, 100130, 2526.2, 22959, 15354, 30617, 5908, 24034, 
23035, 6915.9, 15825, 5216.5, 16132, 47792, 6703.2, 4809, 5144.5, 
0), pH8_6_13_a = c(55169, 10738, 132360, 227160, 12396, 46735, 
21307, 28600, 57653, 173950, 256810, 66587, 208650, 4793.3, 20118, 
111280, 5124.6, 8340.5, 12563, 829.5), pH8_6_13hr_b = c(61660, 
19035, 134340, 235160, 12036, 44813, 25962, 41062, 27004, 105460, 
102690, 36809, 79315, 6033.7, 19248, 97586, 4527.8, 7821.5, 7817.6, 
788.21), pH8_6_13_c = c(73760, 22573, 108470, 202340, 7710.7, 
37366, 19867, 37095, 23112, 67877, 95133, 23905, 67220, 8025.5, 
20365, 73649, 6158.8, 8054.3, 11446, 206.92), pH8_0_13_a = c(69610, 
31793, 90787, 194350, 4037.5, 21813, 19871, 35174, 6499.6, 34327, 
22217, 8273.2, 12684, 5487.5, 18359, 59150, 5187.6, 6093.7, 7739.5, 
0), pH8_0_13hr_b = c(79237, 31188, 47736, 85897, 5878, 33282, 
17516, 35146, 12537, 51438, 57280, 14006, 33004, 7217.7, 23481, 
53355, 4329.2, 7928.9, 9001.3, 0), pH8_0_13_c = c(71563, 40383, 
41495, 57775, 3951.7, 28924, 6415.1, 11918, 8963.7, 35780, 46012, 
10077, 27242, 6771.8, 21368, 54668, 7079.7, 6173, 5578.8, 0), 
    pH7_2000_13_a = c(54481, 8352.8, 158970, 220590, 13724, 54281, 
    23017, 34165, 40801, 126290, 154520, 47351, 120390, 6345.8, 
    17409, 119920, 6930.4, 10171, 14529, 738.75), pH7_2000_13hr_b = c(40545, 
    8997.8, 115910, 193740, 6614.8, 28276, 22665, 40459, 15964, 
    56364, 48113, 21261, 40574, 4217.5, 17232, 74088, 3528.7, 
    7458.7, 8574, 732.33), pH7_2000_13_c = c(65621, 19377, 147300, 
    253690, 21768, 47616, 20736, 39613, 20938, 73442, 90697, 
    28707, 75278, 6149.6, 20988, 70256, 5376.1, 7457.9, 9622.5, 
    0), pH7_6_13_a = c(59654, 25595, 122020, 227650, 7099.8, 
    27740, 17162, 27811, 14681, 50057, 68354, 17730, 42149, 5913.6, 
    16891, 51806, 4257.2, 7112.3, 11811, 0), pH7_6_13hr_b = c(77605, 
    35332, 35041, 68064, 2823.2, 25925, 14831, 31245, 7809.9, 
    34378, 37870, 10298, 20248, 6978.2, 22288, 49065, 7553.2, 
    8143.9, 14074, 0), pH7_6_13_c = c(78913, 36268, 44843, 73276, 
    2556, 26009, 13997, 27511, 8973.1, 35597, 29691, 11362, 17257, 
    5970, 23272, 61506, 6839.7, 5779.3, 6337.6, 0), pH7_0_13_a = c(42483, 
    7926.6, 185310, 271430, 15136, 42609, 22539, 36518, 39565, 
    128850, 191530, 52931, 151600, 4869.2, 16137, 105450, 5983.4, 
    9818.7, 11390, 857.11), pH7_0_13hr_b = c(57464, 16774, 124630, 
    201820, 7425.5, 39364, 21493, 37866, 22198, 73302, 75753, 
    30056, 63786, 6631.3, 19492, 89763, 4874.8, 8272.1, 11433, 
    297.81), pH7_0_13_c = c(66782, 18441, 145430, 283190, 13759, 
    43585, 25451, 43414, 29930, 94455, 137840, 36109, 113420, 
    4929.9, 20519, 79931, 5363, 9468.8, 12911, 259.15)), row.names = c(603L, 
608L, 609L, 611L, 612L, 613L, 615L, 616L, 618L, 620L, 621L, 622L, 
623L, 625L, 626L, 629L, 630L, 631L, 632L, 633L), class = "data.frame")

AlexisW · July 13, 2023, 7:56pm

Indeed, row 633 has 10 values at 0 (whose log is -Inf) and 8 values non-0, so when taking the median you get -Inf, and it propagates.

You wouldn't have the problem if you had only a few 0 sprinkled in the row, only if they make up more than half the values (as you can safely calculate a median as long as you know half the values are under it). So I'd suggest you could filter out rows with too many 0:

dat_read <- read.peptides(dat, cha) 


# filter out constant rows
dat_filter <- dat_read[rowVars(dat_read[,cha]) > 0, ]
# filter out rows with more than 40% of 0
dat_filter <- dat_filter[rowMeans(dat_filter[,cha] == 0) < 0.4, ]

dat_quant_unfilt <- quantify.proteins(dat_read, cha) 
dat_quant_unfilt

dat_quant_filter <- quantify.proteins(dat_filter, cha) 
dat_quant_filter

Another possibility is indeed to rewrite the function, what is commonly done in these situations is to add a pseudocount of 1 so that you don't take log(0):

sdat[cha] <- log2(sdat[cha] + 1)

Whether it's appropriate in your case depends on what the data represents.

An if statement to replace -Inf with 0 would probably not be correct: you'd end up with log(0) > log(0.5), which doesn't really make sense. For how to do it, something like that should work (but again, I don't recommend it):

sda[sda == -Inf] <- 0

ctheisen · July 13, 2023, 8:16pm

Thank you for all your help!

system · July 20, 2023, 8:17pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.