Running exome burden test with R ends up with incorrect NULL values

raven1960 · May 25, 2021, 7:57am

I am very new to R, so I apologise if this looks simple to someone.

I try to to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.

casefile:

GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC
ENSG00000124209   1   0   0   1
ENSG00000064703   1   1   0   9
ENSG00000171408   1   0   0   1
ENSG00000110514   1   1   1   12
ENSG00000247077   1   1   1   7

controlfile:

    GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC
ENSG00000124209   1   0   0   1
ENSG00000064703   1   1   0   9
ENSG00000171408   1   0   0   1
ENSG00000110514   1   1   1   12
ENSG00000247077   1   1   1   7
ENSG00000174776   1   1   0   2
ENSG00000076864   1   0   1   13
ENSG00000086015   1   0   1   25

I have this script:

#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))

parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")

args <- parser$parse_args()

case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"

dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0

dat$P_DOM<-0
dat$P_REC<-0

for(i in 1:nrow(dat)){
  
  #Dominant model
  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
  
  if(case_count>args$casesize){
    case_count<-args$casesize
  }else if(case_count<0){
    case_count<-0
   }
  if(control_count>args$controlsize){
    control_count<-args$controlsize
  }else if(control_count<0){
    control_count<-0
   }
  
  mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
  dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value
and problem starts in here:

  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM

the result of case_count and control_count is NULL values, however corresponding columns in both input files are NOT empty.

I tried to run the script above with assigning absolute numbers (1000 and 2000) to variables case_count and control_count , and the script worked without issues.

The main purpose of the code: GitHub - mhguo1/TRAPD: Burden testing against public controls

Run burden testing This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.

It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as: Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt

The script has 5 required options:

--casefile: Path to the counts file for the cases, as generated in Step 2A --casesize: Number of cases that were tested in Step 2A --controlfile: Path to the counts file for the controls, as generated in Step 2B --controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size --output: Output file path/name Output: A tab delimited file with 10 columns:

#GENE: Gene name CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene. CASE_TOTAL_AC: Total AC for a given gene. CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene CONTROL_COUNT_HOM: Number of controlss carrying homozygous qualifying variants in a given gene. CONTROL_TOTAL_AC: Total AC for a given gene. P_DOM: p-value under the dominant model. P_REC: p-value under the recessive model.

I try to run genetic variant burden test with vcf files and external gnomAD controls. I found this repo suitable and trying to fix bugs now in it.

as a newbie in R statistics, I will be happy about any suggestion. Thank you!

system · June 15, 2021, 7:58am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.