I am very new to R, so I apologise if this looks simple to someone.
I try to to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
#Dominant model
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
if(case_count>args$casesize){
case_count<-args$casesize
}else if(case_count<0){
case_count<-0
}
if(control_count>args$controlsize){
control_count<-args$controlsize
}else if(control_count<0){
control_count<-0
}
mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value
and problem starts in here:
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
the result of case_count and control_count is NULL values, however corresponding columns in both input files are NOT empty.
I tried to run the script above with assigning absolute numbers (1000 and 2000) to variables case_count and control_count , and the script worked without issues.
The main purpose of the code: GitHub - mhguo1/TRAPD: Burden testing against public controls
Run burden testing This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.
It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as: Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A --casesize: Number of cases that were tested in Step 2A --controlfile: Path to the counts file for the controls, as generated in Step 2B --controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size --output: Output file path/name Output: A tab delimited file with 10 columns:
#GENE: Gene name CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene. CASE_TOTAL_AC: Total AC for a given gene. CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene CONTROL_COUNT_HOM: Number of controlss carrying homozygous qualifying variants in a given gene. CONTROL_TOTAL_AC: Total AC for a given gene. P_DOM: p-value under the dominant model. P_REC: p-value under the recessive model.
I try to run genetic variant burden test with vcf files and external gnomAD controls. I found this repo suitable and trying to fix bugs now in it.
as a newbie in R statistics, I will be happy about any suggestion. Thank you!