I have been using the lm() function for linear regression for two variables (y= age of biological sample, x=expression level of Gene Z), where the gene expression was pulled from RNA sequencing data from 15 independent samples at different ages.
I would like to perform linear regression of y (age of sample) versus 10,000 genes' expression, individually, with output of the p value for each of the 10,000 genes. The goal is to identify genes that correlate in expression to the age of the sample.
How can i run independent linear regressions on 10,000 genes, and have the output be the p value in a table?
Hi @Namin! In order to help you a bit more, we're going to need to see what your data looks like. We have an FAQ that can help you build a reproducible example, or reprex, for this purpose.
It can be a little difficult to figure out what data to include in a reprex—especially if you're working with medical or otherwise confidential data, which absolutely shouldn't be shared. This thread has some discussion on that:
We don't really need to know exactly what data you have—we're more interested in how it's structured, what the column names are, whether each group has 1 or many rows, that sort of thing. If you can give a synthetic data set (ie. made up values) with maybe 10--100 rows, that can help us get you started
Thanks for the tip. In the data set above, I've included 11 columns. The first column is the age (1, 0.75, 0.35, 0.1, 0) in triplicate. Each subsequent column is gene expression for a particular gene (gene names removed). My full data set has 10,000 columns. The code provides the linear regression coefficients for comparison between age (column 1) and the first gene (column 2), giving me a p val of 1.5^10-5 along with other output. I am just interested in the p value of the linear regression at this time, and coding it in a way such that I can easily obtain it for all of the 10,000 genes in my full data set.
Okay, so you're looking to make models between the first column (the response) and each of the subsequent columns (predictors) in turn? I think the tidyverse tools might help us get this done more easily
First, I'm gonna reshape the data so that every column other than the first is collapsed into a single column. Then, I'll nest the responses and gene expressions for each predictor group, so we'll have a data frame for each one. Finally, I'll run the regression for each group and extract the R-squared value.
suppressPackageStartupMessages(library(tidyverse))
#> Warning: package 'ggplot2' was built under R version 3.5.1
#> Warning: package 'dplyr' was built under R version 3.5.1
# tidy
tidy_data =
x1 %>%
as_data_frame() %>%
rename(response = V1) %>%
gather(key = 'gene', value = 'gene_expr', -response) %>%
print()
#> # A tibble: 150 x 3
#> response gene gene_expr
#> <dbl> <chr> <dbl>
#> 1 1 V2 26.6
#> 2 1 V2 27.8
#> 3 1 V2 17.8
#> 4 0.75 V2 12.0
#> 5 0.75 V2 15.1
#> 6 0.75 V2 11.0
#> 7 0.35 V2 13.6
#> 8 0.35 V2 14.3
#> 9 0.35 V2 9.62
#> 10 0.1 V2 6.06
#> # ... with 140 more rows
# now let's nest and model
tidy_data %>%
nest(-gene) %>%
mutate(
model = map(data, ~ lm(.$response ~ .$gene_expr)),
rsq = map_dbl(model, ~summary(.)$r.squared))
#> # A tibble: 10 x 4
#> gene data model rsq
#> <chr> <list> <list> <dbl>
#> 1 V2 <tibble [15 x 2]> <S3: lm> 0.775
#> 2 V3 <tibble [15 x 2]> <S3: lm> 0.914
#> 3 V4 <tibble [15 x 2]> <S3: lm> 0.891
#> 4 V5 <tibble [15 x 2]> <S3: lm> 0.615
#> 5 V6 <tibble [15 x 2]> <S3: lm> 0.785
#> 6 V7 <tibble [15 x 2]> <S3: lm> 0.362
#> 7 V8 <tibble [15 x 2]> <S3: lm> 0.806
#> 8 V9 <tibble [15 x 2]> <S3: lm> 0.746
#> 9 V10 <tibble [15 x 2]> <S3: lm> 0.695
#> 10 V11 <tibble [15 x 2]> <S3: lm> 0.282
Thanks for looking into this. So far so good, I think it is close.
The last bit is to output the p value for the linear regression. Currently it is producing the rsq (which is also useful, but not what I need at the moment). I can change "r.squared" to "adj.r.squared" or "sigma" and the code will run, but this wont get me closer to the p value.