Hi,
I am trying to perform a cox proportional hazards analysis using glmnet with tidymodels. I have gene expression data and I am trying to figure out which genes are associated with event-free survival.
I came across this question on stack overflow and have tried to use it on my own data; however, I run into an error:
→ A | error: $ operator is invalid for atomic vectors There were issues with some computations A: x1
Error in res$fit$preproc : $ operator is invalid for atomic vectors
By using a public dataset, I am able to re-create the error. See the code below.
librarian::shelf(tidyverse, dplyr, survival, tidymodels, glmnet, censored, Biobase, GEOquery)
# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
# obtaining expression matrix
df <- exprs(gset[[1]])
df <- df[-grep('^AFFX', rownames(df)),] # removing Affymetrix probes
df <- t(scale(t(df))) # scaling gene expression values
# extracting time and status from the metadata
idx <- which(colnames(pData(gset[[1]])) %in%
c('distant rfs:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
row.names = rownames(pData(gset[[1]])))
# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]
# matching expression data to metadata
df <- df[,which(colnames(df) %in% rownames(metadata))]
# checking matching samplenames and order
all((colnames(df) == rownames(metadata)) == TRUE)
# create a merged dataframe of metadata and z-scored expression
coxdata <- data.frame(metadata, t(df)) %>%
dplyr::rename(status=distant.rfs.ch1, time=time.rfs.ch1) %>%
mutate(status=as.numeric(status)) %>%
mutate(time = as.numeric(gsub('^KJX|^KJ', '', time))) %>%
mutate(surv = Surv(time, status), .keep = "unused") %>%
relocate(surv, everything())
head(coxdata)[1:5,1:5]
# prep
lasso_prep <-
recipe(coxdata) %>%
update_role(everything(), new_role="predictor") %>%
update_role("surv", new_role="outcome")
lasso_tune <-
proportional_hazards(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("censored regression")
lasso_workflow <-
workflow() %>%
add_recipe(lasso_prep) %>%
add_model(lasso_tune)
# specify lasso models
set.seed(123)
cv_splits <- vfold_cv(coxdata, v = 5) # reducing the number of folds just for speed of this example
# running tuning grid
lasso_res <- tune_grid(lasso_workflow,
resamples = cv_splits,
# this will make a grid of 5 penalty values
grid = 5,
eval_time = 5000)
I have tried doing this on a subset of the genes, and the code works fine. For example, I have run the same code on the dataframe while subsetting different columns (i.e. 1:1000, 1001:2000, ..., 10001:20000, 20001:24491) and it executes with no errors. This only occurs when I am using every gene.
Am I not using this modelling function correctly? Perhaps it's not designed to handle this many predictors? Any help or advice would be appreciated.
edit: here is my sessionInfo() output:
R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 14.1.2Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dyliblocale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods baseother attached packages:
[1] GEOquery_2.66.0 censored_0.2.0.9001 glmnet_4.1-7 yardstick_1.3.0
[5] workflowsets_1.0.0 workflows_1.1.3 tune_1.1.2.9018 rsample_1.2.0
[9] recipes_1.0.9 parsnip_1.1.1.9008 modeldata_1.1.0 infer_1.0.4
[13] dials_1.2.0 scales_1.3.0 broom_1.0.4 tidymodels_1.0.0
[17] survminer_0.4.9 ggpubr_0.6.0 org.Hs.eg.db_3.16.0 AnnotationDbi_1.60.2
[21] IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0 BiocGenerics_0.44.0
[25] biomaRt_2.54.1 RegParallel_1.16.0 arm_1.13-1 lme4_1.1-32
[29] Matrix_1.6-4 MASS_7.3-58.2 data.table_1.14.10 doParallel_1.0.17
[33] iterators_1.0.14 foreach_1.5.2 survival_3.5-3 lubridate_1.9.3
[37] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[41] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[45] tidyverse_2.0.0loaded via a namespace (and not attached):
[1] backports_1.4.1 BiocFileCache_2.6.1 splines_4.2.3 listenv_0.9.0
[5] GenomeInfoDb_1.34.9 digest_0.6.33 librarian_1.8.1 fansi_1.0.6
[9] magrittr_2.0.3 memoise_2.0.1 limma_3.54.2 tzdb_0.4.0
[13] globals_0.16.2 Biostrings_2.66.0 gower_1.0.1 R.utils_2.12.2
[17] hardhat_1.3.0 timechange_0.2.0 prettyunits_1.2.0 colorspace_2.1-0
[21] blob_1.2.4 rappdirs_0.3.3 warp_0.2.1 xfun_0.40
[25] crayon_1.5.2 RCurl_1.98-1.10 zoo_1.8-12 glue_1.6.2
[29] gtable_0.3.4 ipred_0.9-14 zlibbioc_1.44.0 XVector_0.38.0
[33] car_3.1-1 shape_1.4.6 future.apply_1.11.1 abind_1.4-5
[37] DBI_1.1.3 rstatix_0.7.2 Rcpp_1.0.11 xtable_1.8-4
[41] progress_1.2.2 GPfit_1.0-8 bit_4.0.5 km.ci_0.5-6
[45] lava_1.7.3 prodlim_2023.08.28 httr_1.4.7 ellipsis_0.3.2
[49] R.methodsS3_1.8.2 pkgconfig_2.0.3 XML_3.99-0.14 nnet_7.3-18
[53] dbplyr_2.4.0 utf8_1.2.4 tidyselect_1.2.0 rlang_1.1.2
[57] DiceDesign_1.10 munsell_0.5.0 tools_4.2.3 cachem_1.0.8
[61] cli_3.6.2 generics_0.1.3 RSQLite_2.3.0 fastmap_1.1.1
[65] knitr_1.45 bit64_4.0.5 survMisc_0.5.6 KEGGREST_1.38.0
[69] future_1.33.1 nlme_3.1-162 R.oo_1.25.0 xml2_1.3.3
[73] compiler_4.2.3 rstudioapi_0.14 filelock_1.0.2 curl_5.1.0
[77] png_0.1-8 slider_0.3.1 ggsignif_0.6.4 lhs_1.1.6
[81] stringi_1.8.3 lattice_0.20-45 nloptr_2.0.3 KMsurv_0.1-5
[85] vctrs_0.6.5 pillar_1.9.0 lifecycle_1.0.4 furrr_0.3.1
[89] bitops_1.0-7 R6_2.5.1 gridExtra_2.3 parallelly_1.36.0
[93] codetools_0.2-19 boot_1.3-28.1 withr_2.5.2 GenomeInfoDbData_1.2.9
[97] hms_1.1.3 grid_4.2.3 rpart_4.1.19 timeDate_4032.109
[101] coda_0.19-4 class_7.3-21 minqa_1.2.5 carData_3.0-5