Error with tuning for censored regression

Hi,

I am trying to perform a cox proportional hazards analysis using glmnet with tidymodels. I have gene expression data and I am trying to figure out which genes are associated with event-free survival.

I came across this question on stack overflow and have tried to use it on my own data; however, I run into an error:

→ A | error: $ operator is invalid for atomic vectors There were issues with some computations A: x1
Error in res$fit$preproc : $ operator is invalid for atomic vectors

By using a public dataset, I am able to re-create the error. See the code below.

librarian::shelf(tidyverse, dplyr, survival, tidymodels, glmnet, censored, Biobase, GEOquery)

# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
# obtaining expression matrix
df <- exprs(gset[[1]])
df <- df[-grep('^AFFX', rownames(df)),] # removing Affymetrix probes
df <- t(scale(t(df))) # scaling gene expression values

# extracting time and status from the metadata
idx <- which(colnames(pData(gset[[1]])) %in%
  c('distant rfs:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
  row.names = rownames(pData(gset[[1]])))

# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]

# matching expression data to metadata
df <- df[,which(colnames(df) %in% rownames(metadata))]

# checking matching samplenames and order
all((colnames(df) == rownames(metadata)) == TRUE)

# create a merged dataframe of metadata and z-scored expression
coxdata <- data.frame(metadata, t(df)) %>%
  dplyr::rename(status=distant.rfs.ch1, time=time.rfs.ch1) %>%
  mutate(status=as.numeric(status)) %>%
  mutate(time = as.numeric(gsub('^KJX|^KJ', '', time))) %>%
  mutate(surv = Surv(time, status), .keep = "unused") %>%
  relocate(surv, everything())

head(coxdata)[1:5,1:5]

# prep
lasso_prep <-
  recipe(coxdata) %>%
  update_role(everything(), new_role="predictor") %>%
  update_role("surv", new_role="outcome")

lasso_tune <- 
  proportional_hazards(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet") %>% 
  set_mode("censored regression")

lasso_workflow <-
  workflow() %>% 
  add_recipe(lasso_prep) %>% 
  add_model(lasso_tune)

# specify lasso models
set.seed(123)
cv_splits <- vfold_cv(coxdata, v = 5) # reducing the number of folds just for speed of this example

# running tuning grid
lasso_res <- tune_grid(lasso_workflow,
                       resamples = cv_splits,
                       # this will make a grid of 5 penalty values
                       grid = 5,
                       eval_time = 5000) 

I have tried doing this on a subset of the genes, and the code works fine. For example, I have run the same code on the dataframe while subsetting different columns (i.e. 1:1000, 1001:2000, ..., 10001:20000, 20001:24491) and it executes with no errors. This only occurs when I am using every gene.

Am I not using this modelling function correctly? Perhaps it's not designed to handle this many predictors? Any help or advice would be appreciated.

edit: here is my sessionInfo() output:

R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 14.1.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GEOquery_2.66.0 censored_0.2.0.9001 glmnet_4.1-7 yardstick_1.3.0
[5] workflowsets_1.0.0 workflows_1.1.3 tune_1.1.2.9018 rsample_1.2.0
[9] recipes_1.0.9 parsnip_1.1.1.9008 modeldata_1.1.0 infer_1.0.4
[13] dials_1.2.0 scales_1.3.0 broom_1.0.4 tidymodels_1.0.0
[17] survminer_0.4.9 ggpubr_0.6.0 org.Hs.eg.db_3.16.0 AnnotationDbi_1.60.2
[21] IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0 BiocGenerics_0.44.0
[25] biomaRt_2.54.1 RegParallel_1.16.0 arm_1.13-1 lme4_1.1-32
[29] Matrix_1.6-4 MASS_7.3-58.2 data.table_1.14.10 doParallel_1.0.17
[33] iterators_1.0.14 foreach_1.5.2 survival_3.5-3 lubridate_1.9.3
[37] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[41] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[45] tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] backports_1.4.1 BiocFileCache_2.6.1 splines_4.2.3 listenv_0.9.0
[5] GenomeInfoDb_1.34.9 digest_0.6.33 librarian_1.8.1 fansi_1.0.6
[9] magrittr_2.0.3 memoise_2.0.1 limma_3.54.2 tzdb_0.4.0
[13] globals_0.16.2 Biostrings_2.66.0 gower_1.0.1 R.utils_2.12.2
[17] hardhat_1.3.0 timechange_0.2.0 prettyunits_1.2.0 colorspace_2.1-0
[21] blob_1.2.4 rappdirs_0.3.3 warp_0.2.1 xfun_0.40
[25] crayon_1.5.2 RCurl_1.98-1.10 zoo_1.8-12 glue_1.6.2
[29] gtable_0.3.4 ipred_0.9-14 zlibbioc_1.44.0 XVector_0.38.0
[33] car_3.1-1 shape_1.4.6 future.apply_1.11.1 abind_1.4-5
[37] DBI_1.1.3 rstatix_0.7.2 Rcpp_1.0.11 xtable_1.8-4
[41] progress_1.2.2 GPfit_1.0-8 bit_4.0.5 km.ci_0.5-6
[45] lava_1.7.3 prodlim_2023.08.28 httr_1.4.7 ellipsis_0.3.2
[49] R.methodsS3_1.8.2 pkgconfig_2.0.3 XML_3.99-0.14 nnet_7.3-18
[53] dbplyr_2.4.0 utf8_1.2.4 tidyselect_1.2.0 rlang_1.1.2
[57] DiceDesign_1.10 munsell_0.5.0 tools_4.2.3 cachem_1.0.8
[61] cli_3.6.2 generics_0.1.3 RSQLite_2.3.0 fastmap_1.1.1
[65] knitr_1.45 bit64_4.0.5 survMisc_0.5.6 KEGGREST_1.38.0
[69] future_1.33.1 nlme_3.1-162 R.oo_1.25.0 xml2_1.3.3
[73] compiler_4.2.3 rstudioapi_0.14 filelock_1.0.2 curl_5.1.0
[77] png_0.1-8 slider_0.3.1 ggsignif_0.6.4 lhs_1.1.6
[81] stringi_1.8.3 lattice_0.20-45 nloptr_2.0.3 KMsurv_0.1-5
[85] vctrs_0.6.5 pillar_1.9.0 lifecycle_1.0.4 furrr_0.3.1
[89] bitops_1.0-7 R6_2.5.1 gridExtra_2.3 parallelly_1.36.0
[93] codetools_0.2-19 boot_1.3-28.1 withr_2.5.2 GenomeInfoDbData_1.2.9
[97] hms_1.1.3 grid_4.2.3 rpart_4.1.19 timeDate_4032.109
[101] coda_0.19-4 class_7.3-21 minqa_1.2.5 carData_3.0-5

An update: I have been able to get this to work in glmnet, though I encountered a similar error message (not sure if related)

library(glmnet)

# load series and platform data from GEO
gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
# obtaining expression matrix
df <- exprs(gset[[1]])
df <- df[-grep('^AFFX', rownames(df)),] # removing Affymetrix probes
df <- t(scale(t(df))) # scaling gene expression values

# extracting time and status from the metadata
idx <- which(colnames(pData(gset[[1]])) %in%
  c('distant rfs:ch1', 'time rfs:ch1'))
metadata <- data.frame(pData(gset[[1]])[,idx],
  row.names = rownames(pData(gset[[1]])))

# remove samples from the pdata that have any NA value
discard <- apply(metadata, 1, function(x) any( is.na(x) ))
metadata <- metadata[!discard,]

# assigning dataframes to y and x for simplicity
x <- t(df)
y <- metadata

# using code from glmnet vignette
set.seed(1)
cvfit <- cv.glmnet(x, y, family = "cox", type.measure = "C")
#Error: Cox model requires a matrix with columns 'time' (>0) and 'status' (binary) as a response; a 'Surv' object suffices
cvfit <- glmnet::cv.glmnet(x=x, y=as.matrix(y), family = "cox", type.measure = "C")
Error in Surv(y[, "time"], y[, "status"]) : Time variable is not numeric

Of course when you check whether the matrix format of y is numeric or not:

class(as.matrix(y)$time)
#Error in as.matrix(y)$time : $ operator is invalid for atomic vectors

I then tried to pass y as a Surv object with no luck

y_surv <- Surv(time=y$time, event=y$status, type="right")
cvfit <- glmnet::cv.glmnet(x=x, y=y_surv, family = "cox", type.measure = "C")
#Error in response.coxnet(y) :  cox.path() only supports 'Surv' objects of type 'right' or 'counting'

But when I combine the outcome with the predictors and pass it into the function like so, I get a result:

cox_df <- data.frame(y, x) %>% 
  mutate(surv = Surv(time=time, event=status), .keep = "unused") %>%
  dplyr::select(surv,everything())
cvfit <- glmnet::cv.glmnet(x=as.matrix(cox_df[,-1]), y=cox_df[,1], family = "cox", type.measure = "C")
# success!

I assume tidymodels utilizes glmnet functions, so perhaps this is the source of the error, though I'm not entirely sure why it happens in the first place. I suspect there's a check for whether time is numeric that is getting mixed up between matrix/dataframe.

Thanks for posting this! Glad to see you've got it working with glmnet directly. As for using tidymodels for this: It looks like the number of predictors is what's breaking this in our approach to wrapping glmnet.

I've opened an issue for this. That's wired pretty deeply into censored by the restrictions glmnet and parsnip pose on it, so no quick fix, I'm afraid. Thanks for so much for surfacing it though! You can track progress via the issue.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.