Hello,
The goal of my project is to implement multiple ML algorithms, one of them being an elastic net logistic regression. To keep my code as simple as possible (so it can be run for different specified models), I want to use tidymodels.
For the elastic net logistic regression I set a fixed mixture of 0.7 and want to find the optimal penalty lambda. As its estimate can be unstable, I repeat the CV multiple times (50). This runs quite fast with glmnet, but it takes much longer with tidymodels.
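To illustrate what I mean by an unstable estimate, here is a minimal sketch on the same dummy dataset used further down. It uses only 5 repeats instead of 50 to keep it quick; the names x, y and lambda_min are just for this illustration.

```r
library(glmnet)
data(BinomialExample)
set.seed(42)

x <- as.matrix(BinomialExample$x)
y <- BinomialExample$y

# Each call draws new random folds, so lambda.min can change between repeats
lambda_min <- replicate(5, {
  cv.glmnet(x, y, family = "binomial", alpha = 0.7,
            type.measure = "mse", nfolds = 10)$lambda.min
})

# Spread of the selected penalty across the repeats
summary(lambda_min)
```

Repeating the CV and then aggregating (for example over the CV error curves, as in my full example) is how I try to smooth this out.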
I suspect it has to do with how the grid search works: tidymodels tries all penalties in the grid, while glmnet uses a targeted search. See this excerpt from the cv.glmnet help page:
"glmnet chooses its own sequence. Note that this is done for the full model (master sequence), and separately for each fold. The fits are then aligned using the master sequence (see the alignment argument for additional details). Adapting lambda for each fold leads to better convergence."
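As a small sketch of that behavior (my reading of the help page, not a statement about the internals): I do not pass any lambda values, and cv.glmnet returns the sequence it evaluated. The alignment argument is set explicitly to "lambda" only to make the option visible.

```r
library(glmnet)
data(BinomialExample)
set.seed(42)

x <- as.matrix(BinomialExample$x)
y <- BinomialExample$y

# No lambda is supplied: glmnet derives the sequence from the data.
# alignment = "lambda" evaluates every fold at the master sequence.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.7,
                    type.measure = "mse", nfolds = 10,
                    alignment = "lambda")

length(cv_fit$lambda)  # number of penalties that were evaluated
range(cv_fit$lambda)   # span of the data-driven sequence
cv_fit$lambda.min      # penalty with the lowest mean CV error
```

So the penalties are concentrated in a data-dependent range instead of being spread over a fixed grid.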
Does anybody have an idea how to make tidymodels use the CV path of cv.glmnet? Or am I missing something? This might also result in a feature request for the package.
Side note: Both approaches result in similar hyperparameters and thus similar predictions, so it's not that different models are being fit.
Here is a reproducible example on a dummy dataset with multiple different tuning grids, to show what I mean:
library(glmnet)
library(tidymodels)
data(BinomialExample)
## glmnet
set.seed(42)
X_batch <- BinomialExample$x |>
  as.matrix()
y_batch <- BinomialExample$y
tictoc::tic("glmnet.cv:")
MSE <- NULL
for (j in 1:50) {
  # 10-fold CV over glmnet's own lambda sequence; keep the CV error curve of each run
  cv_fit <- glmnet::cv.glmnet(X_batch, y_batch, family = "binomial", alpha = 0.7,
                              type.measure = "mse", nfolds = 10)
  MSE <- cbind(MSE, cv_fit$cvm)
}
tictoc::toc() # 7.633 sec elapsed
## Tidymodels
set.seed(42)
data_train <- data.frame(cbind(BinomialExample$x, BinomialExample$y))
colnames(data_train) <- c(seq(1, 30), "y")
data_train$y <- factor(data_train$y)
# 10-fold CV repeated 10 times (100 resamples), versus 50 cv.glmnet runs above
cv_splits <- vfold_cv(data_train, v = 10, repeats = 10, strata = "y")
mod <- logistic_reg(
  mode = "classification",
  engine = "glmnet",
  penalty = tune(),
  mixture = 0.7)
rec <- recipe(y ~ ., data = data_train) |>
  step_normalize(all_numeric())
wfl <- workflow() |>
  add_recipe(rec) |>
  add_model(mod)
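# grid1/grid2: 50 penalty values; grid3/grid4: grid_regular() default of 3 levels
# penalty() defaults to a log10 range of c(-10, 0)
grid1 <- grid_regular(penalty(), levels = 50)
grid2 <- grid_regular(penalty(range = c(-5, 1), trans = log10_trans()), levels = 50)
grid3 <- grid_regular(penalty())
grid4 <- grid_regular(penalty(range = c(-5, 1), trans = log10_trans()))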
tictoc::tic("grid1:")
tune_results1 <- wfl |>
  tune_grid(resamples = cv_splits,
            grid = grid1,
            metrics = metric_set(accuracy, roc_auc))
tictoc::toc() # 57.014 sec elapsed
tictoc::tic("grid2:")
tune_results2 <- wfl |>
  tune_grid(resamples = cv_splits,
            grid = grid2,
            metrics = metric_set(accuracy, roc_auc))
tictoc::toc() # 58.22 sec elapsed
tictoc::tic("grid3:")
tune_results3 <- wfl |>
  tune_grid(resamples = cv_splits,
            grid = grid3,
            metrics = metric_set(accuracy, roc_auc))
tictoc::toc() # 46.867 sec elapsed
tictoc::tic("grid4:")
tune_results4 <- wfl |>
  tune_grid(resamples = cv_splits,
            grid = grid4,
            metrics = metric_set(accuracy, roc_auc))
tictoc::toc() # 46.201 sec elapsed
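
One workaround I could imagine, as a rough sketch only: take glmnet's lambda sequence from a fit on the full data and hand it to tune_grid() as the grid. This reuses the master sequence but not the per-fold adaptation described in the help page, and I have not checked whether it is any faster. The names lambda_path, cv_small and grid_glmnet are made up for this sketch, and I use fewer resamples and a plain formula instead of the recipe to keep it short.

```r
library(glmnet)
library(tidymodels)

data(BinomialExample)
set.seed(42)

x <- as.matrix(BinomialExample$x)
y <- BinomialExample$y

# glmnet's own lambda sequence from a full-data fit (the "master sequence")
lambda_path <- glmnet(x, y, family = "binomial", alpha = 0.7)$lambda

data_train <- data.frame(x)
data_train$y <- factor(y)

# Assumption: 5 folds, no repeats, only to keep the sketch quick
cv_small <- vfold_cv(data_train, v = 5, strata = y)

mod <- logistic_reg(penalty = tune(), mixture = 0.7) |>
  set_engine("glmnet")

wfl <- workflow() |>
  add_formula(y ~ .) |>
  add_model(mod)

# A data frame whose column is named after the tuned parameter works as a grid
grid_glmnet <- tibble(penalty = lambda_path)

res <- tune_grid(wfl, resamples = cv_small, grid = grid_glmnet,
                 metrics = metric_set(accuracy, roc_auc))
show_best(res, metric = "roc_auc", n = 3)
```

If something like this is the intended way, it would be good to know; otherwise a built-in option to follow the cv.glmnet path is what my feature request would be about.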