Logistic regression model gives different results while tuning vs single model for LOOCV

I’ve made a small reproducible example using a data frame with a grouping variable and 3 features. I have code that fits a single logistic regression model and code that I’ve used for tuning. For simplicity, the example only uses penalty = 0.001 and mixture = 0, and compares the accuracy, ROC AUC, and class predictions. I understand that LOOCV with a pathologically small sample size is not ideal.

A summary of the results is below. I believe the results should be identical, and I’m wondering if I am doing something wrong or not fully understanding what’s happening.

Accuracy:
• single model = 0.52
• tuned model = 0.52

LOOCV AUC:
• single model = 0.532
• tuned model = 0.538

.pred_group2 for .row 1:
• single model = 0.331
• tuned model = 0.324

# ---- Load libraries and prepare data -----------------------------------------

library(tidymodels)

set.seed(1)

df <- data.frame(
  group = c('group1', 'group1', 'group2', 'group2', 'group1', 'group2', 
            'group2', 'group2', 'group2', 'group1', 'group1', 'group2', 
            'group2', 'group1', 'group1', 'group1', 'group1', 'group1', 
            'group1', 'group1', 'group2', 'group2', 'group2', 'group2', 
            'group1'),
  var1 = c(7.3333333, 3.0000000, 1.6666667, 0.6666667, 0.6666667, 1.0000000,
           0.0000000, 2.0000000, 1.0000000, 2.6666667, 1.3333333, 1.0000000,
           0.6666667, 1.0000000, 1.6666667, 5.6666667, 0.3333333, 3.3333333,
           4.3333333, 1.6666667, 0.6666667, 12.6666667, 1.0000000, 3.0000000, 
           1.0000000),
  var2 = c(6.0, 2.5, 0.0, 0.0, 0.0, 1.0, 0.5, 0.5, 0.0, 3.5, 0.0, 0.0, 0.5, 0.5,
           0.5, 2.0, 0.0, 0.0, 0.5, 0.0, 0.5, 7.5, 0.0, 3.5, 1.0),
  var3 = c(4.5833333, 3.7500000, 0.5000000, 1.0000000, 0.5000000, 0.5000000,
           1.0000000, 1.6666667, 0.8333333, 1.5000000, 1.5000000, 0.5000000,
           0.2500000, 1.3333333, 0.5000000, 3.2500000, 0.0000000, 0.7500000, 
           2.5000000, 1.0000000, 0.7500000, 3.5000000, 0.2500000, 1.0000000,
           0.0000000)
)

# ---- Single model ------------------------------------------------------------

# Prepare the recipe
rec <- recipe(group ~ . , 
              data = df) 

# Define the logistic regression model
EN_mod <- logistic_reg(penalty = 0.001, mixture = 0) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")

# Create a workflow which contains the recipe and the model
ckd_wflow <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(EN_mod)

# Fit the workflow
ckd_fit <- 
  ckd_wflow %>% 
  fit(data = df)

# View the model
ckd_fit %>%
  extract_fit_parsnip() %>%
  tidy()

# Create LOOCV folds
folds <- vfold_cv(df, v = nrow(df))

# Function to get model coefficients
get_glmnet_coefs <- function(x) {
  x %>%
    extract_fit_engine() %>%
    tidy(return_zeros = TRUE) %>%
    rename(penalty = lambda)
}

# Perform resampling workflow.
resamp <- 
  ckd_wflow %>% 
  fit_resamples(
    resamples = folds, 
    control = control_resamples(save_pred = TRUE,
                                extract = get_glmnet_coefs))

# Collect metrics for the resampling object. It only shows accuracy since it's 
# LOOCV; we have to extract the predictions for each fold and combine them 
# ourselves to get a ROC AUC.
collect_metrics(resamp)

# Extract predictions from resampling object
preds <- bind_rows(resamp$.predictions)

# View preds for .row = 1 for comparison to tuning model
preds %>%
  filter(.row == 1)

# Generate ROC curve and print AUC for resampling
autoplot(roc_curve(data = preds, truth = group, .pred_group2))
print(roc_auc(data = preds, truth = group, .pred_group2))


# ---- Tune --------------------------------------------------------------------

# Define the penalty values to test
pen_vals <- c(0.001)

# Define the grid of penalty, and mixtures values to test
grid <- crossing(penalty = pen_vals,
                 mixture = c(0))

# Define recipe for tuning 
rec_tune <- recipe(group ~ . ,
                   data = df)

# Define elastic net model for tuning 
EN_mod_tune <- 
  logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet", path_values = pen_vals) %>% 
  set_mode("classification")

# Define workflow for tuning
ckd_wflow_tune <-
  ckd_wflow %>% 
  update_model(EN_mod_tune) %>%
  update_recipe(rec_tune)

# Define control grid to extract coefficients and save predictions
parsnip_ctrl <- control_grid(extract = get_glmnet_coefs,
                             save_pred = TRUE)

# Perform tuning with loocv
glmnet_res <- 
  ckd_wflow_tune %>% 
  tune_grid(
    resamples = folds,
    grid = grid,
    control = parsnip_ctrl
  )

# Get accuracy
glmnet_res %>%
  collect_metrics() %>%
  filter(.metric == 'accuracy')

# Get predictions for .row = 1 
glmnet_res %>%
  collect_predictions() %>%
  filter(.row == 1)

# Get loocv AUC
glmnet_res %>%
  collect_predictions() %>%
  dplyr::group_by(mixture, penalty, .config) %>%
  summarise(loocv_auc = roc_auc_vec(truth = group, estimate = .pred_group2))

# Print model estimates for Fold25 which is for .row == 1
glmnet_res %>%
  unnest(.extracts) %>%
  rename(penalty1 = penalty) %>%
  unnest(.extracts) %>%
  select(id, term, estimate) %>%
  filter(id == 'Fold25')

# For the single model I don't know how to identify which fold held out .row == 1
# (one possible way to look this up is sketched after this block), but none of the
# intercepts equal 0.538, which is what it is for Fold25 of the tuned model,
# suggesting the underlying LOO-sample models are different
resamp %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  select(id, term, estimate) %>%
  filter(term == '(Intercept)',
         estimate == 0.568)
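
One way that might work to look up which fold held out .row == 1 (a sketch only; with save_pred = TRUE, collect_predictions() keeps the fold id alongside .row, and the column selection here is just illustrative):

# Sketch: map a held-out row back to its fold via the saved predictions
collect_predictions(resamp) %>%
  filter(.row == 1) %>%
  select(id, .row, .pred_group2)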

I think that there are a few things going on here.

First, if you want to get equal models, make sure to set the path_values argument to the same sequence of values (and not a single value).
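
For example, something like this (a sketch only; the lambda_path name and values are just placeholders, the point is that both the single fit and the tuned workflow share the same path):

# Sketch: share one lambda path between the single model and the tuning model
lambda_path <- 10^seq(-3, 0, length.out = 20)

EN_mod <- logistic_reg(penalty = 0.001, mixture = 0) %>% 
  set_engine("glmnet", path_values = lambda_path) %>% 
  set_mode("classification")

EN_mod_tune <- logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet", path_values = lambda_path) %>% 
  set_mode("classification")

That way glmnet fits the same regularization path in both cases, so the penalty = 0.001 solution should come from the same fitted path.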

Second, we don't really support LOOCV in any format and we should not allow

 vfold_cv(df,  v = nrow(df))

to work.

For LOOCV, you get n separate predictions and compute one performance measure from them. With the vfold_cv() code above, it is computing n performance measures, each based on n = 1 data points. These are not the same thing, and we should throw an error if v is too large (relative to the number of rows in the training set).
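
To make that distinction concrete, here is a sketch using the resampling object from the question (pooled is just an illustrative name):

# Sketch: pooled vs. averaged performance, using the resamp object from above
pooled <- collect_predictions(resamp)

# One accuracy computed from all n held-out predictions (the LOOCV-style estimate)
accuracy(pooled, truth = group, estimate = .pred_class)

# n accuracies, one per single-row fold, then averaged -- this is what
# collect_metrics() reports; for accuracy the two happen to agree, but a metric
# like ROC AUC cannot be computed on a single held-out row at all
pooled %>%
  group_by(id) %>%
  accuracy(truth = group, estimate = .pred_class) %>%
  summarise(mean_acc = mean(.estimate))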
