Is there a way to tune_grid() a glmnet without resamples?

pathos · September 13, 2021, 1:09pm

Below is my current code, where

enet is an elastic net model,
recip is a recipe,
pen_mix_grid holds different combinations of penalty and mixture values for glmnet, and
folds is a vfold_cv() resample

    outputt = tune_grid(enet,
                        preprocessor = recip,
                        grid = pen_mix_grid,
                        resamples = folds)

I would like it to try only the different penalty and mixture values for glmnet, without any resampling, so I tried setting resamples = NULL and removing the argument altogether, both without success. How can I make it ignore the resamples argument?

Max · September 13, 2021, 1:34pm

What data would you like to model and what data would you like to predict?

pathos · September 14, 2021, 8:41am

Thanks for the reply.

Data would be df, and the formula would be y ~ ., both of which are included in recip.

Max · September 14, 2021, 9:39am

I should rephrase: which rows of the data do you want for prediction and modeling?

tidymodels avoids predicting on data that was used to fit a model since it can easily lead to overfitting. Almost all of our resampling objects (which are required here) separate the two data sets.

pathos · September 14, 2021, 11:47am

Oh I see -- df is already the train data without test data (let's say df_train and df_test). I would like it to fit on all data (train + test, say df_all). Please let me know if that answers your question.

Sorry, I didn't quite understand the relevance of your questions at first, but I should have anticipated this. It makes a bit more sense now why a NULL resamples argument doesn't work.

Max · September 14, 2021, 1:38pm

Here are some options (clearly I prefer one):

    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from
    #>   required_pkgs.model_spec parsnip
    tidymodels_prefer()
    theme_set(theme_bw())

    # best: subsplit the training data to use a validation set
    set.seed(1)
    resample <- validation_split(mtcars)
    resample
    #> # Validation Set Split (0.75/0.25)
    #> # A tibble: 1 × 2
    #>   splits         id
    #>   <list>         <chr>
    #> 1 <split [24/8]> validation

    # not great: put the training set in both training and test.
    # we'll do this via bootstrapping with an extra option
    set.seed(1)
    resample <-
      bootstraps(mtcars, times = 1, apparent = TRUE) %>%
      filter(id == "Apparent")
    resample
    #> # A tibble: 1 × 2
    #>   splits          id
    #>   <list>          <chr>
    #> 1 <split [32/32]> Apparent

    res <-
      linear_reg() %>%
      # Same code works for tune_grid
      fit_resamples(mpg ~ ., resamples = apparent(mtcars))

Created on 2021-09-14 by the reprex package (v2.0.0)

pathos · September 14, 2021, 6:17pm

Hmm I'm having trouble understanding this.

So let's say glmnet goes into this. Where would penalty and mixture be brute forced?

Max · September 14, 2021, 9:25pm

If you want to tune them give them a value of tune() and use tune_grid() as before.
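
A minimal sketch of what that could look like with the validation-set option, assuming the glmnet engine is installed. Here mtcars stands in for the training data, and val_set and the 3 x 3 regular grid are illustrative stand-ins rather than anything from the original post. Note that newer rsample releases deprecate validation_split() in favor of other helpers, so a warning may appear.

```r
library(tidymodels)

# Assumption: mtcars stands in for the training data, as in the reprex above
set.seed(1)
val_set <- validation_split(mtcars)

# Mark both glmnet hyperparameters for tuning
enet <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Illustrative grid; the pen_mix_grid from the original post would go here
pen_mix_grid <- grid_regular(penalty(), mixture(), levels = 3)

res <- tune_grid(
  enet,
  preprocessor = recipe(mpg ~ ., data = mtcars),
  resamples = val_set,
  grid = pen_mix_grid
)

# One row per metric and penalty/mixture combination
collect_metrics(res)
```

Swapping val_set for apparent(mtcars) would give the "fit and predict on the same rows" variant, with the optimistic performance estimates that implies.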

pathos · September 17, 2021, 1:42pm

Just for future reference --

I'm just discovering that when using vfold_cv() or sliding_window() etc. instead of bootstraps() with the same code, the output is model weights, not covariate weights. I guess only some resampling methods, or only the bootstrap, return covariate weights. Bit of a bummer, but oh well, bootstrapping it is.

Thanks a million

UglyDuckling · September 17, 2021, 3:36pm

If you want the coefficients from the fitted models, you can get those by extracting the models, i.e. something like control = control_grid(extract = function(x) x) in tune_grid() (if this gets too big, I think you could butcher what you return further to only keep the coefficients). Then you can look at tune_results_object$.extracts and process that further. Or is that not what you meant?
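
A sketch of that idea which keeps only the coefficients instead of the whole fitted object. It assumes the glmnet engine and that the extraction function receives a fitted workflow (which is the case when tune_grid() is given a model spec plus a preprocessor). The helper name get_glmnet_coefs, the mtcars data, and the tiny grid are illustrative.

```r
library(tidymodels)

# x is the fitted workflow for one resample and one tuning combination;
# pull out the underlying glmnet object and tidy its coefficients
get_glmnet_coefs <- function(x) {
  x %>%
    extract_fit_engine() %>%
    tidy(return_zeros = TRUE) %>%
    rename(penalty = lambda)
}

ctrl <- control_grid(extract = get_glmnet_coefs)

enet <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

set.seed(1)
res <- tune_grid(
  enet,
  preprocessor = recipe(mpg ~ ., data = mtcars),
  resamples = bootstraps(mtcars, times = 2),
  grid = grid_regular(penalty(), mixture(), levels = 2),
  control = ctrl
)

# One element per resample; each holds the extracted coefficient tibbles
res$.extracts[[1]]
```

The same control object should work with any resampling scheme, since the extraction runs on each fitted model regardless of how the resamples were built.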

Max · September 17, 2021, 4:38pm

We just wrote an article about this.

pathos · September 21, 2021, 5:38pm

In the article tidymodels - Working with model coefficients:

But wait! We know that each glmnet fit contains all of the coefficients. This means, for a specific resample and value of mixture, the results are the same:

    all.equal(
      # First bootstrap, first `mixture`, first `penalty`
      glmnet_res$.extracts[[1]]$.extracts[[1]],
      # First bootstrap, first `mixture`, second `penalty`
      glmnet_res$.extracts[[1]]$.extracts[[2]]
    )
    #> [1] TRUE

I've been trying to figure out the reason for the lack of change with different values of penalty as mentioned above, but I haven't had much luck. Why would different values of penalty result in the same coefficients? In the last graph, it seems to show different values for different penalties: https://www.tidymodels.org/learn/models/coefficients/figs/glmnet-plot-1.svg

Max · September 21, 2021, 6:59pm

glmnet models produce coefficients for all values of the penalty for each model fit. The infrastructure in tidymodels gives a row for each penalty but the tidy method produces coefficients for all of them, so there will be replicate values. Take a look at this document for more details.
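
This can be seen with glmnet directly, outside tidymodels. A minimal sketch, using mtcars as a stand-in dataset and an arbitrary mixture (alpha) of 0.5:

```r
library(glmnet)

x <- as.matrix(mtcars[, -1])  # predictors (everything except mpg)
y <- mtcars$mpg               # outcome

# A single fit computes the whole regularization path for this alpha
fit <- glmnet(x, y, alpha = 0.5)

length(fit$lambda)  # number of penalty values on the path
dim(coef(fit))      # (intercept + predictors) x number of penalty values

# Coefficients at any single penalty are read off the same fit
coef(fit, s = 0.1)
```

Because every penalty value is served by the same fitted object, extracting that object for two different penalty rows (same resample, same mixture) gives identical results, while the coefficients selected at each penalty still differ.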
