grid_max_entropy() not resulting in desired grid size

Hi everyone.

Below is my model specification for hyperparameter tuning:

set.seed(13)
options(scipen=999)
 
library(tidymodels)
 
library(tune)
library(finetune)
 
#multivariate adaptive regression splines (MARS)

#model specification
 
spec_tuned_mars <- mars(prune_method = tune(),
                        num_terms = tune(),
                        prod_degree = tune()) %>%
  set_engine("earth") %>%
  set_mode("regression")

When I set up the hyperparameter grid with grid_max_entropy(), it does not produce the grid size I expect:

> #search grid
> 
> grid_mars <- grid_max_entropy(hardhat::extract_parameter_set_dials(spec_tuned_mars), size = 48)
> 
> dim(grid_mars)
> [1] 32  3
> 
>  grid_mars <- grid_max_entropy(hardhat::extract_parameter_set_dials(spec_tuned_mars), size = 100)
>  
>  dim(grid_mars)
> [1] 43  3
> 
> grid_mars <- grid_max_entropy(hardhat::extract_parameter_set_dials(spec_tuned_mars), size = 200)
> 
> dim(grid_mars)
> [1] 47  3
> grid_mars <- grid_max_entropy(hardhat::extract_parameter_set_dials(spec_tuned_mars), size = 220)
> 
> dim(grid_mars)
> [1] 46  3
> 
> grid_mars <- grid_max_entropy(hardhat::extract_parameter_set_dials(spec_tuned_mars), size = 300)
>
> dim(grid_mars)
> [1] 48  3

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /software/spackages_test/apps/linux-ubuntu20.04-zen2/gcc-10.3.0/r-4.1.3-jzktkndirbjgryr6s3rrvdpgqlnnm65t/rlib/R/lib/libRblas.so
LAPACK: /software/spackages_test/apps/linux-ubuntu20.04-zen2/gcc-10.3.0/r-4.1.3-jzktkndirbjgryr6s3rrvdpgqlnnm65t/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qs_0.25.5          butcher_0.3.3      finetune_1.1.0     yardstick_1.1.0    workflowsets_1.0.0 workflows_1.1.0    tune_1.1.2         tidyr_1.3.0       
 [9] tibble_3.2.1       rsample_1.2.0      recipes_1.0.7      purrr_1.0.2        parsnip_1.1.1      modeldata_1.0.1    infer_1.0.3        ggplot2_3.4.3     
[17] dplyr_1.1.2        dials_1.1.0        scales_1.2.1       broom_1.0.5        tidymodels_1.0.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11         lubridate_1.9.2     lattice_0.20-45     listenv_0.9.0       class_7.3-20        digest_0.6.33       ipred_0.9-14        foreach_1.5.2      
 [9] utf8_1.2.3          parallelly_1.36.0   R6_2.5.1            backports_1.4.1     hardhat_1.3.0       pillar_1.9.0        rlang_1.1.1         rstudioapi_0.14    
[17] data.table_1.14.8   DiceDesign_1.9      furrr_0.3.1         rpart_4.1.16        Matrix_1.6-1        splines_4.1.3       gower_1.0.1         munsell_0.5.0      
[25] compiler_4.1.3      pkgconfig_2.0.3     globals_0.16.2      nnet_7.3-18         tidyselect_1.2.0    prodlim_2023.03.31  codetools_0.2-18    GPfit_1.0-8        
[33] fansi_1.0.4         future_1.33.0       withr_2.5.1         MASS_7.3-58.1       grid_4.1.3          gtable_0.3.4        lifecycle_1.0.3     magrittr_2.0.3     
[41] RcppParallel_5.1.7  future.apply_1.11.0 cli_3.6.1           timeDate_4022.108   lhs_1.1.5           generics_0.1.3      vctrs_0.6.3         stringfish_0.15.8  
[49] RApiSerialize_0.1.2 lava_1.7.2.1        iterators_1.0.14    tools_4.1.3         glue_1.6.2          parallel_4.1.3      survival_3.4-0      timechange_0.2.0   
[57] colorspace_2.1-0

None of the tuning parameters are real-valued; each one is either an integer or a character value. The range of num_terms depends on the number of columns in the data.

This means that there is a finite number of possible tuning parameter combinations that you can create.

Let's look at the default values:

library(tidymodels)

spec_tuned_mars <- mars(prune_method = tune(),
                        num_terms = tune(),
                        prod_degree = tune()) %>%
  set_engine("earth") %>%
  set_mode("regression")

param_info <- 
  spec_tuned_mars %>% 
  extract_parameter_set_dials()

# num_terms: **default** range is 4 possible values
param_info$object[[1]]
#> # Model Terms (quantitative)
#> Range: [2, 5]

# prod_degree: only two possible values
param_info$object[[2]]
#> Degree of Interaction (quantitative)
#> Range: [1, 2]

# prune_method: 6 possible values
param_info$object[[3]]
#> Pruning Method  (qualitative)
#> 6 possible values include:
#> 'backward', 'none', 'exhaustive', 'forward', 'seqrep' and 'cv'

Created on 2023-11-09 with reprex v2.0.2

So there are 4 * 6 * 2 = 48 possible combinations, and that is your maximum grid size.
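As a quick check of that arithmetic, here is a minimal base R sketch that enumerates every combination. The values are copied by hand from the printed ranges above rather than pulled from the parameter objects, so they would need adjusting if your ranges differ.

```r
# Values copied from the printed parameter ranges above.
num_terms_values    <- 2:5
prod_degree_values  <- 1:2
prune_method_values <- c("backward", "none", "exhaustive",
                         "forward", "seqrep", "cv")

# Every distinct combination of the three tuning parameters.
full_grid <- expand.grid(
  num_terms    = num_terms_values,
  prod_degree  = prod_degree_values,
  prune_method = prune_method_values,
  stringsAsFactors = FALSE
)

nrow(full_grid)
#> [1] 48
```

If you want every combination evaluated, a data frame like this could presumably be supplied as the grid instead of a space-filling design.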

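The current space-filling designs use some randomness, and with so few possible values it is certainly possible to get duplicate grid points. Duplicate rows are not kept, which is why your grids have fewer rows than the requested size and never more than 48.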
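To see how duplicates shrink a grid, here is a rough base R sketch. It is not the maximum entropy algorithm: it simply draws candidates at random with replacement and counts the distinct rows, as a stand-in for any randomized design over a small discrete space. The helper name unique_rows_after_sampling() is made up for this illustration.

```r
set.seed(13)

# All 48 distinct candidates (values taken from the ranges printed above).
candidates <- expand.grid(
  num_terms    = 2:5,
  prod_degree  = 1:2,
  prune_method = c("backward", "none", "exhaustive",
                   "forward", "seqrep", "cv"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw `size` candidates at random, with replacement,
# then count how many distinct rows remain.
unique_rows_after_sampling <- function(size) {
  picked <- candidates[sample(nrow(candidates), size, replace = TRUE), ]
  nrow(unique(picked))
}

sizes  <- c(48, 100, 200, 300)
counts <- sapply(sizes, unique_rows_after_sampling)

data.frame(requested = sizes, distinct = counts)
```

Whatever size is requested, the distinct count cannot exceed 48, and it only approaches 48 as the requested size grows.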

If your data set has a lot of columns, you can widen the default range for num_terms by calling update() on the parameter set:

param_info_wide <- 
  param_info %>% 
  update(num_terms = num_terms(c(2, 100)))
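One way to check the effect of a wider range is sketched below. It assumes the range is changed on the parameter set returned by extract_parameter_set_dials() and that the result is then passed to grid_max_entropy(); the name param_info_wide and the upper bound of 100 are only illustrative.

```r
library(tidymodels)

spec_tuned_mars <- mars(prune_method = tune(),
                        num_terms = tune(),
                        prod_degree = tune()) %>%
  set_engine("earth") %>%
  set_mode("regression")

# Widen num_terms on the parameter set (the upper bound is illustrative).
param_info_wide <-
  spec_tuned_mars %>%
  extract_parameter_set_dials() %>%
  update(num_terms = num_terms(c(2, 100)))

set.seed(13)
grid_wide <- grid_max_entropy(param_info_wide, size = 48)

# With 99 * 2 * 6 = 1188 candidate combinations, duplicates should be
# much less likely than with the default 48.
nrow(grid_wide)
```

The row count can still fall slightly short of the requested size if duplicates occur, but a much larger candidate space should make that rare.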