Problem with using lmer model in fit_resamples()

ildifa · May 23, 2022, 1:25pm

I am trying to use the tidymodels package to build a linear mixed model. It looks like I'm specifying the formula in the correct way, as I can run a fit() on the workflow.
However, when I try to run it on resamples of my data, using the function fit_resamples(), I get an error relative to missing factor levels.
It is not clear to me if I'm doing something wrong, or if the packages "multilevelmod" and "tune" are not compatible in this way, and so I would greatly appreciate any advice.
I have included a reprex using the "mpg" dataset.

packages = c(
  "tidyverse",
  "tidymodels",
  "multilevelmod"
)

lapply(packages, library, character.only = TRUE)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
#> Warning: package 'parsnip' was built under R version 4.1.3
#> Warning: package 'multilevelmod' was built under R version 4.1.3
#> [[1]]
#>  [1] "forcats"   "stringr"   "dplyr"     "purrr"     "readr"     "tidyr"    
#>  [7] "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics"  "grDevices"
#> [13] "utils"     "datasets"  "methods"   "base"     
#> 
#> [[2]]
#>  [1] "yardstick"    "workflowsets" "workflows"    "tune"         "rsample"     
#>  [6] "recipes"      "parsnip"      "modeldata"    "infer"        "dials"       
#> [11] "scales"       "broom"        "tidymodels"   "forcats"      "stringr"     
#> [16] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
#> [21] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
#> [26] "utils"        "datasets"     "methods"      "base"        
#> 
#> [[3]]
#>  [1] "multilevelmod" "yardstick"     "workflowsets"  "workflows"    
#>  [5] "tune"          "rsample"       "recipes"       "parsnip"      
#>  [9] "modeldata"     "infer"         "dials"         "scales"       
#> [13] "broom"         "tidymodels"    "forcats"       "stringr"      
#> [17] "dplyr"         "purrr"         "readr"         "tidyr"        
#> [21] "tibble"        "ggplot2"       "tidyverse"     "stats"        
#> [25] "graphics"      "grDevices"     "utils"         "datasets"     
#> [29] "methods"       "base"

data(mpg, package = "ggplot2")

training = mpg %>%
  initial_split() %>%
  training()

training_folds = training %>%
  validation_split()

lmm_model = linear_reg() %>% 
  set_engine("lmer")

lmm_workflow = workflow() %>% 
  add_variables(outcomes = cty,
                predictors = c(year, manufacturer, model)) %>% 
  add_model(lmm_model, formula = cty ~ year + (1|manufacturer/model))

# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Variables
#> Model: linear_reg()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> Outcomes: cty
#> Predictors: c(year, manufacturer, model)
#> 
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#>    Data: data
#> REML criterion at convergence: 831.6647
#> Random effects:
#>  Groups             Name        Std.Dev.
#>  model:manufacturer (Intercept) 2.913   
#>  manufacturer       (Intercept) 2.328   
#>  Residual                       1.936   
#> Number of obs: 175, groups:  model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept)         year  
#>   -64.37893      0.04037

# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error: Assigned data `factor(...
#> Warning: All models failed. See the `.notes` column.
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> preprocessor 1/1, model 1/1 (predictions): Error: Assigned data `factor(lvl[1], levels = lvl)` must be compatible with existing data.
#> x Existing data has 44 rows.
#> x Assigned data has 0 rows.
#> i Only vectors of size 1 are recycled.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)  
#> # A tibble: 1 x 4
#>   splits           id         .metrics .notes          
#>   <list>           <chr>      <list>   <list>          
#> 1 <split [131/44]> validation <NULL>   <tibble [1 x 1]>

^{Created on 2022-05-23 by the reprex package (v2.0.1)}

Max · May 23, 2022, 9:25pm

I suggest converting the character columns to factor before running the code. If is loosing levels and, when it makes them factors, it can't find some levels.

ildifa · May 25, 2022, 3:44pm

Thank you for your suggestion!
However, even when mutating those column to factors I get the same error. It looks like the problem is that there are different combinations of factors in the analysis and assessment sets.
I have found that by using the allow.new.levels = TRUE argument, it is possible to use the trained model to predict() even on the missing combinations, but I cannot find a way to specify this parameter in the recipe or in the workflow.

library(tidyverse)
library(tidymodels)
library(multilevelmod)

set.seed(1243)

data(mpg, package = "ggplot2")

training = mpg %>%
  initial_split() %>%
  training() %>% 
  mutate(manufacturer = manufacturer %>% as_factor(),
         model = model %>% as_factor())

training_folds = training %>%
  validation_split()

lmm_model = linear_reg() %>% 
  set_engine("lmer")

lmm_recipe = recipe(cty ~ year + manufacturer + model, data = training) %>%
  step_novel(manufacturer, model)

lmm_formula = cty ~ year + (1|manufacturer/model)

lmm_workflow = workflow() %>% 
  # see: https://stackoverflow.com/questions/68183077/tidymodels-novel-levels-found-in-column
  add_recipe(lmm_recipe, blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>% 
  add_model(lmm_model, formula = lmm_formula)

# A simple fit works
fit(lmm_workflow, training)
#> == Workflow [trained] ==========================================================
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> -- Preprocessor ----------------------------------------------------------------
#> 1 Recipe Step
#> 
#> * step_novel()
#> 
#> -- Model -----------------------------------------------------------------------
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: cty ~ year + (1 | manufacturer/model)
#>    Data: data
#> REML criterion at convergence: 864.9986
#> Random effects:
#>  Groups             Name        Std.Dev.
#>  model:manufacturer (Intercept) 2.881   
#>  manufacturer       (Intercept) 2.675   
#>  Residual                       2.181   
#> Number of obs: 175, groups:  model:manufacturer, 38; manufacturer, 15
#> Fixed Effects:
#> (Intercept)         year  
#>    -8.06202      0.01228

# A fit with resamplings doesn't work:
fit_resamples(lmm_workflow, resamples = training_folds)
#> Warning: package 'lme4' was built under R version 4.1.1
#> x validation: preprocessor 1/1, model 1/1 (predictions): Error:
#> ! Assigned data `facto...
#> Warning: All models failed. See the `.notes` column.
#> # Resampling results
#> # Validation Set Split (0.75/0.25)  
#> # A tibble: 1 x 4
#>   splits           id         .metrics .notes          
#>   <list>           <chr>      <list>   <list>          
#> 1 <split [131/44]> validation <NULL>   <tibble [1 x 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Error(s) x1:  ! Assigned data `factor(lvl[1], levels = lvl)` must be compatibl...
#> 
#> Use `collect_notes(object)` for more information.

# It seems that the problem is that the combinations of factor levels differ
# between analysis and assessment set
analysis_set = analysis(training_folds$splits[[1]])
assessment_set = assessment(training_folds$splits[[1]])

identical(
  analysis_set %>% distinct(manufacturer, model),
  assessment_set %>% distinct(manufacturer, model)
)
#> [1] FALSE

# directly fitting the model on the analysis set
analysis_fit = lmer(formula = lmm_formula, data = analysis_set)

# predicting the values for the missing combinations of levels is actually possible
# see: https://stackoverflow.com/questions/29259750/prediction-with-lme4-on-new-levels
assessment_predict = analysis_fit %>%
  predict(training_folds$splits[[1]] %>% assessment(),
          allow.new.levels = TRUE)

^{Created on 2022-05-25 by the reprex package (v2.0.1)}

system · June 15, 2022, 3:45pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.