Dear R Studio Community,
I hope that I can inquire your expertise regarding a prediction task in R/Tidymodels. I intend to predict injuries in runners. The daily/weekly training data, on which the predictions are based on, is thereby nested in the individual runners over a timeframe of a few months. This made me consider multilevel models - multilevel binary logistic regression (MLBLR) specifically.
As the data is also very imbalanced I further tried to engage in resampling via SMOTE. Because half the runners did not incur injuries and the other half mostly only one I am additionally uncertain of the success of this undertaking, as there will be none or only one injury instance per runner within the training set to base the resampling on, and consequently no injury instance in the test set for runners with observed injuries within the testing set. This makes the SMOTE resampling most likely not possible.
So far I tried to manually predict injuries via a MLBLR without resampling and by only adapting the prediction probability threshold, with the outcome of only negative predictions because of the unbalanced nature. Understandably, I did not manage to resample via SMOTE in this scenario, should I rather look at other methods like e.g., undersampling non-injury instances or are there any specific resampling procedures (preferably synthetic data creation) for multilevel data, taking the nested structure into account?
I further tried to implement multilevel modelling in the preferred Tidymodel workflow, as resampling is also made easy there. Thereby, I looked firstly at the "multilevelmod" package which induces multilevel engines (lme4) to the workflow. Secondly, I tried to make use of the many models structure by nesting by each runner and then applying models to it. Unfortunately, I only did get the latter method working. Former, I used most likely incorrectly "stan-glmer" as an engine (Code 1), latter I made working with mixed results via simple oversampling (Screenshot 2 - ). Thirdly, I am not sure whether to additionally look at fitting generalized linear models using mixed models via the embed package in Tidymodels.
I would be very grateful to hear your take on this, specifically how to approach this issue of implementing a multilevel model + resampling in the Tidymodels workflow. Thank you very much in advance.
Kind regards!
Multillevelmod: https://github.com/tidymodels/multilevelmod
Many Models: ttps://r4ds.had.co.nz/many-models.html
Embed: ttps://embed.tidymodels.org/articles/Applications/GLM.html
Code 1:
mlbr_mod <- linear_reg() %>% set_engine("stan-glmer")
# Recipe:
mlbr_mod_recipe <- recipe(NewRRI ~Distance + HR + Gender + Age + BMI +
PreviousRRI + Runner, data = RunningData_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_string2factor(Runner) %>%
step_smote(NewRRI, over_ratio =0.5)
mlbr_mod_workflow <- workflow() %>% add_recipe(mlbr_mod_recipe) %>%
add_model(mlbr_mod, formula = NewRRI ~ . -Runner + (1|Runner))
# Fit the model:
mlbr_mod_workflow %>% fit(data = RunningData_train)
# Train on original set and test on test set using last_fit()
mlbr_last_fit <- mlbr_mod_workflow %>% last_fit(RunningData_splits,
metrics = metric_set(bal_accuracy, accuracy, f_meas, precision,
roc_auc,sensitivity, recall, kap))
# Performance on test set:
mlbr_metrics <- mlbr_last_fit %>% collect_metrics()
mlbr_metrics
The code fails at the step where I try to fit the model. There it gives the error message that it can't subset columns that don't exist - X Column 'Patient' doesn't exist. The input data is structured like this:
Runner - factor: A1, A1, A1, B1, B1, B1, C1, C1, C1 ... (=IDs)
NewRRI - factor: 0, 0, 1, 0, 0, 0, 0, 0, 0 ...
Distance - numeric:340,500,734,110,389,766,833,420,1100 ...
HR - numeric: 120,110,130,142,98, 112,104,117,130 ...
Gender - factor: Male,Female,Male,Male,Male,Female,Male,Female,Female, ...
Age - numeric: 23, 36, 56, 35, 67, 24, 52, 39, 29, ...
BMI - numeric: 18, 20, 21, 25, 23, 24, 21, 22, 20, ...
PreviousRRI -factor:0, 0, 1, 0, 0, 1, 1, 0, 0, ...