I am comparing algorithms on a classification task with a small dataset (<100 observations). The data are clustered by subject (>15 clusters), and the classes are imbalanced 70/30. Given these conditions, I am using repeated k-fold cross-validation.
Do you know if rsample (or another package) has an implementation that accounts for both clustering and class imbalance when stratifying the data? That is, a split into K folds where each fold keeps the 70/30 class ratio and each cluster appears in only one fold per repetition. Do you consider this "stratification + grouping" strategy valid?
Given the overlap between the training sets in k-fold CV and the correlation between observations within clusters, which metric and statistical estimator would you use to compare the algorithms?
To account for correlation between repeated measurements on the same subject, multilevelmod provides parsnip engines for a class of models designed for that.
Dear Hannah,
Thanks for the answer and the suggestions!
I have already looked at group_vfold_cv(), as you suggested, but my dataset does not satisfy that restriction. So I think I will have to write a function myself that does not have that restriction.
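For what it's worth, a hand-written splitter of this kind could look roughly like the sketch below. It is written in plain Python only to show the logic, which is language-agnostic; it is not rsample's implementation, and the function names, the toy data, and the greedy squared-error criterion are all my own assumptions. Each subject is assigned whole to a single fold, choosing the fold that keeps per-fold class counts closest to an even share. Because subjects cannot be split, the 70/30 ratio is only approximated.

```python
import random
from collections import Counter, defaultdict


def stratified_group_folds(groups, labels, k, seed=0):
    """Map each group to one of k folds, greedily balancing class counts.

    Hypothetical helper: a group may contain several classes.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))

    # Class counts per group, e.g. {"s1": Counter({"neg": 3, "pos": 2})}.
    per_group = defaultdict(Counter)
    for g, y in zip(groups, labels):
        per_group[g][y] += 1
    total = Counter(labels)

    # Shuffle so repetitions differ, then place the largest groups first
    # (the sort is stable, so ties keep their shuffled order).
    order = list(per_group)
    rng.shuffle(order)
    order.sort(key=lambda g: -sum(per_group[g].values()))

    fold_counts = [Counter() for _ in range(k)]
    fold_of = {}
    for g in order:
        def cost(f):
            # Squared gap between every fold's class counts and the ideal
            # share total/k, if group g were added to fold f.
            c = 0.0
            for j in range(k):
                for cls in classes:
                    extra = per_group[g][cls] if j == f else 0
                    c += (fold_counts[j][cls] + extra - total[cls] / k) ** 2
            return c

        best = min(range(k), key=cost)
        fold_of[g] = best
        fold_counts[best].update(per_group[g])
    return fold_of


def repeated_stratified_group_folds(groups, labels, k, repeats, seed=0):
    """One group-to-fold mapping per repetition, each with its own seed."""
    return [
        stratified_group_folds(groups, labels, k, seed=seed + r)
        for r in range(repeats)
    ]


# Toy data (made up): 16 subjects, 5 observations each, mixed classes.
groups = [f"s{i}" for i in range(16) for _ in range(5)]
labels = [
    "pos" if (i + j) % 10 < 3 else "neg"
    for i in range(16)
    for j in range(5)
]

reps = repeated_stratified_group_folds(groups, labels, k=4, repeats=3)
for r, fold_of in enumerate(reps):
    # Class counts per fold, to check how close each fold is to the target.
    per_fold = defaultdict(Counter)
    for g, y in zip(groups, labels):
        per_fold[fold_of[g]][y] += 1
    print(r, {f: dict(c) for f, c in sorted(per_fold.items())})
```

The printed per-fold class counts show how far each fold ends up from the overall ratio. With fewer than 100 observations the deviation can be noticeable, so it is worth inspecting before trusting the folds.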
multilevelmod relates to the type of algorithm (e.g. mixed-effects models). Do you also have a suggestion for a metric and statistical estimator that account for the overlap between the training sets and for the clustered data?