Cross-validation in clustered and imbalanced data: folds and estimators

Dear ML community,

I am running an algorithm comparison for a classification task on a small dataset (<100 observations). The data are clustered by subject (>15 clusters), and the classes are imbalanced 70/30. Given these conditions, I am using a repeated k-fold cross-validation approach.

  1. Do you know whether rsample (or another package) has an implementation that accounts for clustered data and class imbalance when stratifying? That is, split the data into k folds so that each fold keeps the 70/30 class ratio, while each cluster appears in only one fold per repetition. Do you consider this "stratification + grouping" strategy valid?
  2. Given the overlap between training sets in k-fold CV and the correlation between observations within clusters, which metric and statistical estimator would you use for the algorithm comparison?

Thanks in advance!

rsample can do stratified grouped resampling, e.g., via group_vfold_cv(); note its repeats argument for repeated CV. However, strata need to be constant within each group, so you'd need to check whether that restriction applies in your case.
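
For example, a minimal sketch, assuming a hypothetical data frame `df` with a subject id column `subject` and an outcome factor `class` that is constant within each subject:

```r
# Grouped, stratified, repeated CV with rsample.
# `df`, `subject`, and `class` are illustrative names, not from this thread.
library(rsample)

set.seed(123)
folds <- group_vfold_cv(
  df,
  group   = subject,  # all rows from a subject land in the same fold
  v       = 5,        # number of folds
  repeats = 10,       # repeated grouped CV
  strata  = class     # keep the 70/30 class ratio roughly constant per fold
)
folds
```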

Regarding the correlation between repeated measurements on the same subject, multilevelmod provides parsnip engines for a class of models that account for it.
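
For instance, a minimal sketch of a mixed-effects logistic regression with a random intercept per subject, using the glmer engine that multilevelmod registers (the data frame `df` and the predictors `x1` and `x2` are assumptions):

```r
# Mixed-effects logistic regression via parsnip + multilevelmod.
# `df`, `x1`, `x2`, `class`, and `subject` are illustrative names.
library(parsnip)
library(multilevelmod)

mixed_spec <- logistic_reg() |>
  set_engine("glmer")  # lme4::glmer under the hood

mixed_fit <- mixed_spec |>
  fit(class ~ x1 + x2 + (1 | subject), data = df)
```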

Dear Hannah,
Thanks for the answer and the suggestions!
I have already looked at group_vfold_cv(), as you said, but my dataset does not satisfy that restriction (my strata are not constant within each subject). So I thought I would have to write such a function myself, without that restriction; a rough sketch of what I mean is below.
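
For illustration, one way to hand-roll such folds: sort subjects by their within-subject positive-class rate and deal them round-robin into folds, so whole clusters stay together while each fold gets a similar class mix. All names (`df`, `subject`, `class`, `k`) are illustrative assumptions.

```r
# Grouped folds with approximate stratification when the class varies
# within a subject. Returns a fold id per row of `df`.
make_group_strat_folds <- function(df, group_col, class_col, k = 5, seed = 1) {
  set.seed(seed)
  # proportion of the second factor level (the "positive" class) per subject
  y <- df[[class_col]] == levels(factor(df[[class_col]]))[2]
  pos_rate <- tapply(y, df[[group_col]], mean)
  # shuffle to break ties, then sort subjects by their class rate
  subj <- sample(names(pos_rate))
  subj <- subj[order(pos_rate[subj])]
  # deal the sorted subjects round-robin into k folds
  fold_of_subject <- setNames(rep_len(seq_len(k), length(subj)), subj)
  unname(fold_of_subject[as.character(df[[group_col]])])
}

# df$fold <- make_group_strat_folds(df, "subject", "class", k = 5)
```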

multilevelmod addresses the type of algorithm (e.g., mixed-effects models). Do you also have a suggestion for metrics and a statistical estimator that account for the overlap between training sets and the clustering in the data?
