Cross validation in clustered and imbalanced data: folds and estimators

Alberto_R · February 27, 2024, 7:29am

Dear ML community,

I am running an algorithm comparison on a classification task with a small dataset (<100 observations). The data are clustered by subject (>15 clusters), and the classes are imbalanced (70/30). Given these conditions, I am using repeated k-fold cross-validation.

Do you know if rsample (or another package) has an implementation that accounts for both clustering and class imbalance when creating the folds? That is, split the data into K folds so that each fold keeps the 70/30 class ratio, while each cluster appears in only one fold per repetition. Does this "stratification + grouping" strategy seem valid to you?
Given the overlap between the training sets in k-fold CV and the correlation between observations within clusters, which metric and statistical estimator would you use to compare the algorithms?

Thanks in advance!

hannah · February 27, 2024, 9:51am

rsample can do stratified grouped resampling, e.g. via group_vfold_cv() with its strata argument. Note the repeats argument for repeated CV. However, strata need to be constant within each group, so you'd need to check whether that is a restriction for you.

Regarding accounting for correlation between repeated measurements on the same subject, multilevelmod provides parsnip engines for a class of models to do that.

Alberto_R · February 27, 2024, 10:23am

Dear Hannah,
Thanks for the answer and the suggestions!
I have already looked at group_vfold_cv(), as you suggested, but my dataset does not satisfy that restriction: the class varies within subjects. So I think I will have to write a function myself that does not have that restriction.
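
A rough sketch of what such a hand-written function might look like, in plain Python rather than R so that it does not lean on any package API. It is not rsample's algorithm: the function name `stratified_group_folds` and the greedy scoring rule are assumptions made for illustration. Each subject is placed whole into the fold that keeps every fold's share of each class closest to 1/k, so classes may vary within a subject. With fewer than 100 observations the per-fold class ratio will only be approximately 70/30.

```python
import random
from collections import Counter, defaultdict


def stratified_group_folds(labels, groups, k=5, seed=0):
    """Return a fold index (0..k-1) per observation.

    Every group lands in exactly one fold; class proportions are
    balanced greedily, so they are approximate, not guaranteed.
    """
    rng = random.Random(seed)

    # Class counts per group, e.g. {"subj1": Counter({0: 3, 1: 2}), ...}
    group_counts = defaultdict(Counter)
    for y, g in zip(labels, groups):
        group_counts[g][y] += 1

    classes = sorted(set(labels))
    total = Counter(labels)

    # Shuffle so different seeds give different repeats, then place the
    # largest groups first (the sort is stable, so ties stay shuffled).
    order = list(group_counts)
    rng.shuffle(order)
    order.sort(key=lambda g: -sum(group_counts[g].values()))

    fold_counts = [Counter() for _ in range(k)]
    fold_of_group = {}

    def imbalance():
        # Squared deviation of each fold's share of each class from 1/k.
        return sum(
            (fc[c] / total[c] - 1.0 / k) ** 2
            for c in classes
            for fc in fold_counts
        )

    for g in order:
        best_fold, best_score = None, None
        for f in range(k):
            fold_counts[f].update(group_counts[g])    # try fold f
            score = imbalance()
            fold_counts[f].subtract(group_counts[g])  # undo the trial
            if best_score is None or score < best_score:
                best_fold, best_score = f, score
        fold_counts[best_fold].update(group_counts[g])
        fold_of_group[g] = best_fold

    return [fold_of_group[g] for g in groups]
```

Calling it once per repetition with a different `seed` would give the repeated scheme; observations with fold index `f` form the assessment set of fold `f`, and the rest form the analysis set.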

multilevelmod relates to the type of algorithm (e.g. mixed-effects models). Do you have a suggestion for a metric and a statistical estimator that account for the overlap between the training sets and for the clustered data as well?
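
For the overlap part of this question, one option from the literature is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance of the per-resample score differences by a factor that depends on the test/train size ratio. Below is a minimal sketch in plain Python, offered as an illustration and not as a recommendation made in this thread. The function name `corrected_resampled_t` is hypothetical, and using average analysis and assessment set sizes for `n_train` and `n_test` is an assumption, since grouped folds are rarely equal in size. The correction targets the overlap between training sets only; it does not by itself model correlation within subjects.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired score differences.

    diffs   : per-resample differences (model A score minus model B
              score), one value per fold per repeat
    n_train : (average) number of observations in the analysis set
    n_test  : (average) number of observations in the assessment set
    Returns (t statistic, degrees of freedom).
    """
    j = len(diffs)              # number of resamples = folds x repeats
    var = variance(diffs)       # sample variance, divides by j - 1
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    # The usual 1/j factor is replaced by 1/j + n_test/n_train.
    t = mean(diffs) / math.sqrt((1.0 / j + n_test / n_train) * var)
    return t, j - 1
```

The returned statistic would be compared against a Student t distribution with the returned degrees of freedom. Any metric computed per resample can be used for `diffs`.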
