Data spending in the world of "small data"

As the title of this post suggests, I am asking about strategies regarding data spending when developing models on small datasets.

In my area of work it is not uncommon for us to run a study on a niche topic where participants are either very difficult or very expensive to find (maybe only 75-150 records or so).

In these situations the primary purpose of modeling is to understand the effects the predictors have on the outcome, and not to deploy a predictive model. That being said, I understand the importance of building a model that generalizes well, even if that isn't the primary goal.

Can anyone suggest best practices, strategies, or resources on the topic of data spending (e.g., train/test splits, cross-validation, etc.) with small data?

Thanks!


Try to always have a test set, even if it can only produce gross estimates of performance.
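As a rough illustration (a sketch in Python with scikit-learn; the placeholder data, split proportion, and stratification choice are all assumptions to adapt to your own setting), holding out a small test set from ~100 records might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))        # placeholder predictors (~100 records)
y = rng.integers(0, 2, size=100)     # placeholder binary outcome

# Hold back ~20% as a test set; stratify on the outcome so the tiny
# test set is at least representative of the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# ~80 rows remain for modeling/resampling, ~20 are kept back for a
# final, admittedly gross, check of generalization.
```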

Other than that, lean into using a large number of resamples, and use them to measure the uncertainty in your statistical evaluations of the data.
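For the resampling part, here is a minimal sketch of repeated cross-validation that reports the spread of the scores rather than a single number (again Python/scikit-learn assumed; the model, metric, and repeat counts are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X_train = rng.normal(size=(80, 5))      # stand-in for the training split above
y_train = rng.integers(0, 2, size=80)

# 10-fold CV repeated 10 times -> 100 fold-level scores, enough to
# look at the spread of the metric, not just its mean.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, scoring="roc_auc", cv=cv)

print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f} "
      f"across {len(scores)} resamples")
```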

Thank you, this makes sense. In light of your suggestion, do you think there is ever a right time and place for leave-one-out cross-validation? I rarely (if ever) use it, for fear of an overoptimistic measurement of accuracy, and because it only produces one measurement rather than a distribution.
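To make that worry concrete, here is a rough sketch (Python/scikit-learn, purely illustrative data) of what I mean by LOO producing effectively one number while repeated CV gives a distribution of fold-level scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # illustrative data only
y = rng.integers(0, 2, size=100)
model = LogisticRegression(max_iter=1000)

# LOO: each "fold" holds out a single row, so every per-fold score is
# 0 or 1 and only the overall mean is interpretable.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# Repeated 10-fold CV: 100 fold-level accuracies whose spread can be
# summarized alongside the mean.
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
rcv_scores = cross_val_score(model, X, y, cv=rkf)

print(f"LOO accuracy: {loo_scores.mean():.3f} (single overall estimate)")
print(f"Repeated CV:  {rcv_scores.mean():.3f} +/- {rcv_scores.std(ddof=1):.3f}")
```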
