Data spending in the world of "small data"

As the title of this post suggests, I am asking about strategies regarding data spending when developing models on small datasets.

In my area of work it is not uncommon for us to run a study on a niche topic where it is either very difficult or very expensive for us to find participants (maybe only 75-150 records or so).

In these situations the primary purpose of modeling is to understand the effects the predictors have on the outcome, and not to deploy a predictive model. That being said, I understand the importance of building a model that generalizes well, even if that isn't the primary goal.

Can anyone suggest best practices, strategies, or resources on the topic of data spending (e.g., train/test splits, cross-validation, etc.) with small data?



Try to always have a test set, even if it can only produce gross estimates of performance.
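To make the idea concrete, here is a minimal pure-Python sketch of holding out a small test set from a dataset of this size; the helper name and the 20% fraction are illustrative choices, not a prescription.

```python
import random

def train_test_split(records, test_frac=0.2, seed=1):
    """Hold out a simple random test set (hypothetical helper).

    With ~100 records, a 20% test set is only ~20 records, so any
    performance estimate from it will be coarse -- but it still
    gives an honest, untouched check at the end.
    """
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    n_test = max(1, int(len(records) * test_frac))
    test_idx = set(idx[:n_test])
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test

# e.g. 100 records -> 80 train / 20 test
train, test = train_test_split(list(range(100)))
```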

Other than that, lean on a large number of resamples, and use them to measure the uncertainty in your statistical evaluations of the data.
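As an illustration of "a large number of resamples," the sketch below runs repeated 5-fold cross-validation (25 repeats, so 125 resamples) in pure Python on toy data, using a stand-in model that just predicts the training-set mean. The point is that you get a whole distribution of scores, whose spread measures the uncertainty in the performance estimate; all names and numbers here are hypothetical.

```python
import random
import statistics

def repeated_kfold_indices(n, k=5, repeats=25, seed=1):
    """Yield (train_idx, test_idx) pairs for repeats x k resamples."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in folds:
            held_out = set(fold)
            yield [i for i in idx if i not in held_out], fold

# toy outcome data standing in for a real small dataset
rng = random.Random(0)
y = [rng.gauss(10, 2) for _ in range(100)]

scores = []
for train_idx, test_idx in repeated_kfold_indices(len(y), k=5, repeats=25):
    # "model": predict the training-set mean for every held-out record
    mu = sum(y[i] for i in train_idx) / len(train_idx)
    mse = sum((y[i] - mu) ** 2 for i in test_idx) / len(test_idx)
    scores.append(mse)

# 125 scores: the mean is the estimate, the spread is its uncertainty
print(statistics.mean(scores), statistics.stdev(scores))
```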

Thank you, this makes sense. In light of your suggestion, do you think there is ever a right time and place for leave-one-out cross-validation? I rarely (if ever) use it, for fear of an overoptimistic measure of accuracy, and because it produces only one measurement rather than a distribution.
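For contrast with the repeated-resampling approach, here is a pure-Python sketch of leave-one-out cross-validation on the same kind of toy data (mean-only stand-in model, hypothetical names). Each record is held out once, and the per-record errors collapse into a single aggregate number, which is the limitation raised above: there is no resample-to-resample spread to quantify uncertainty.

```python
import random

rng = random.Random(0)
y = [rng.gauss(10, 2) for _ in range(75)]  # toy data, ~75 records

# leave-one-out: each record is its own one-record test set
errors = []
for i in range(len(y)):
    train = y[:i] + y[i + 1:]
    mu = sum(train) / len(train)      # "model" fit without record i
    errors.append((y[i] - mu) ** 2)   # squared error on record i alone

# one aggregate estimate -- no distribution of resample scores
loo_mse = sum(errors) / len(errors)
print(loo_mse)
```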
