Single Test Set to estimate future performance of ML models?

fjodor · May 18, 2020, 8:16pm

First of all, thanks to everyone involved for all the work that has gone into tidymodels.

There's one concept that has been boggling me for a while - maybe someone can point me in the right direction. A link to a good explanation would suffice.

From my own experiments and from literature (Applied Predictive Modeling) I have in mind: "Don't rely on a single test set". I found substantial variation in test set error just by changing the seed to set aside testing data. So I thought the answer to that shortcoming was cross validation (or similar resampling techniques): assessing performance across multiple sets of data unseen during model building. I have used caret to do that. So my answer to expected future performance would be cross validated performance measures.
In blog posts introducing tidymodels, I have frequently seen this idea of first setting aside a single test set, then performing model building using a number of steps for feature engineering, parameter tuning etc., all based on cross validation / resampling, and then estimating future performance on the one single test set. Somehow, this approach does not fully convince me ... Why ending up with a single test set again?

I apologize for cross-posting - I got no answer here in a week: https://rviews.rstudio.com/2020/04/21/the-case-for-tidymodels/

Kind regards

Max · May 19, 2020, 4:09am

It really depends on your data. Certainly if you have "enough" it is possible to have a single holdout data set do a reasonable job of estimating performance.

In my pre-RStudio work, I used a single holdout about 5% of the time. In those cases, I knew a lot about the data (since we had been accruing more data for years) and there was a lot of it.

You can experiment with your data and see how well one holdout correlates to what you get with multiple resamples. If you feel comfortable then have at it. Otherwise, I'd suggest to stick to whatever you think is best for your particular data.

fjodor · May 19, 2020, 8:48am

Thanks, Max! I find your view very helpful. It's a general approach I found appealing in various contexts: Don't blindly follow a recipe, think about your specific use case ...

Max · May 19, 2020, 1:56pm

Well, maybe some recipes are worth following

system · May 26, 2020, 1:56pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.