Is it appropriate to perform a final fit with the entire dataset?

Dear all,

At the end of a machine learning pipeline using tidymodels, is it advisable to perform a final model fitting with the entire dataset?

According to the Law of Large Numbers, it is important to have independent samples for training and testing: to evaluate a model (using mean squared error, for example), we need the terms (Y_i - g(X_i))^2 to be independent, so that the MSE converges in probability to the expected value of the random variable (Y - g(X))^2.

Therefore, I understand that:

1 - The test set is very important for estimating the model's predictive risk.
2 - Cross-validation is useful so that we can evaluate different combinations of hyperparameters and select the best combination on the validation set.

However, after all of this, once the hyperparameters have been estimated using procedures such as cross-validation, bootstrap, or nested cross-validation, and after we have assessed the predictive risk of g, is it appropriate to proceed with a fit using the entire dataset (training + test) to improve the parameter estimates of the model?

I think that the split into training and test sets, together with cross-validation, is useful for two things:

1 - Estimating the model's hyperparameters.
2 - Measuring the predictive risk of the model, since we need the MSE to converge to the true expected value.

This suggests that it would not be detrimental to perform a final fit of the model on the entire dataset (training and test sets). Do you do this?

I see many analysis scripts where people perform a last_fit at the end of the analysis, using the entire training set, and the final model is evaluated on the test set. That's perfectly fine, since I need a sample independent of the one used in training to estimate the predictive risk of the model.

After all of this, I think it wouldn't hurt to proceed with a final fit of the model on the entire dataset, as I already know the hyperparameters and the actual risk of the model returned by last_fit(). Does anyone else think the same way?
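For concreteness, here is a minimal tidymodels sketch of the pipeline I have in mind (the dataset `my_data`, the formula `y ~ .`, and the lasso model spec are hypothetical placeholders); the last step is the extra refit on all of the data that I am asking about:

```r
library(tidymodels)

set.seed(123)
data_split <- initial_split(my_data, prop = 0.80)   # `my_data` is a placeholder
train_data <- training(data_split)
folds      <- vfold_cv(train_data, v = 10)

wf <- workflow() |>
  add_formula(y ~ .) |>
  add_model(linear_reg(penalty = tune(), mixture = 1) |> set_engine("glmnet"))

# 1) Tune the hyperparameter by cross-validation on the training set
tuned <- tune_grid(wf, resamples = folds, grid = 20)
best  <- select_best(tuned, metric = "rmse")

# 2) Fit on the full training set and evaluate once on the held-out test set
final_wf  <- finalize_workflow(wf, best)
final_res <- last_fit(final_wf, data_split)
collect_metrics(final_res)   # estimate of the predictive risk R(g)

# 3) The step I am asking about: refit, with the same hyperparameters,
#    on the entire dataset (training + test)
fit_all <- fit(final_wf, data = my_data)
```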

Best regards.

There's a risk of overfitting to take into account; without a held-out sample, you are in danger of not knowing that you have done so, and your performance may turn out to be worse than that of your original, more general fit.

Thank you very much for your response. Just to further discuss theoretically and open the floor for others to comment, I extend my question based on your answer.

I understand what you are talking about regarding the evaluation of the predictive risk of a model g in predicting observations of the random variable Y (label).

I need the sequence of random variables (Y_1 - g(X_1))^2, ..., (Y_m - g(X_m))^2, where m is the cardinality of the test set, to be independent and identically distributed. Therefore, to ensure independence, I require a test set of cardinality m > 0 that was not used to train the model. Hence, by the strong law of large numbers, we have:

\frac{1}{m} \sum_{i = 1}^m (Y_i - g(X_i))^2 \longrightarrow \mathbb{E}[(Y_i - g(X_i))^2] = R(g),

for each i (the limit does not depend on i, since the terms are identically distributed). The convergence above is almost sure, i.e., it holds with probability one. Thus, we need a test set in order to have a sequence of independent random variables, so that I can select a model properly. In other words, the test set guarantees that the mean squared error provides a good estimate of R(g).

However, once I have a good estimate of R(g) and can therefore select the best model, and since the hyperparameters at this stage are fixed and no longer need to be estimated from the data, why not use the entire sample (train + test) to re-estimate the remaining parameters that index the model, given that the estimators of those parameters are consistent?

I realize that the optimism you refer to would arise if I had used the training set to evaluate the model, that is, if I assessed R(g) on the same sample used for training.

Best regards.

In other words, I believe the overfitting would come from contamination in the evaluation of R(g) if the training set were used for both training and evaluation. So I agree that R(g) must be estimated on a separate test sample; otherwise, we would not have a good estimate of R(g). Still, I wonder why some people do not then re-estimate the parameters of the model on the full dataset, for example the coefficients \beta_i of a linear regression, given that least squares estimates improve as the sample size increases.

Wouldn't the test set be important only for evaluating R(g)?
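As a purely illustrative aside, a toy simulation (my own, with made-up coefficients) shows what I mean by "the estimates improve": the standard error of a least squares slope shrinks roughly like 1/sqrt(n) as the sample size grows.

```r
# Toy simulation: the standard error of the least squares slope estimate
# shrinks as the sample size n grows (roughly like 1 / sqrt(n)).
set.seed(1)

se_slope <- function(n) {
  x <- rnorm(n)
  y <- 2 + 3 * x + rnorm(n)   # true coefficients (made up): beta_0 = 2, beta_1 = 3
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

sapply(c(100, 1000, 10000), se_slope)
```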

The test set serves as the substrate for a statistical estimator for a model built with other data.

Once you add that data to the training set and refit, your estimator isn't valid anymore. By how much? It depends.

If you have a lot of data, the inclusion of a relatively small amount won't completely invalidate the estimator.

If you only have a little bit of data, adding the test set will potentially help a lot. The downside is that your test set estimates are probably suspect. This adds a lot of risk for some potential benefit.

An alternative is to just use resampling on the entire pool of data and hope that your resampling scheme covers all of the important processes. Personally, I wouldn't do this (or add the test set in), but they are options.
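For reference, a minimal sketch of that alternative in tidymodels might look like the following (this is an illustration, not Max's code; `my_data` and the model spec are placeholders): performance is estimated only by resampling the full pool, and the final model is fit once on all of the data.

```r
library(tidymodels)

set.seed(123)
# Resample the whole pool of data instead of setting aside a test set
folds_all <- vfold_cv(my_data, v = 10, repeats = 5)

wf <- workflow() |>
  add_formula(y ~ .) |>
  add_model(linear_reg() |> set_engine("lm"))

# Performance is estimated entirely by resampling...
res <- fit_resamples(wf, resamples = folds_all)
collect_metrics(res)

# ...and the final model is fit once on all of the data
fit_all <- fit(wf, data = my_data)
```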

Hi @Max, how are you? Thank you for your response. It seems to me that my idea is valid in some contexts.

Wouldn't the concern about the test set apply strictly to evaluating R(g), using mean squared error in a regression problem, for example? In that case, the mean squared error would be optimistic if we used the same data to both train and evaluate the model. I need independence to evaluate the model, and therefore I necessarily need a test set.

Imagine a simple problem, for example a polynomial regression, with or without regularization; it doesn't matter. In order to say that my model is the least bad among a set of models g_1, g_2, \ldots, I need to estimate R(g) consistently. To benefit from the Law of Large Numbers, and consequently from the convergence of the mean squared error to \mathbb{E}[(Y_i - g(X_i))^2], I necessarily need an independent sample, i.e., we need a test set. Then suppose I chose g_1 as the best model and found a good hyperparameter for the degree of the polynomial through cross-validation.
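To make that example concrete, a minimal tidymodels sketch of the setup could look like this (the dataset `sim_data` and the column names `x` and `y` are made-up placeholders):

```r
library(tidymodels)

set.seed(42)
data_split <- initial_split(sim_data, prop = 0.80)   # `sim_data` is a placeholder
folds      <- vfold_cv(training(data_split), v = 10)

# The polynomial degree is the hyperparameter to be tuned
rec <- recipe(y ~ x, data = training(data_split)) |>
  step_poly(x, degree = tune())

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg() |> set_engine("lm"))

tuned <- tune_grid(wf, resamples = folds, grid = tibble(degree = 1:10))
best  <- select_best(tuned, metric = "rmse")   # chosen polynomial degree

# One final, independent estimate of R(g) on the test set
final_res <- last_fit(finalize_workflow(wf, best), data_split)
collect_metrics(final_res)
```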

However, when it comes to estimating \beta's, wouldn't more data be better, as the estimator enjoys good inferential properties asymptotically?

Maybe I'm still too focused on the Data Modeling Culture :sweat_smile:.

The good thing is that now it becomes clearer that in some situations, it may make sense to merge training and testing after model evaluation.

I will meditate a bit more on your response.

Best regards.

In other words, my point is that once I have decided on the model and the combination of hyperparameters using an independent dataset (the test set), I can finalize the model and improve the estimates of the \beta's with more data, since I no longer need to reassess the model's risk; the model has already been chosen. Here I am assuming that re-estimating with more data does, in my example, amount to improving the estimates of the \beta's.

But, as you mentioned, this is not always true.

Best regards.
