Dear all,
At the end of a machine learning pipeline using tidymodels, is it advisable to perform a final model fit on the entire dataset?
The training and test samples need to be independent because, to evaluate a model g (using mean squared error, for example), the terms (Y_i - g(X_i))^2 computed on the test set must be independent of each other and of the data used to fit g. Then, by the Law of Large Numbers, the test MSE converges in probability to the expected value of the random variable (Y - g(X))^2.
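To make the statement precise, here is how I would write it. The symbols m (test set size), \widehat{R} (test MSE) and R (predictive risk) are my own notation, and I am assuming the test pairs (X_i, Y_i) are i.i.d., independent of the data used to fit g, with a finite expected squared error:

```latex
\[
\widehat{R}(g) \;=\; \frac{1}{m}\sum_{i=1}^{m}\bigl(Y_i - g(X_i)\bigr)^2
\;\xrightarrow{\;p\;}\;
\mathbb{E}\bigl[(Y - g(X))^2\bigr] \;=\; R(g)
\qquad \text{as } m \to \infty .
\]
```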
Therefore, I understand that:
1 - The test set is very important for estimating the model's predictive risk.
2 - Cross-validation is useful so that we can evaluate different combinations of hyperparameters and select the best combination on the validation set.
However, after all of this, once the hyperparameters have been chosen using procedures such as cross-validation, bootstrap, or nested cross-validation, and after we have assessed the predictive risk of g, is it appropriate to proceed with a fit using the entire dataset (training + test) to improve the parameter estimates of the model?
I think that splitting the data into training, validation (cross-validation), and test sets is useful for two things:
1 - Choosing the model's hyperparameters.
2 - Measuring the predictive risk of the model, since we need the MSE to converge to the true expected value.
This suggests that it would not be detrimental to perform a final fit of the model on the entire dataset (training and test sets). Do you do this?
I see many analysis scripts where people call last_fit at the end of the analysis: the model is fitted on the entire training set and then evaluated on the test set. That is perfectly fine, since I need a sample independent of the one used in training to estimate the predictive risk of the model.
After all of this, I think it wouldn't hurt to proceed with a final fit of the model on the entire dataset, as I already know the hyperparameters and the estimated risk of the model returned by the last_fit function. Does anyone else think the same way?
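A minimal sketch of what I have in mind. The mtcars data, the rpart decision tree, the 80/20 split, and the 5-fold cross-validation are only placeholders I chose for illustration, not a recommendation:

```r
# Sketch only: dataset, model, split proportion and number of folds are assumptions.
library(tidymodels)

set.seed(123)
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
folds <- vfold_cv(train, v = 5)

# Workflow with one hyperparameter left to tune
wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(
    decision_tree(cost_complexity = tune()) %>%
      set_engine("rpart") %>%
      set_mode("regression")
  )

# 1. Choose the hyperparameter by cross-validation on the training set only
tuned <- tune_grid(wf, resamples = folds, grid = 5, metrics = metric_set(rmse))
final_wf <- finalize_workflow(wf, select_best(tuned, metric = "rmse"))

# 2. last_fit(): fit on the training set, evaluate once on the test set
test_fit <- last_fit(final_wf, split, metrics = metric_set(rmse))
test_metrics <- collect_metrics(test_fit)

# 3. Refit the same finalized workflow on all rows (training + test)
full_fit <- fit(final_wf, data = mtcars)
```

With this approach, the estimate from step 2 would be the only risk estimate available for the model from step 3, since no held-out rows remain to evaluate it again.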
Best regards.