I want to call the best training model among ten others:
Bestmodel<- which.max(models$r2)
Bestmodel
[1] ols
- My first and main problem is that I would like the estimation results printed instead of only the name of the fit ("ols"). How can I do that?
- My second question: is there a function in the caret package to extract the estimated parameters of the best model, like coef(Bestmodel)?
- My last question is broader: do you think the best model on the training sample is more likely to also be the best in prediction (test sample)?
No, the specifics depend on the context, but in general, your training error will continue decreasing with more parameters, while the testing error will have a "sweet spot", decreasing first then increasing (overfitting). That's why you should select the best model on a testing or validation error, not on the training error. Typically cross-validation can be used.
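To make the "sweet spot" concrete, here is a minimal base R sketch. Everything in it is an assumption made for illustration: the data are simulated, the candidate models are polynomials of degree 1 to 10, and the 5-fold split is done by hand rather than with any package.

```r
set.seed(1)

# Simulated data (assumption): a smooth signal plus noise
n <- 60
x <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row
degrees <- 1:10

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Training error: fit and evaluate on the same data
train_rmse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = dat)
  rmse(dat$y, fitted(fit))
})

# Cross-validated error: fit on k - 1 folds, evaluate on the held-out fold
cv_rmse <- sapply(degrees, function(d) {
  mean(sapply(1:k, function(i) {
    fit <- lm(y ~ poly(x, d), data = dat[fold != i, ])
    rmse(dat$y[fold == i], predict(fit, newdata = dat[fold == i, ]))
  }))
})

data.frame(degree = degrees, train_rmse, cv_rmse)
degrees[which.min(train_rmse)]  # degree preferred by the training error
degrees[which.min(cv_rmse)]     # degree preferred by cross-validation
```

Because the polynomial models are nested, the training RMSE cannot increase as the degree grows, so selecting on it tends to favour the most complex model. The cross-validated RMSE has no such guarantee, which is why it is the more useful selection criterion.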
In addition, once you've chosen the best model (based on its cross-validation error), it's common to re-train it using the whole dataset (training + validation), to yield a final model (keeping an additional test sample not used for model selection to evaluate the generalization error of the final model).
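That workflow could be sketched as follows in base R. The data, the three candidate formulas and the names `dev`, `test` and `final_model` are all hypothetical, chosen only to show the order of the steps.

```r
set.seed(2)

# Simulated data (assumption): y depends on x1 and x2 only
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# 1. Put a test sample aside; it plays no role in model selection
test_idx <- sample(n, 40)
test <- dat[test_idx, ]
dev <- dat[-test_idx, ]   # training + validation data

# 2. Candidate models (hypothetical)
formulas <- list(
  small  = y ~ x1,
  medium = y ~ x1 + x2,
  large  = y ~ x1 * x2 * x3
)

# 3. 5-fold cross-validation on the development data
k <- 5
fold <- sample(rep(1:k, length.out = nrow(dev)))
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

cv_rmse <- sapply(formulas, function(f) {
  mean(sapply(1:k, function(i) {
    fit <- lm(f, data = dev[fold != i, ])
    rmse(dev$y[fold == i], predict(fit, newdata = dev[fold == i, ]))
  }))
})

# 4. Re-train the selected model on all the development data
best_name <- names(which.min(cv_rmse))
final_model <- lm(formulas[[best_name]], data = dev)

# 5. Estimate the generalization error on the untouched test sample
test_rmse <- rmse(test$y, predict(final_model, newdata = test))
summary(final_model)
test_rmse
```

The test sample is used exactly once, at the very end, so `test_rmse` is not biased by the selection step.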
I'm not familiar with {caret}, but it should have all the functionality to extract the best model, see here, and in ?train the return value (specifically bestTune).
Thank you very much for the clear explanation!
I read all the discussions on the two links, thanks.
I will look at the difference between validation and test sets.
That was a very important question. However, my most basic question remains unsolved. I was just asking how to extract the best model: when I call the best model, R returns something like [1] model_5, which I store in "best-model", but when I type best-model in the script I still get [1] model_5 instead of the estimates of model 5.
Please let me put it simply.
I fit several OLS models (model1, model2, etc., which are all stored in MODELS).
I wrote a line to select the model with the highest R2: which.max(MODELS$r2).
I would like to see the estimates of the best model when I call it through which.max() (not only the name of the model).
Thanks a lot.
So model$bestTune gives you the tuning parameters of the best model, and model$results gives you the metrics for every parameter combination that was tested.
If MODELS is a list that you created with sapply() or a for loop, then which.max() should give you the index, which you can use to extract the model from the list: MODELS[[ which.max(r2) ]], where r2 is the vector of R2 values in the same order as the list.
I'm not sure how you created MODELS and what class() it has.
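In case it helps, here is a minimal sketch of what I mean, assuming the fitted models themselves are kept in a named list. The `mtcars` data and the three formulas are only placeholders, since I don't know your real models.

```r
# Keep the fitted lm objects themselves in a named list (placeholder models)
models <- list(
  model1 = lm(mpg ~ wt, data = mtcars),
  model2 = lm(mpg ~ wt + hp, data = mtcars),
  model3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# One R2 value per model, in the same order as the list
r2 <- sapply(models, function(m) summary(m)$r.squared)

best_idx <- which.max(r2)         # position (and name) of the best model
best_model <- models[[best_idx]]  # the fitted object, not just its name

print(best_model)    # call and coefficients
summary(best_model)  # full estimation table
coef(best_model)     # estimated parameters only
```

The key point is that `which.max()` only returns a position; indexing the list with `[[ ]]` is what gives back the model. Note also that for nested OLS models like these, the in-sample R2 can never decrease when predictors are added, so this criterion tends to pick the largest model.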
Thank you very much for your response. I agree.
But how do I find the same metrics when calling the best model among several models? Below is a basic fictitious example.
Can you paste the result of head(MODELS) between backquotes:
```
head(MODELS)
```
It's unclear what format your data is in: is it a tibble with a single column which is a character string containing "model1 1.00", "model2 0.98", ...? In that case which.max() should fail (because the max of a character string is meaningless). Or do you have 2 columns, contrary to the header that says:
How did you run the models? Are the models themselves stored somewhere? What's the result of these:
Thank you for your message. I'm sorry about the confusion: I did not post my actual code; instead I invented a few lines of code to explain the situation.
So it looks like you didn't save the models themselves: your first column is character, so only contains the name of the model, the other 2 columns only contain values.
The model itself should be a "big" and complex object, contained in the results of the train() function. In the page I linked above, it's the case of gbmFit3 which contains the details of the fit.
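As a rough sketch of what that could look like with {caret}: store each train() result in a named list, compare their cross-validated R2, then pull out the winning object. I'm assuming here that {caret} is installed, and the `mtcars` data and the two "lm" specifications are placeholders of mine, not your models.

```r
library(caret)

set.seed(3)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Keep the train() results themselves, not only their names
fits <- list(
  lm_small = train(mpg ~ wt, data = mtcars, method = "lm", trControl = ctrl),
  lm_large = train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
)

# Best cross-validated R2 found inside each train object
cv_r2 <- sapply(fits, function(f) max(f$results$Rsquared, na.rm = TRUE))

best_name <- names(which.max(cv_r2))
best_fit <- fits[[best_name]]

print(best_fit)            # resampling summary of the selected model
best_fit$bestTune          # tuning parameters that were selected
coef(best_fit$finalModel)  # coefficients of the underlying lm fit
```

`finalModel` is the model refitted on the full training data with the `bestTune` parameters; `coef()` works here because the underlying object is an `lm` fit, which may not be true for other methods. For a strictly fair comparison the models should be resampled on identical folds, which this sketch does not enforce.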
Please let me put it differently. Imagine your caret code with two models, say 'gbm' and 'lm'. Now call the model with the highest R2. Can you do it? Thank you.
Right. Thank you. I did it, but when I write get(best_model), I get an error message telling me that the object does not exist. That is true, because best_model is a string that corresponds to, for example, 'model1', which is not an object.
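For what it's worth, get() only works if an object with exactly that name exists in the workspace. A small sketch, with placeholder models on `mtcars` and hypothetical object names:

```r
# The fitted models must exist as objects under these names
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

best_name <- "model2"         # a character string, e.g. taken from a results table

exists(best_name)             # is there an object called "model2"?
best_model <- get(best_name)  # look the object up by its name
coef(best_model)

# get() fails when no object has that name; here the error message is captured
msg <- tryCatch(get("model_99"), error = function(e) conditionMessage(e))
msg

# Usually simpler: keep the models in a named list and index it by name
models <- list(model1 = model1, model2 = model2)
coef(models[[best_name]])
```

So if get() reports that the object does not exist, the likely cause is that only the model names and metrics were saved, not the fitted models themselves.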
Thank you, and sorry about that. Please let me look again at how to solve this issue based on your numerous comments.
Please note that I have a second issue, which I posted, about which sampling approach is best for time series. Many thanks.