Hi, I'm working with some data from heart failure clinics in order to predict death events. I'm relatively new, so I'm confused about some of the concepts in R.
I am going to build a logistic regression model and CART models (decision tree and randomForest). How do I find the best split ratio for my data? Or is this number purely arbitrary?
Additionally, should the split ratio remain constant across all three models, or is there an optimum number for each of them?
Training/testing split ratios are pretty arbitrary. They are usually chosen to leave enough data to train a model with low overfitting, and a test set large enough to detect that overfitting. This is best decided by knowing the data structure and shouldn't really be considered a tuning parameter. In other words, it's something based on human decision, rather than trying out many different splits to find the best model. That sort of data-driven splitting will more than likely lead to poor external generalizability for your model.
If you want to compare the model performances to one another to see which model is "best", then it makes sense to use the same training and testing splits for the various model types. Hope this is helpful.
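In case it helps, here is a minimal sketch of reusing one split for all three model types. Your actual data isn't shown, so the data frame is simulated and the column names (`age`, `ejection_fraction`, `serum_creatinine`, `death_event`) are hypothetical stand-ins. The 70/30 ratio, the seed, and the 0.5 classification cutoff are assumptions for illustration, not recommendations. It assumes `rpart` is installed (it ships with standard R distributions) and only fits the random forest if the `randomForest` package is available.

```r
library(rpart)

set.seed(42)  # fixed seed so the same split can be reproduced later

# Simulated stand-in data; replace with your own data frame.
n <- 300
dat <- data.frame(
  age               = round(runif(n, 40, 95)),
  ejection_fraction = round(runif(n, 15, 80)),
  serum_creatinine  = round(rlnorm(n, 0.2, 0.4), 2)
)
p <- plogis(-1 + 0.04 * (dat$age - 60) -
              0.04 * (dat$ejection_fraction - 40) +
              0.8 * (dat$serum_creatinine - 1))
dat$death_event <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Choose the ratio once, up front, and draw the row indices a single time.
train_frac <- 0.7
train_idx  <- sample(seq_len(n), size = round(train_frac * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Logistic regression, fitted on the shared training set.
fit_glm  <- glm(death_event ~ ., data = train, family = binomial)
prob_glm <- predict(fit_glm, newdata = test, type = "response")
pred_glm <- ifelse(prob_glm > 0.5, "1", "0")

# Decision tree (CART), fitted on the same training set.
fit_tree  <- rpart(death_event ~ ., data = train, method = "class")
pred_tree <- predict(fit_tree, newdata = test, type = "class")

# Test-set accuracy for each model, all scored on the same held-out rows.
acc <- c(
  logistic = mean(pred_glm == as.character(test$death_event)),
  tree     = mean(as.character(pred_tree) == as.character(test$death_event))
)

# Random forest, only if the package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_rf  <- randomForest::randomForest(death_event ~ ., data = train)
  pred_rf <- predict(fit_rf, newdata = test)
  acc["random_forest"] <- mean(as.character(pred_rf) == as.character(test$death_event))
}

print(round(acc, 3))
```

Because `train_idx` is drawn once, every model sees exactly the same training rows and is scored on exactly the same test rows, so differences in accuracy reflect the models rather than the luck of the split.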