Thank you in advance for the read. I have been working with a data set at work using xgboost (via caret), setting my seed for reproducibility and tuning the parameters. When I tune over a full expand.grid of candidate values, I get higher accuracy on the model (and better predictions on my test set) than when I pass the same winning parameters (found via model$bestTune) back into expand.grid as single values with no sequences.

I've done my best to generate a reproducible example but am having a hard time doing so, which leads me to think the model may be overfit. Please note that in my real-world model I've already shrunk the expand.grid to a more optimized size (in case that is someone's suggestion). I've also removed the seed to see how stable the model accuracy is, and it is definitely quite variable: 76% on the test set is the highest I've seen, and six other runs give 61%-73% (see the stability-check sketch at the end of the post).
Any ideas on why this happens? In my real-world work, accuracy on the test set drops from 76% to about 71% with this one change. The test set is 20% of the data (n = 167).
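For reference, the split is along these lines (a minimal sketch; `df`, the outcome column `y`, and the seed are hypothetical stand-ins for my actual data):

```r
library(caret)

set.seed(42)  # hypothetical seed for illustration
# 80/20 stratified train/test split on the outcome
train_index <- createDataPartition(df$y, p = 0.8, list = FALSE)
training <- df[train_index, ]
testing  <- df[-train_index, ]  # the held-out 20% (n = 167 in my data)
```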
In case it helps, the grid search is:
```r
tune_grid <- expand.grid(
  max_depth = c(3, 4, 5),
  nrounds = seq(from = 25, to = 95, by = 10),
  eta = c(0.025, 0.05, 0.1),
  gamma = 0,
  colsample_bytree = c(0.6, 0.8),
  min_child_weight = 1,
  subsample = 1
)
```
The best tune is:
```r
best_grid <- expand.grid(
  max_depth = 3,
  nrounds = 65,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1,
  subsample = 1
)
```
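Concretely, the two runs I am comparing look like this (a sketch using the `tune_grid` and `best_grid` objects above; the formula, data name, seed, and resampling settings are hypothetical placeholders for my real setup):

```r
ctrl <- trainControl(method = "cv", number = 5)  # placeholder resampling scheme

# Run 1: tune over the full grid
set.seed(123)
fit_grid <- train(y ~ ., data = training, method = "xgbTree",
                  trControl = ctrl, tuneGrid = tune_grid)

# Run 2: identical seed, but the grid collapsed to the single bestTune row
set.seed(123)
fit_fixed <- train(y ~ ., data = training, method = "xgbTree",
                   trControl = ctrl, tuneGrid = best_grid)
```

It is `fit_fixed` that scores about five points lower on the held-out test set, despite using exactly the parameters `fit_grid` selected.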
Since I can't come up with a reprex that actually reproduces the problem (I tried three separate times and got stable results each time), I'm asking this in a theory sense rather than as a "how do I make this code work?" question.
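For the stability check mentioned above, the loop is roughly this (again a sketch; explicit seeds stand in for the runs where I simply removed set.seed()):

```r
# Refit the one-row grid under several seeds and compare test-set accuracy
accs <- sapply(1:7, function(s) {
  set.seed(s)  # hypothetical seeds; in practice I just dropped the seed
  fit <- train(y ~ ., data = training, method = "xgbTree",
               trControl = ctrl, tuneGrid = best_grid)
  preds <- predict(fit, newdata = testing)
  confusionMatrix(preds, testing$y)$overall["Accuracy"]
})
accs  # in my data these land anywhere from ~0.61 to ~0.76
```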