Background: I am trying to do a Bayesian hyperparameter search for boosted tree based models (xgboost / LightGBM) using tune::tune_bayes(). There are a number of hyperparameters I want to tune, but one is sort of special: trees. This is already recognized to some extent by the tune package: for example, if I create a grid of hyperparameters and for each set ask for all values of trees from 1 to 1000, then tune_grid understands that it should just create a model with 1000 trees and should additionally evaluate (and save the evaluation) after each additional tree has been created (i.e. by trying trees=1000 you can evaluate 1:999 almost for free at the same time). That's of course rather specific to boosting, because of the way the trees are sequentially created by looking at what the previous ones still get wrong.
What is my problem / am I right that this is a problem? However, I cannot figure out (maybe I overlooked something) how to get tune_bayes() to behave in such a way. When I specify that trees should be tuned - e.g. via boost_tree(mtry = tune(), trees = tune()) - then tune_bayes() seems to try one specific value of trees, such as 277, and only retains the metric of interest for trees=277 (but not for 1:276). If I specify a fixed number of trees with boost_tree(mtry = tune(), trees = 1000), then I just get the results for trees=1000 (but not for 1:999). Or did I misunderstand what tune_bayes() does and it actually does what I want (in which case a clearer description in the documentation would probably be good)?
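To illustrate the two specifications compared above (a sketch using the same made-up dat / y as before; the iteration count is arbitrary):

```r
# (a) trees is tuned: each Bayesian iteration proposes a single trees value
#     (e.g. 277) and only that value's metrics are kept.
spec_tuned <- boost_tree(mtry = tune(), trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# (b) trees is fixed: every candidate is evaluated only at trees = 1000.
spec_fixed <- boost_tree(mtry = tune(), trees = 1000) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

res_tuned <- tune_bayes(
  workflow() %>% add_model(spec_tuned) %>% add_formula(y ~ .),
  resamples = vfold_cv(dat, v = 5),
  iter = 25
)
```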
Main question: If my understanding above is correct, then it is non-ideal that tune_bayes() does not explore trees efficiently, both in terms of finding a good value for trees and in terms of all the other parameters (you may have picked a great combination, but just cannot see it because of a poor choice for trees). I'd be quite happy to specify an upper bound myself for trees; I just want to also automatically evaluate all values below that. Any pointers on how to get tune_bayes() to do that?
One - somewhat inferior - approach for xgboost: One could use trees=1100, stop_iter=100 to do early stopping, in which case I believe the validation score at the best iteration is returned (so effectively getting me the validation metric for the trees value at which the validation metric was lowest, right?). I just need to be a bit careful about my choice of stop_iter: in particular, it must not be too small, so that we do not wrongly stop early when the model was just fluctuating a bit and would have improved later. Additionally, there is the awkward scenario where the performance at iteration 1100 is a lot worse than at 1050, but early stopping has not occurred, so we get the performance for trees=1100 back, I think. I.e. one probably has to make trees quite large, which could then occasionally waste some time. I believe this is not an option for LightGBM at the moment, because treesnip does not support stop_iter yet.
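A sketch of that workaround (hedged: I am assuming the xgboost engine's validation engine argument, which holds out a fraction of the analysis set for the early-stopping monitor; dat / y and the specific numbers are again made up):

```r
spec_es <- boost_tree(
    mtry = tune(),
    trees = 1100,     # deliberately large upper bound
    stop_iter = 100   # stop after 100 rounds without improvement
  ) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("regression")

res_es <- tune_bayes(
  workflow() %>% add_model(spec_es) %>% add_formula(y ~ .),
  resamples = vfold_cv(dat, v = 5),
  iter = 25
)
```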