my machine is overfitting and not matching benchmarks

Hello all, I hope everyone is doing well.

I am having the following issue: when there is a random component to the training process of a machine learning algorithm, such as with random forest, my computer returns MAEs on the training set that are 1-2% better than a benchmark it should match exactly (at least, that is my current understanding, given that I am using the same data and the same random seed settings in R), and then in some cases performs up to 4 percentage points worse in MAE on the test set. I am totally baffled as to how the difference could even exist in the first place, let alone how to explain the consistent overfitting that seems to be occurring.

Furthermore, the difference also appears with what I thought was a deterministic algorithm: logistic regression, even with cross-validation. I thought that algorithm was deterministic, but I suppose that if a different seed gives you a different fold pattern, you could get a different result, which I believe is also how the randomness enters the random forest algorithm. That brings me back to the question of how I am getting different answers in the first place. These algorithms are not really stochastic in the way they solve for their answers once the seed is set, I believe. Does anyone have any insight into this problem, or solutions? Thank you in advance for your time.
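To illustrate what I mean about the fold pattern, here is a minimal sketch in base R (illustrative numbers, not my actual code). The fold assignment depends entirely on the RNG state at the moment the folds are drawn, so anything that consumes random numbers between setting the seed and drawing the folds changes them:

```r
# Same seed, same RNG state: identical fold assignments.
set.seed(42)
folds_a <- sample(rep(1:5, length.out = 100))  # 5-fold assignment for 100 rows

set.seed(42)
folds_b <- sample(rep(1:5, length.out = 100))
identical(folds_a, folds_b)  # TRUE

# If anything draws random numbers between the seed and the fold split
# (say, an earlier train/test split), the folds change, and so do the
# fitted models and their MAEs.
set.seed(42)
invisible(runif(1))  # stands in for any earlier random draw
folds_c <- sample(rep(1:5, length.out = 100))
identical(folds_a, folds_c)  # FALSE
```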

Cheers,

UPDATE 11-13: I checked my code for the nth time and still found no errors, so I wrote to the people I am comparing my results with to ask whether anything has changed on their end that could cause these discrepancies. I will await their response, which they will hopefully supply.

UPDATE 11-21: I hope I am not bothering anyone with the repeated updates to this post, but I did figure out what the problem was. As you walk through their code, you change from screen to screen, and most of the time no state is saved from the prior screen/session. So when they rewrote their code to set the random seed at the top of the screen, I thought this was necessary because the state was not carried over from the prior session. It turns out the state was saved, and this had the effect of setting the random seed twice: once before doing the initial split into training and testing sets, and then again before doing the splits for the cross-validation folds on the training set.

I am still new to machine learning, but I haven't seen multiple seed setting like this before, outside the context of a seed being set within an algorithm call and then again globally. Is there a real theoretical connection between not doing the second seed setting and the relative overfitting and worse test performance that I saw? Also, in general, is there any theory on setting random seeds that would be helpful to be aware of? Even well-explained rules of thumb would be considered useful.
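For anyone who runs into the same thing, here is a minimal sketch in base R of the two seeding patterns (illustrative names and sizes, not the benchmark's actual code):

```r
n <- 100

# Pattern A: seed once; the CV folds are then drawn from whatever RNG
# state is left over after the train/test split.
set.seed(123)
train_idx <- sample(n, size = 0.8 * n)  # initial train/test split
folds_a   <- sample(rep(1:5, length.out = length(train_idx)))

# Pattern B: re-seed before drawing the CV folds (the double seeding
# described above).
set.seed(123)
train_idx <- sample(n, size = 0.8 * n)  # identical split to pattern A
set.seed(123)                           # second seed setting
folds_b   <- sample(rep(1:5, length.out = length(train_idx)))

identical(folds_a, folds_b)  # FALSE: a different fold partition
```

As far as I can tell, the second set.seed has no special theoretical meaning; it just means the benchmark's folds were drawn from a freshly reset RNG stream, while mine came from the stream left over after the train/test split. Different fold partitions are simply different resampling realizations, so matching a benchmark exactly requires reproducing every random draw in the same order, not just using the same seed.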
