In Max's APM book he proposes a couple of solutions for tackling severe class imbalance, such as re-balancing the modelling dataset or using case weights. After reading this chapter and applying some of the solutions to a couple of live examples, I came across a few questions:

When working with the credit_data dataset from the recipes package, I checked whether applying up-sampling (even though the imbalance is not really severe) would result in a better model. As expected, the recall-precision trade-off was much better in the up-sampled model because of higher recall, which completely makes sense when the cost of approving a bad credit is so much higher than rejecting a good one. But I was very surprised to see so little effect on the AUC: the difference was almost negligible. AUC is a measure of rank-ordering, which is essentially what credit scoring is all about. Does that mean that in credit/fraud scoring applications up-sampling would not necessarily be required, because the relative ordering of good/bad predictions is preserved even without up-sampling, as the AUC suggests?

Regardless of the answer to question 1), let's assume that up-sampling/case weights were used, which eventually distorts the posterior probability estimated by a model. For example, if up-sampling was applied and the average default rate of the original set was 5% (understood as count_bad / n()), the estimated default rate for the training set (understood as the average estimated default probability) would be much higher, e.g. 15%. That also means that any future estimate from the model would not reflect the true, imbalanced nature of the original training set, correct? In that case, is there a statistically valid way of adjusting the estimated posterior probabilities of future predictions?

That has been my experience with any type of subsampling for class imbalances: the default cutoff is more reasonable and the AUC is often equivalent to that of the original model.
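For what it's worth, the AUC equivalence is easy to demonstrate: AUC depends only on the rank order of the scores, and the prior shift that rebalancing induces is, to a first approximation, a monotone rescaling of the odds, which leaves the ranking intact. A minimal sketch in Python (scikit-learn assumed available; the simulated data and the odds multiplier 3 are arbitrary choices of mine):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=1000)
scores = y + rng.normal(0.0, 1.0, size=1000)   # scores correlated with the class
p = 1.0 / (1.0 + np.exp(-scores))              # "probabilities" from the original model

# Multiplying the odds by a constant mimics the prior shift that
# up-sampling induces; it is a strictly monotone transformation.
odds = p / (1.0 - p)
p_shifted = 3.0 * odds / (1.0 + 3.0 * odds)

auc_original = roc_auc_score(y, p)
auc_shifted = roc_auc_score(y, p_shifted)
```

The two AUC values come out identical (up to floating point), even though the shifted probabilities are uniformly higher.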

No, I wouldn't say that. The sampling method does have an effect on the model; it isn't just the same model with different weights/coefficients (see footnote).

There is a positive effect on the calibration of the model in the sense that the classification (i.e. posterior) probabilities do not have such pathological distributions.
I tried to demonstrate this on slide 14 of the classification notes found here.

Yes to the first question (and yes, that is a good thing). Without rebalancing, it wouldn't be uncommon for your most likely event to have a class probability below 60% (or even lower).

I wouldn't readjust the class probabilities further (since that is what the subsampling is accomplishing). If you don't subsample, you could use another recalibration method (using Bayes theorem or monotonic regression); I tend to use subsampling to solve this issue though.
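To sketch the monotonic-regression option (in Python with scikit-learn for illustration; the simulated data is just a stand-in for something like credit_data): fit a monotone map from the model's raw probabilities to the observed outcomes on a held-out calibration set. Because the map is monotone, the rank order, and hence the AUC, is essentially preserved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# imbalanced toy data: roughly a 5% event rate
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=1)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_raw = model.predict_proba(X_cal)[:, 1]

# monotone (isotonic) map from raw probabilities to observed event rates;
# for brevity this re-predicts on the calibration set, but in practice you
# would apply the fitted map to new predictions
iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y_cal)
p_recal = iso.predict(p_raw)
```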

[footnote] The exception to this is ordinary logistic regression. The class imbalance drives the intercept to an extreme, and rebalancing the data only affects the intercept parameter (and all of the standard errors).
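The footnote's claim is easy to check empirically. A rough sketch (Python/scikit-learn for illustration; one simulated predictor, with regularization effectively turned off via a large C): up-sampling the events by a factor of 19 should leave the slope essentially unchanged and shift the intercept by about log(19) ≈ 2.94.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pos, n_neg = 500, 9500                      # ~5% event rate
X = np.concatenate([rng.normal(1.0, 1.0, (n_pos, 1)),
                    rng.normal(0.0, 1.0, (n_neg, 1))])
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

# replicate every event 19 times so the classes are balanced
idx = np.concatenate([np.tile(np.arange(n_pos), 19),
                      np.arange(n_pos, n_pos + n_neg)])

m_orig = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
m_up = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y[idx])

slope_change = (m_up.coef_ - m_orig.coef_).item()          # ~0
intercept_change = (m_up.intercept_ - m_orig.intercept_).item()  # ~log(19)
```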

Thanks for your answer! I agree that the posterior distributions are less pathological (one could say better calibrated); however, they are not 'properly' calibrated from a business practitioner's point of view. If you applied a model with and without up-sampling to the same data, a given observation might have an estimated probability of, e.g., 3% without sampling and 15% with it. But this doesn't mean that the event is 5x more likely, so such estimated model probabilities are not directly interpretable and need to be adjusted after model training.

I was looking into this topic recently and even found a filed issue in caret where it is being discussed. Someone suggests there:

A possible solution could be scaling the predicted probabilities for upsample/downsampled models to the average unscaled value.

and that is also something that I came across somewhere else. That would also mean that for such a model used in production every probability estimation will be adjusted by that constant (5x in the case above). I think it's also related to what this article describes, but I'm not entirely sure yet how to put those things together.
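For what it's worth, the adjustment that follows from Bayes' theorem rescales the predicted odds, not the probability itself, by the ratio of the prior odds, which keeps the result inside (0, 1) (a constant multiplier on the probability does not). A minimal sketch (the function name and rates are my own, matching the 5%/15% example above):

```python
def prior_correct(p, train_rate, true_rate):
    """Map a probability estimated under the resampled prior (train_rate)
    back to the original prior (true_rate) by rescaling the odds."""
    odds = p / (1.0 - p)
    k = (true_rate / (1.0 - true_rate)) / (train_rate / (1.0 - train_rate))
    return odds * k / (1.0 + odds * k)

# a 15% estimate from the up-sampled model maps back to ~5%
p_adjusted = prior_correct(0.15, train_rate=0.15, true_rate=0.05)
```

Note that this map is strictly monotone, so it leaves the rank order (and hence the AUC) unchanged.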

Any insights and structured methodology of final model calibration after sampling would be much appreciated!

After investigating a bit more, I came across the XGBoost documentation website, where it says:

Handle Imbalanced Dataset
For common cases such as ads clickthrough log, the dataset is extremely imbalanced. This can affect the training of xgboost model, and there are two ways to improve it.

If you care only about the ranking order (AUC) of your prediction:
- Balance the positive and negative weights, via scale_pos_weight
- Use AUC for evaluation

If you care about predicting the right probability:
- In such a case, you cannot re-balance the dataset
- In such a case, set parameter max_delta_step to a finite number (say 1) will help convergence

It says that if we're interested in predicting the true probability estimate, the data set cannot be rebalanced in any way (sampling or case weights). That is similar to the conclusion I've reached after some reading, since Platt scaling, which I mentioned in my previous post, has a completely different purpose. So my conclusion is that in any application where getting the right probability is crucial, rebalancing would not be recommended: it doesn't improve rank-ordering and can distort the true probability estimate. Performing calibration could further improve those probability estimates, but they would not reflect the original levels before rebalancing was applied.

I disagree although I don't have any real literature to quote in either direction.

From a practical perspective, it would be impossible for me to show a model to someone and say

"we think that the event is really an event if its corresponding class probability is greater than 2.5%."

The class probability distributions tend to be pathological in this way when there is a moderate imbalance or worse. I've found that rebalancing solves that issue.

Now, it would be reasonable to say "well, if most of the data is not an event, the posterior probability should look this way" and, in a sense, that is completely correct. However, it doesn't make for a good predictive model because the likelihood will have to be extraordinarily powerful to overcome the issue of an extreme prior.

My former boss and I had this discussion all of the time because he would only view the model results from a very strict Bayesian point of view (and would not even want me to work on those projects). There is a difference between a good predictive class probability and a strict posterior that is consistent with the underlying event rate (since the latter may not ever be useful).

I'm not right (nor am I wrong); it is a matter of what you are trying to attain.