Positive class in logistic regression

fcas80 · October 26, 2023, 2:18pm

In the germancredit dataset, the target variable creditability has values 1 = bad and 2 = good. The goal is to predict the bad. Wouldn't it make sense for logistic regression to set good=0 and bad=1? I don't think I see writers doing this. Thank you.

library(scorecard)
data("germancredit")
df <- data.frame(germancredit)
table(df$creditability)
str(df$creditability)

bad good
300 700
Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...

nirgrahamuk · October 26, 2023, 2:43pm

When it comes to outcomes, 'good' and 'bad' are easier for humans to interpret than '0' and '1'.

fcas80 · October 26, 2023, 3:01pm

Thanks, nirgrahamuk. But for logistic regression, don't I want to predict default with Pr(Y=1 | x)? So shouldn't I recode creditability as good=0 and bad=1?

And am I most interested in a low number of false positives, i.e. a high specificity = TN/(TN + FP) from the confusion matrix?
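On the recoding question: per the `glm()` documentation, a factor response under `family = binomial` treats the first level as "failure" and all other levels as "success". With levels `"bad","good"`, the fitted probabilities are therefore Pr(good | x) unless the response is recoded. Below is a minimal sketch in base R on a small invented data frame (`toy`, `income`, and all values are illustrative assumptions, not the germancredit data):

```r
# Toy data, invented for illustration (not the germancredit data).
toy <- data.frame(
  creditability = factor(c("bad", "good", "good", "bad", "good",
                           "good", "bad", "good", "good", "good"),
                         levels = c("bad", "good")),
  income = c(1.0, 3.2, 2.8, 2.5, 1.4, 3.5, 2.0, 1.8, 2.9, 3.1)
)

# With a factor response, glm() models the probability of the
# non-first level, which here is "good".
fit_good <- glm(creditability ~ income, family = binomial, data = toy)

# Recode so that bad = 1 and good = 0; this fit estimates Pr(bad | x).
toy$bad <- as.integer(toy$creditability == "bad")
fit_bad <- glm(bad ~ income, family = binomial, data = toy)

# Same model, mirrored: coefficients flip sign, probabilities sum to 1.
coef(fit_good)
coef(fit_bad)

p_good <- predict(fit_good, type = "response")
p_bad  <- predict(fit_bad,  type = "response")
range(p_good + p_bad)
```

An alternative to the 0/1 recode is `relevel(toy$creditability, ref = "good")`, which keeps the readable labels but makes "good" the first level, so "bad" becomes the modelled class.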

nirgrahamuk · October 26, 2023, 3:27pm

Following the example from the scorecard documentation: by default they expect you to supply 'good'/'bad' labels, which they then map to 0/1 numbers for you.

library(scorecard)
data("germancredit")
dt_f = var_filter(germancredit, y="creditability")

table(germancredit$creditability)

 bad good 
 300  700

table(dt_f$creditability)

  0   1 
700 300

You can see this in the documentation for their var_filter function, which asks you to identify the "positive" class label:

positive: Value of positive class, Defaults to "bad|1".

fcas80 · October 26, 2023, 4:00pm

Ah, var_filter defaults to bad = 1.

Thank you.

Do you agree that I am most interested in a low number of false positives, i.e. a high specificity = TN/(TN + FP) from the confusion matrix?

nirgrahamuk · October 26, 2023, 4:11pm

I think it depends on what you are doing. For example, a credit risk department might be focused on risk-averse practices, so they might care most about accurately identifying bad credit risks in order to avoid that lending. In that case sensitivity will be a key metric.

Perhaps a department like pricing will take a more holistic view, but it's most likely that they won't use raw statistical metrics in deciding key thresholds and will instead want to incorporate cost estimates for each of TP/TN/FP/FN,
i.e. the cost benefit of identifying a bad risk and avoiding them on a loan of a certain size
vs. the cost of misidentifying a good as a bad and forgoing that income.
Then the thresholds for separating good from bad can be set from a pricing perspective rather than a straight risk one, so it depends on your goals.
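The cost-based idea above can be sketched roughly as follows: put a cost on each false negative and each false positive, then choose the probability cutoff with the lowest total cost. This is a simplified illustration that ignores any benefit attached to TP/TN; the scores, outcomes, cost figures, and the helper `total_cost` are all invented for the example.

```r
# Invented predicted probabilities of "bad" and true outcomes (1 = bad).
p_bad  <- c(0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90)
is_bad <- c(0,    0,    0,    1,    0,    0,    1,    1,    0,    1)

# Total cost of classifying everyone with p_bad >= cutoff as "bad".
# cost_fn: cost of a missed default; cost_fp: income forgone on a
# good customer who was turned away. Both are hypothetical units.
total_cost <- function(cutoff, cost_fn, cost_fp) {
  pred_bad <- p_bad >= cutoff
  fn <- sum(!pred_bad & is_bad == 1)
  fp <- sum(pred_bad & is_bad == 0)
  cost_fn * fn + cost_fp * fp
}

cutoffs <- c(0.25, 0.45, 0.55, 0.65, 0.85)

# Equal costs versus a missed default costing five times as much.
costs_equal    <- sapply(cutoffs, total_cost, cost_fn = 1, cost_fp = 1)
costs_weighted <- sapply(cutoffs, total_cost, cost_fn = 5, cost_fp = 1)

best_equal    <- cutoffs[which.min(costs_equal)]
best_weighted <- cutoffs[which.min(costs_weighted)]

data.frame(cutoffs, costs_equal, costs_weighted)
```

On this toy data the preferred cutoff drops from 0.55 to 0.25 once missed defaults are weighted more heavily, which is the sense in which the threshold depends on the goal rather than on a single statistical metric.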

fcas80 · October 27, 2023, 6:07pm

The lender is most concerned about minimizing false negatives: predicted no default, but actually defaulted. Is sensitivity the only measure for this?
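For reference, with "bad" as the positive class a false negative is a bad loan predicted as good. Sensitivity = TP/(TP + FN) and the false negative rate FN/(TP + FN) = 1 - sensitivity measure the same thing among actual bads. The false omission rate FN/(FN + TN) instead looks at false negatives among the loans predicted good. A minimal base R sketch with invented labels (`actual` and `predicted` are illustrative, not model output):

```r
# Invented labels for illustration; "bad" is treated as the positive class.
actual    <- factor(c("bad", "bad", "bad", "bad", "good", "good",
                      "good", "good", "good", "good"),
                    levels = c("bad", "good"))
predicted <- factor(c("bad", "bad", "bad", "good", "good", "good",
                      "good", "good", "bad", "good"),
                    levels = c("bad", "good"))

# Rows are predictions, columns are actual outcomes.
cm <- table(predicted = predicted, actual = actual)

TP <- cm["bad", "bad"]     # predicted bad, actually bad
FN <- cm["good", "bad"]    # predicted good, actually bad (missed default)
FP <- cm["bad", "good"]    # predicted bad, actually good
TN <- cm["good", "good"]   # predicted good, actually good

sensitivity         <- TP / (TP + FN)  # share of actual bads caught
fnr                 <- FN / (TP + FN)  # 1 - sensitivity
specificity         <- TN / (TN + FP)  # share of actual goods cleared
false_omission_rate <- FN / (FN + TN)  # share of predicted goods that are bad

cm
c(sensitivity = sensitivity, fnr = fnr,
  specificity = specificity, false_omission_rate = false_omission_rate)
```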
