Thanks @arunchandra, that helps a lot. I've taken your data and the code above to put it all in one reprex
. I've renamed the initial model saturated
because it contains all variates, and I've used glm
rather than logistf
because it's simpler and the principles are the same.
The first thing to notice is the p-values
in the summary, which indicates the degree to which a variate contributes to the model. Using the conventional \alpha =0.05 not all are significant.
The reduced1
model eliminates those. In reduced2
only a single variate is used. One is added back in reduced3
.
We can see the differences in the summaries, the log likelihoods, odds ratios and goodness of fits.
The log likelihood
parameter returned by logLik
is mysterious in its interpretation (it makes calculations more tractable). The odr
function produces odds ratios
, a parameter for E(Y | X), the expectation of Y given X. Consider odds ratios of -0.5, 1, and 2: Y given X is half as likely, equally likely and twice as likely.
To your original question, it is, indeed possible to convert odds ratios to probabilities, using this snippet
par(las=1,bty="l")
curve(qlogis,from=0.01,to=0.99,
xlab="probability",ylab="log-odds")
abline(h=0,lty=2)
Can you see the difficulty that this could pose?
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(logistf))
suppressPackageStartupMessages(library(ResourceSelection))
# function to compute odds ratio
odr <- function(x) {
exp(cbind(OR = coef(x), confint(x)))
}
modeldata4 <- tibble::tribble(
~orderid, ~naomiclassi, ~noofcustomercontacts, ~numberofojcontacts, ~shippedlate, ~deliveredlate, ~averagetimebetweencommunication, ~ordervaluebucket, ~atozclaim,
4124763, 1, 2, 14, 1, 1, 3.366666, 8, 1,
4122313, 1, 2, 2, 1, 1, 13.933333, 8, 1,
4120369, 1, 2, 4, 1, 1, 0, 10, 1,
4106030, 1, 2, 3, 0, 1, 5.683333, 9, 1,
4104061, 1, 0, 3, 1, 0, 0, 11, 1,
4102438, 1, 9, 11, 1, 0, 3.758333, 12, 1,
4100101, 1, 48, 34, 1, 1, 4.576922, 10, 1,
4099436, 1, 22, 22, 1, 1, 3.252777, 9, 1,
4096928, 1, 8, 16, 1, 1, 2.949999, 10, 1,
4095858, 1, 0, 1, 1, 0, 0, 10, 1,
4094935, 1, 6, 12, 1, 0, 32.920833, 11, 1,
4093557, 1, 2, 10, 1, 1, 4.15, 10, 1,
4091346, 1, 2, 2, 1, 1, 0, 11, 1,
4091022, 1, 11, 14, 1, 0, 5.881481, 10, 1,
4090255, 0, 1, 6, 1, 0, 0.383333, 10, 1,
4090235, 1, 1, 3, 1, 0, 6.533333, 9, 1,
4090107, 1, 4, 10, 1, 1, 8.7, 9, 1,
4090098, 1, 2, 3, 1, 0, 0.016666, 9, 1,
4089524, 1, 8, 16, 1, 1, 0.358333, 8, 1,
4088698, 1, 4, 18, 1, 1, 9.674999, 8, 1,
4179262, 1, 2, 4, 1, 0, 1.05, 9, 0,
4179240, 1, 0, 4, 1, 0, 0, 8, 0,
4179195, 1, 1, 3, 1, 0, 0.25, 9, 0,
4179101, 1, 0, 1, 1, 0, 0, 10, 0,
4179088, 1, 0, 0, 1, 0, 0, 4, 0,
4179059, 1, 0, 1, 1, 0, 0, 5, 0,
4178980, 1, 2, 0, 1, 0, 0, 9, 0,
4178927, 0, 1, 6, 1, 0, 4.883333, 9, 0,
4178887, 0, 0, 0, 1, 0, 0, 9, 0,
4178884, 0, 1, 4, 1, 0, 10.8, 9, 0,
4178855, 1, 0, 0, 1, 0, 0, 9, 0,
4178845, 0, 0, 0, 1, 0, 0, 10, 0,
4178767, 0, 0, 0, 1, 0, 0, 9, 0,
4178695, 1, 0, 0, 1, 0, 0, 7, 0,
4178691, 0, 0, 0, 1, 0, 0, 9, 0,
4178552, 1, 0, 0, 1, 0, 0, 10, 0,
4178441, 1, 0, 0, 1, 0, 0, 10, 0,
4178363, 1, 0, 0, 1, 0, 0, 7, 0,
4178341, 1, 1, 2, 1, 0, 0.566666, 9, 0,
4178230, 1, 0, 3, 1, 0, 0, 7, 0,
4178221, 1, 0, 3, 1, 0, 0, 7, 0,
4178213, 1, 0, 0, 1, 0, 0, 8, 0,
4178123, 1, 0, 0, 1, 0, 0, 9, 0,
4178020, 1, 0, 0, 1, 0, 0, 9, 0,
4177992, 1, 0, 0, 1, 0, 0, 8, 0,
4177950, 0, 0, 0, 1, 0, 0, 9, 0,
4177876, 1, 0, 0, 1, 0, 0, 9, 0,
4177873, 1, 0, 0, 1, 0, 0, 9, 0,
4177822, 1, 0, 0, 1, 0, 0, 8, 0,
4177765, 1, 1, 0, 1, 0, 0, 8, 0,
4177634, 1, 0, 0, 1, 0, 0, 8, 0,
4177586, 1, 0, 0, 1, 0, 0, 10, 0,
4177585, 1, 0, 3, 1, 0, 0, 8, 0,
4177566, 1, 3, 4, 1, 0, 4.866666, 8, 0,
4177482, 1, 0, 0, 1, 0, 0, 9, 0,
4177480, 1, 0, 1, 1, 0, 0, 11, 0,
4177469, 1, 0, 0, 1, 0, 0, 9, 0,
4177461, 1, 0, 0, 1, 0, 0, 9, 0,
4177358, 1, 0, 0, 1, 0, 0, 8, 0,
4177312, 1, 0, 1, 1, 0, 0, 11, 0,
4177256, 1, 0, 0, 1, 0, 0, 9, 0,
4177212, 1, 0, 0, 1, 0, 0, 8, 0,
4177195, 1, 0, 0, 1, 0, 0, 10, 0,
4177159, 1, 0, 0, 1, 0, 0, 7, 0,
4177122, 0, 0, 0, 1, 0, 0, 9, 0,
4177055, 0, 0, 0, 1, 0, 0, 8, 0,
4176867, 1, 0, 0, 1, 0, 0, 8, 0,
4176845, 1, 1, 2, 1, 0, 1.45, 7, 0,
4176819, 1, 3, 1, 1, 0, 5.083333, 10, 0,
4176814, 1, 0, 0, 1, 0, 0, 9, 0
)
# saturated model
saturated <- glm(atozclaim ~ naomiclassi + noofcustomercontacts + numberofojcontacts + shippedlate + deliveredlate + averagetimebetweencommunication + ordervaluebucket, data=modeldata4)
summary(saturated)
#>
#> Call:
#> glm(formula = atozclaim ~ naomiclassi + noofcustomercontacts +
#> numberofojcontacts + shippedlate + deliveredlate + averagetimebetweencommunication +
#> ordervaluebucket, data = modeldata4)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.30837 -0.13333 -0.05838 0.00075 0.81595
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.453560 0.372006 -1.219 0.227376
#> naomiclassi 0.089934 0.092032 0.977 0.332267
#> noofcustomercontacts -0.020982 0.010089 -2.080 0.041705 *
#> numberofojcontacts 0.041136 0.013238 3.107 0.002847 **
#> shippedlate -0.275705 0.289239 -0.953 0.344186
#> deliveredlate 0.470521 0.125943 3.736 0.000409 ***
#> averagetimebetweencommunication 0.012304 0.007665 1.605 0.113503
#> ordervaluebucket 0.082415 0.024643 3.344 0.001403 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.0691428)
#>
#> Null deviance: 14.2857 on 69 degrees of freedom
#> Residual deviance: 4.2869 on 62 degrees of freedom
#> AIC: 21.145
#>
#> Number of Fisher Scoring iterations: 2
logLik(saturated)
#> 'log Lik.' -1.572722 (df=9)
odr(saturated)
#> Waiting for profiling to be done...
#> OR 2.5 % 97.5 %
#> (Intercept) 0.6353622 0.3064570 1.3172653
#> naomiclassi 1.0941020 0.9135241 1.3103751
#> noofcustomercontacts 0.9792367 0.9600625 0.9987938
#> numberofojcontacts 1.0419935 1.0153056 1.0693829
#> shippedlate 0.7590367 0.4305879 1.3380235
#> deliveredlate 1.6008279 1.2506670 2.0490266
#> averagetimebetweencommunication 1.0123804 0.9972855 1.0277037
#> ordervaluebucket 1.0859060 1.0347043 1.1396414
saturated_gof <- hoslem.test(saturated$y, fitted(saturated), g=10)
saturated_gof
#>
#> Hosmer and Lemeshow goodness of fit (GOF) test
#>
#> data: saturated$y, fitted(saturated)
#> X-squared = 2.37, df = 8, p-value = 0.9675
saturated_exp_ob <- cbind(saturated_gof$observed,saturated_gof$expected)
saturated_exp_ob
#> y0 y1 yhat0 yhat1
#> [-0.31,0.0111] 7 0 7.7540233 -0.75402326
#> (0.0111,0.02] 13 0 12.7810589 0.21894112
#> (0.02,0.0608] 1 0 0.9395629 0.06043709
#> (0.0608,0.102] 13 0 11.7591515 1.24084846
#> (0.102,0.157] 1 0 0.8566066 0.14339335
#> (0.157,0.185] 7 1 6.5400318 1.45996815
#> (0.185,0.286] 4 2 4.5915284 1.40847157
#> (0.286,0.614] 4 3 4.4776175 2.52238252
#> (0.614,1.01] 0 7 1.0895950 5.91040496
#> (1.01,1.27] 0 7 -0.7891760 7.78917603
# reduced model with all coefficients having p-values < 0.05
reduced1 <- glm(atozclaim ~ noofcustomercontacts + numberofojcontacts + deliveredlate + ordervaluebucket, data=modeldata4)
summary(reduced1)
#>
#> Call:
#> glm(formula = atozclaim ~ noofcustomercontacts + numberofojcontacts +
#> deliveredlate + ordervaluebucket, data = modeldata4)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.34941 -0.14438 -0.09881 0.01671 0.81119
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.67707 0.21841 -3.100 0.002860 **
#> noofcustomercontacts -0.02354 0.00978 -2.406 0.018960 *
#> numberofojcontacts 0.04569 0.01196 3.818 0.000302 ***
#> deliveredlate 0.51981 0.11654 4.460 3.32e-05 ***
#> ordervaluebucket 0.08621 0.02463 3.500 0.000846 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.0712616)
#>
#> Null deviance: 14.286 on 69 degrees of freedom
#> Residual deviance: 4.632 on 65 degrees of freedom
#> AIC: 20.566
#>
#> Number of Fisher Scoring iterations: 2
odr(reduced1)
#> Waiting for profiling to be done...
#> OR 2.5 % 97.5 %
#> (Intercept) 0.5081026 0.3311614 0.7795845
#> noofcustomercontacts 0.9767386 0.9581936 0.9956426
#> numberofojcontacts 1.0467484 1.0224862 1.0715863
#> deliveredlate 1.6817048 1.3382950 2.1132343
#> ordervaluebucket 1.0900344 1.0386622 1.1439475
logLik(reduced1)
#> 'log Lik.' -4.282998 (df=6)
reduced1_gof <- hoslem.test(reduced1$y, fitted(reduced1), g=10)
reduced1_gof
#>
#> Hosmer and Lemeshow goodness of fit (GOF) test
#>
#> data: reduced1$y, fitted(reduced1)
#> X-squared = 6.4281, df = 8, p-value = 0.5994
reduced1_exp_ob <- cbind(reduced1_gof$observed,reduced1_gof$expected)
reduced1_exp_ob
#> y0 y1 yhat0 yhat1
#> [-0.332,0.0108] 7 0 7.770093 -0.7700926
#> (0.0108,0.0126] 8 0 7.899182 0.1008180
#> (0.0126,0.0988] 18 0 16.339170 1.6608295
#> (0.0988,0.155] 2 0 1.725584 0.2744163
#> (0.155,0.187] 7 0 5.748143 1.2518573
#> (0.187,0.242] 4 3 5.495243 1.5047566
#> (0.242,0.568] 4 3 4.349050 2.6509499
#> (0.568,0.991] 0 7 1.730992 5.2690081
#> (0.991,1.26] 0 7 -1.057457 8.0574569
# minimal model with only one variate
reduced2 <- glm(atozclaim ~ deliveredlate, data=modeldata4)
summary(reduced2)
#>
#> Call:
#> glm(formula = atozclaim ~ deliveredlate, data = modeldata4)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.1379 -0.1379 -0.1379 0.0000 0.8621
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.13793 0.04182 3.298 0.00155 **
#> deliveredlate 0.86207 0.10100 8.536 2.34e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.1014199)
#>
#> Null deviance: 14.2857 on 69 degrees of freedom
#> Residual deviance: 6.8966 on 68 degrees of freedom
#> AIC: 42.428
#>
#> Number of Fisher Scoring iterations: 2
logLik(reduced2)
#> 'log Lik.' -18.21412 (df=3)
odr(reduced2)
#> Waiting for profiling to be done...
#> OR 2.5 % 97.5 %
#> (Intercept) 1.147896 1.057568 1.245939
#> deliveredlate 2.368055 1.942779 2.886424
reduced2_gof <- hoslem.test(reduced2$y, fitted(reduced2), g=10)
reduced2_gof
#>
#> Hosmer and Lemeshow goodness of fit (GOF) test
#>
#> data: reduced2$y, fitted(reduced2)
#> X-squared = 2.4865e-29, df = 8, p-value = 1
reduced2_exp_ob <- cbind(reduced2_gof$observed,reduced2_gof$expected)
reduced2_exp_ob
#> y0 y1 yhat0 yhat1
#> [0.138,1] 50 20 50 20
# reduced model with two variables with lowest p-values
reduced3 <- glm(atozclaim ~ deliveredlate + ordervaluebucket, data=modeldata4)
summary(reduced3)
#>
#> Call:
#> glm(formula = atozclaim ~ deliveredlate + ordervaluebucket, data = modeldata4)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.35018 -0.15783 -0.07090 0.03453 0.84217
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.70777 0.23767 -2.978 0.004037 **
#> deliveredlate 0.82614 0.09364 8.822 7.97e-13 ***
#> ordervaluebucket 0.09618 0.02667 3.606 0.000593 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.08620277)
#>
#> Null deviance: 14.2857 on 69 degrees of freedom
#> Residual deviance: 5.7756 on 67 degrees of freedom
#> AIC: 32.012
#>
#> Number of Fisher Scoring iterations: 2
logLik(reduced3)
#> 'log Lik.' -12.00575 (df=4)
odr(reduced3)
#> Waiting for profiling to be done...
#> OR 2.5 % 97.5 %
#> (Intercept) 0.4927427 0.3092552 0.7850972
#> deliveredlate 2.2844849 1.9014236 2.7447178
#> ordervaluebucket 1.1009545 1.0448815 1.1600367
reduced3_gof <- hoslem.test(reduced3$y, fitted(reduced3), g=10)
reduced3_gof
#>
#> Hosmer and Lemeshow goodness of fit (GOF) test
#>
#> data: reduced3$y, fitted(reduced3)
#> X-squared = 2.2295, df = 8, p-value = 0.9732
reduced3_exp_ob <- cbind(reduced3_gof$observed,reduced3_gof$expected)
reduced3_exp_ob
#> y0 y1 yhat0 yhat1
#> [-0.323,-0.0345] 8 0 8.7570900 -0.7570900
#> (-0.0345,0.0617] 12 0 11.2601726 0.7398274
#> (0.0617,0.158] 21 2 19.3699137 3.6300863
#> (0.158,0.254] 7 3 7.4599260 2.5400740
#> (0.254,0.35] 2 2 2.5992602 1.4007398
#> (0.35,0.984] 0 8 1.0505549 6.9494451
#> (0.984,1.18] 0 5 -0.4969174 5.4969174
Created on 2019-12-31 by the reprex package (v0.3.0)