Hi, and welcome!
You'll attract more and better answers with a reproducible example, called a reprex. Without either actual or representative data, it's hard to comment on your specific code.
I can, however, offer some general guidance.
Logistic regression bears some underlying similarities to linear regression, but the differences are considerable. If you will be doing much work in this area, an essential resource is Applied Logistic Regression, 3rd edition, by David W. Hosmer Jr., Stanley Lemeshow and Rodney X. Sturdivant (2013), the standard text. Unfortunately, it is software agnostic and doesn't include examples in R or any other language. I'm working on an R companion at my blog, but it will be a few weeks until Chapter 1 is complete.
Consider a data frame of binary values and a fully saturated glm model:
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))
binary_data <- read_csv("https://tuva.s3-us-west-2.amazonaws.com/binary_data.csv")
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [3]
#> Parsed with column specification:
#> cols(
#> X1 = col_double(),
#> Y = col_double(),
#> X1_1 = col_double(),
#> X2 = col_double(),
#> X3 = col_double(),
#> X4 = col_double(),
#> X5 = col_double(),
#> X6 = col_double(),
#> X7 = col_double(),
#> X8 = col_double(),
#> X9 = col_double(),
#> X10 = col_double(),
#> X11 = col_double(),
#> X12 = col_double(),
#> X13 = col_double(),
#> X14 = col_double(),
#> X15 = col_double(),
#> X16 = col_double()
#> )
# remove row names and rename first independent variable
binary_data <- binary_data %>% select(-X1) %>% rename(X1 = X1_1)
head(binary_data)
#> # A tibble: 6 x 17
#> Y X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 0 0 0 0 1 0 0 0 0
#> 2 1 0 0 0 0 0 0 0 1 0 0 0 0
#> 3 1 0 0 0 0 0 0 0 0 0 0 0 0
#> 4 1 0 0 0 0 0 0 0 0 0 0 0 0
#> 5 1 0 0 0 0 0 1 0 0 0 0 0 0
#> 6 1 0 0 1 0 0 0 0 0 0 0 0 0
#> # … with 4 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>
# create fully saturated model (equivalently Y ~ ., since every other column is a predictor)
fit <- glm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X14 + X15 + X16, data = binary_data, family = binomial(link = "logit"))
summary(fit)
#>
#> Call:
#> glm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 +
#> X10 + X11 + X12 + X13 + X14 + X15 + X16, family = binomial(link = "logit"),
#> data = binary_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8522 0.2025 0.2916 0.3122 1.9870
#>
#> Coefficients: (3 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.5533 0.2347 6.619 3.61e-11 ***
#> X1 NA NA NA NA
#> X2 -2.8596 0.2961 -9.659 < 2e-16 ***
#> X3 1.4435 0.2471 5.843 5.13e-09 ***
#> X4 -2.1107 0.2599 -8.121 4.63e-16 ***
#> X5 12.0127 239.4433 0.050 0.9600
#> X6 0.7076 0.3098 2.284 0.0224 *
#> X7 0.3745 0.4453 0.841 0.4002
#> X8 1.5831 0.2570 6.160 7.29e-10 ***
#> X9 2.3236 0.3343 6.950 3.65e-12 ***
#> X10 NA NA NA NA
#> X11 -2.3006 0.2963 -7.763 8.28e-15 ***
#> X12 -3.3779 0.5360 -6.302 2.94e-10 ***
#> X13 0.3780 0.2677 1.412 0.1580
#> X14 -2.4696 0.3938 -6.271 3.58e-10 ***
#> X15 2.4970 0.3740 6.677 2.44e-11 ***
#> X16 NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 6153.2 on 9348 degrees of freedom
#> Residual deviance: 4167.3 on 9335 degrees of freedom
#> AIC: 4195.3
#>
#> Number of Fisher Scoring iterations: 12
The first thing to note is the change between the Null deviance and the Residual deviance at the bottom of the summary. It plays a role similar to that of the F statistic in lm: the drop from the null model shows that the predictors explain a good deal of the variation.
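If you want to formalize that comparison, one option (a quick sketch, not part of the reprex above) is a chi-squared test on the drop in deviance, with the degrees of freedom taken straight from the fitted object:
# likelihood-ratio style test of the overall model against the null model
with(fit, pchisq(null.deviance - deviance,
                 df.null - df.residual,
                 lower.tail = FALSE))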
The p-values assess whether each coefficient is useful. To begin, some of the coefficients (X1, X10 and X16) are NA. It turns out that those columns contain nothing but zeros, so they tell us literally nothing and obviously should be eliminated from the model.
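A quick way to confirm that (not shown in the reprex) is to sum those columns:
# each sum should be exactly zero if the columns contain nothing but zeros
colSums(binary_data[, c("X1", "X10", "X16")])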
Some of the remaining coefficients fail the significance test at the traditional \alpha = 0.05 level. We'll eliminate those, along with the NA coefficients, and rerun the model:
# remove the NA coefficients and those with p-values > $\alpha$
fit2 <- glm(Y ~ X2 + X3 + X4 + X6 + X8 + X9 + X11 + X12 + X14 + X15, data = binary_data, family = binomial(link = "logit"))
summary(fit2)
#>
#> Call:
#> glm(formula = Y ~ X2 + X3 + X4 + X6 + X8 + X9 + X11 + X12 + X14 +
#> X15, family = binomial(link = "logit"), data = binary_data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8522 0.2025 0.2916 0.3122 1.9870
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 1.8663 0.1080 17.281 < 2e-16 ***
#> X2 -3.1726 0.2104 -15.082 < 2e-16 ***
#> X3 1.1305 0.1328 8.515 < 2e-16 ***
#> X4 -2.4237 0.1554 -15.598 < 2e-16 ***
#> X6 0.3946 0.2293 1.721 0.0852 .
#> X8 1.2701 0.1505 8.440 < 2e-16 ***
#> X9 2.0106 0.2615 7.690 1.47e-14 ***
#> X11 -2.6136 0.2107 -12.402 < 2e-16 ***
#> X12 -3.6909 0.4939 -7.473 7.83e-14 ***
#> X14 -2.7826 0.3342 -8.327 < 2e-16 ***
#> X15 2.1840 0.3105 7.033 2.01e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 6153.2 on 9348 degrees of freedom
#> Residual deviance: 4170.7 on 9338 degrees of freedom
#> AIC: 4192.7
#>
#> Number of Fisher Scoring iterations: 6
Created on 2019-12-17 by the reprex package (v0.3.0)
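If you want a formal comparison of the two nested models, a likelihood-ratio test is one option (a sketch only; the output isn't reproduced here):
# fit2 is nested in fit, so the deviance difference is approximately chi-squared
anova(fit2, fit, test = "Chisq")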
We would eliminate X6 in like manner, perhaps leading to further rounds of elimination.
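One way to do that (again a sketch; the refit isn't shown here) is with update():
# drop X6 from the reduced model and refit
fit3 <- update(fit2, . ~ . - X6)
summary(fit3)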
But let's skip over those rounds to find the missing measure of probability. Strangely, it isn't printed in the summary, but it can be extracted from the glm object with
logLik(fit)
#> 'log Lik.' -2084 (df=14)
What's this? The log likelihood of the fitted model. It's hard to read directly; its main use is in comparing candidate models, and it's also what the AIC in the summary is built from.
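As a quick sanity check (not part of the original output), the AIC reported in the summary is just 2k - 2 * logLik, where k is the number of estimated parameters:
k  <- attr(logLik(fit), "df")   # number of estimated parameters, 14 here
ll <- as.numeric(logLik(fit))   # the log likelihood printed above
2 * k - 2 * ll                  # reproduces the AIC shown by summary(fit)
AIC(fit)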
The coefficients themselves are usually interpreted by converting them to odds ratios:
odr <- function(x) {
  # exponentiate the coefficients and their profile-likelihood confidence
  # intervals to convert from log-odds to odds ratios
  exp(cbind(OR = coef(x), confint(x)))
}
odr(fit2)
#> Waiting for profiling to be done...
#> OR 2.5 % 97.5 %
#> (Intercept) 6.46464646 5.258904976 8.03440121
#> X2 0.04189453 0.027435667 0.06266866
#> X3 3.09726563 2.381064505 4.00914584
#> X4 0.08859375 0.065075959 0.11970440
#> X6 1.48385417 0.959876953 2.36568686
#> X8 3.56106908 2.651198481 4.78618345
#> X9 7.46796870 4.587235443 12.86209781
#> X11 0.07327303 0.048110916 0.11004687
#> X12 0.02494960 0.008361653 0.06034486
#> X14 0.06187500 0.031190057 0.11665747
#> X15 8.88164006 5.031918109 17.18366500
Created on 2019-12-17 by the reprex package (v0.3.0)
An odds ratio of 1 means Y is equally likely with or without X; an OR greater than 1, say 2, means Y is twice as likely when X is present; an OR of 0.5 means it's half as likely. For example, X9's OR of roughly 7.5 means the odds of Y are about 7.5 times higher when X9 = 1 than when X9 = 0.
For goodness of fit, there's the Hosmer-Lemeshow goodness of fit test. Its null hypothesis, H_0, is that the model fits the data well, so here a high p-value is what you're hoping for: it means there's no evidence of lack of fit. The test divides the fitted probabilities for the response variable Y into deciles and compares the observed and expected counts of events in each decile. Note that the way you read the p-value is the reverse of what you may be used to: a small p-value signals a problem with the model, not a welcome result.
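One implementation is hoslem.test() in the ResourceSelection package (the choice of package is mine; any implementation of the test will do):
# install.packages("ResourceSelection") if you don't already have it
library(ResourceSelection)
# compare observed outcomes to fitted probabilities across g = 10 groups
hoslem.test(fit2$y, fitted(fit2), g = 10)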
There's much more, but I hope this will get you oriented.