I am fitting a logistic regression model for my thesis and I am looking at univariable models to determine significant predictors that I can include in the multivariable model for prediction. The output for one of the variables (wealth quintile) is shown below. Should I conclude that the variable is a significant predictor or not?
The best way to get a good answer to your question is to give us a more in-depth explanation of the data you're working with, your goals, and your reason for using regression. Interpreting machine learning model parameters is a tricky business and should always be done in the full context of the data and the question at hand.
Apart from more details on the issue itself, it's always a good idea to provide a reprex: a minimal dataset together with the code you'd like to run.
Hi, and welcome to the community and, as well, to the wonderfully wacky world of logistic regression. I'm in the middle of unpacking, so I can just give you the view from 40,000 feet tonight.
The typical logistic regression model is in the form
glm(y ~ x_1 + ... + x_n, family = binomial)
There are four steps in evaluating a logistic model.
Selection of the parameters. There are several ways to do this. One is to start from a full model with all of the available independent variables. For a given significance level α, the x terms that have a p-value greater than α are successively discarded from the model (backward elimination).
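For what it's worth, here is a minimal sketch of that kind of backward elimination. It is only one possible reading of the procedure: the data are simulated, the names sim, x1, x2, x3 and the 0.05 cutoff are made up for illustration, and I use likelihood ratio p-values from drop1() rather than the Wald p-values in the coefficient table.

```r
# Backward elimination by likelihood-ratio p-value (illustrative sketch only).
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))  # made-up predictors
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1))              # only x1 affects y

alpha <- 0.05                                                   # assumed cutoff
fit <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial())   # start from the full model

repeat {
  # Stop if no predictors are left to drop
  if (length(attr(terms(fit), "term.labels")) == 0) break
  tests <- drop1(fit, test = "LRT")     # one LRT per term currently in the model
  pvals <- tests[["Pr(>Chi)"]][-1]      # first row is "<none>" with an NA p-value
  if (max(pvals) <= alpha) break        # every remaining term passes the cutoff
  worst <- rownames(tests)[-1][which.max(pvals)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))       # refit without it
}

formula(fit)
```

The loop removes one term per pass and refits, so the p-values are always those of the current model rather than of the original full model.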
Next come the odds ratios (OR), obtained by exponentiating the coefficients of the remaining terms. An OR indicates whether observing x makes observing y more likely (OR > 1), less likely (OR < 1) or equally likely (OR = 1), and a two-sided confidence interval for the OR shows whether OR = 1 is a plausible value.
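As a rough sketch of how odds ratios and their intervals can be pulled out of a fitted model: the data below are simulated, the names sim, x and y are made up, and I use Wald intervals from confint.default() for simplicity (profile likelihood intervals are another option).

```r
# Odds ratios with Wald 95% confidence intervals (illustrative sketch only).
set.seed(42)
n <- 400
sim <- data.frame(x = rbinom(n, 1, 0.5))             # made-up binary exposure
sim$y <- rbinom(n, 1, plogis(-0.3 + 0.8 * sim$x))    # simulated with true OR = exp(0.8)

fit <- glm(y ~ x, data = sim, family = binomial())

or <- exp(coef(fit))               # exponentiated coefficients; the intercept row is the baseline odds
ci <- exp(confint.default(fit))    # Wald intervals on the log-odds scale, back-transformed
cbind(OR = or, ci)
```

If the interval for x excludes 1, the data are not compatible with "equally likely" at that confidence level.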
Next comes a goodness of fit test, such as the Hosmer-Lemeshow test, whose null hypothesis H_0 is that the model fits the data; accordingly a small p-value is evidence of a poor fit, while a large p-value means there is no evidence against the fit. The hoslem.test function produces the test statistic, along with tables of expected and observed frequencies; as far as I know it lives in the ResourceSelection package, and the generalhoslem package provides logitgof for the same purpose.
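To make the mechanics concrete, here is a hand-rolled sketch of the Hosmer-Lemeshow statistic in base R. It is not the implementation from any package: hl_test is a hypothetical helper, the data are simulated, and it assumes the fitted probabilities are spread out enough that none of the g groups is empty.

```r
# Hand-rolled Hosmer-Lemeshow test (illustrative sketch, not a package implementation).
hl_test <- function(y, p, g = 10) {
  # y: 0/1 outcomes, p: fitted probabilities, g: number of groups
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp  <- cut(p, breaks = breaks, include.lowest = TRUE)
  n_g  <- tapply(y, grp, length)       # group sizes
  obs1 <- tapply(y, grp, sum)          # observed events per group
  exp1 <- tapply(p, grp, sum)          # expected events per group
  obs0 <- n_g - obs1                   # observed non-events
  exp0 <- n_g - exp1                   # expected non-events
  stat <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
  df   <- length(n_g) - 2
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE),
       table = cbind(n = n_g, observed = obs1, expected = exp1))
}

set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + x))     # simulated from the same form as the fitted model
fit <- glm(y ~ x, family = binomial())

res <- hl_test(y, fitted(fit))
res$table                              # observed vs expected events by risk group
res$p.value                            # small values point to a poor fit
```

Large gaps between the observed and expected columns in particular risk groups are what drive the statistic up and the p-value down.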
If the stars align, the final step of model diagnostics may not be needed.
The standard text is Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression, 3rd Edition. 2013. New York, USA: John Wiley and Sons.
Since wealthquintile is encoded as an ordered factor, you get a set of polynomial variables generated from that one column. I don't think that this is the best idea and tend to convert these variables to unordered factors (but that's just my preference).
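A small sketch of the difference between the two encodings, assuming R's default contrast options; x_ord and x_fac are made-up names standing in for the wealth quintile column.

```r
# Ordered vs unordered factor coding (illustrative sketch only).
x_ord <- ordered(c("a", "b", "c", "d", "a", "b", "c", "d"))
contrasts(x_ord)                          # polynomial contrasts: columns .L, .Q, .C

x_fac <- factor(x_ord, ordered = FALSE)   # same levels, no ordering
contrasts(x_fac)                          # treatment contrasts: dummy columns b, c, d

colnames(model.matrix(~ x_ord))           # linear, quadratic and cubic terms
colnames(model.matrix(~ x_fac))           # one indicator per non-reference level
```

With the unordered version, each coefficient compares one level against the reference level, which is usually easier to explain in a thesis than linear, quadratic and cubic trend terms.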
Since you have multiple polynomial terms, no single coefficient p-value answers the question; you need an overall test that compares the models with and without this predictor (an analysis of deviance). Here is an example using a different data set:
library(broom)
set.seed(2424)
dat <- data.frame(
  y = factor(rep(c("yes", "no"), 200)),
  x = ordered(sample(letters[1:4], 400, replace = TRUE))
)
lr_mod <- glm(y ~ x, data = dat, family = binomial())
# 4 level factor => 3 polynomial variables
tidy(lr_mod)
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)  0.00334     0.101    0.0332   0.973
#> 2 x.L         -0.148       0.206   -0.718    0.473
#> 3 x.Q          0.239       0.201    1.19     0.234
#> 4 x.C          0.0968      0.196    0.494    0.622
# 3 df test for the overall effect of x
anova(lr_mod, test = "LRT")
#> Analysis of Deviance Table
#>
#> Model: binomial, link: logit
#>
#> Response: y
#>
#> Terms added sequentially (first to last)
#>
#>
#>      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
#> NULL                   399     554.52
#> x     3   2.2333       396     552.28   0.5254
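The "with and without this predictor" comparison can also be written out explicitly as two nested models. This is a sketch on the same kind of simulated data; null_mod, ord_mod, fac_mod and x_fac are names I made up. It also checks that the overall test does not depend on whether the factor is ordered.

```r
# Explicit nested-model comparison, and a check that the coding does not matter
# for the overall test (illustrative sketch only).
set.seed(2424)
dat <- data.frame(
  y = factor(rep(c("yes", "no"), 200)),
  x = ordered(sample(letters[1:4], 400, replace = TRUE))
)
dat$x_fac <- factor(dat$x, ordered = FALSE)    # same levels, treatment contrasts

null_mod <- glm(y ~ 1,     data = dat, family = binomial())  # without the predictor
ord_mod  <- glm(y ~ x,     data = dat, family = binomial())  # polynomial coding
fac_mod  <- glm(y ~ x_fac, data = dat, family = binomial())  # dummy coding

cmp_ord <- anova(null_mod, ord_mod, test = "LRT")   # 3 df likelihood ratio test
cmp_fac <- anova(null_mod, fac_mod, test = "LRT")
cmp_ord
all.equal(deviance(ord_mod), deviance(fac_mod))     # same fit under either coding
```

The individual coefficients differ between the two codings, but both span the same four group means, so the deviance and the 3 df test for the predictor as a whole agree.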