Need help with interpreting model parameters

ernestkirui2010 · October 8, 2019, 4:30pm

I am fitting a logistic regression model for my thesis and I am looking at univariable models to determine significant predictors that I can include in the multivariable model for prediction. The output for one of the variables (wealth quintile) is shown below. Should I conclude that the variable is a significant predictor or not?

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.28478    0.04200  -6.781 1.19e-11 ***
wealthquintile.L -0.62622    0.09939  -6.301 2.96e-10 ***
wealthquintile.Q -0.23414    0.09694  -2.415   0.0157 *  
wealthquintile.C -0.10522    0.08983  -1.171   0.2415    
wealthquintile^4 -0.11141    0.08904  -1.251   0.2109

pieterjanvc · October 9, 2019, 12:28am

Hi,

Welcome to the RStudio community!

The best way of getting a good answer to your question is to give us a more in-depth explanation of the data you're working with, your goals and the reason for using regression. Interpretation of machine learning model parameters is a tricky business and should always be done in the full context of the data and question at hand.

Apart from some more details on the issue itself, it's always a good idea to provide a reprex as well: a minimal dataset together with the code you would like to run on it.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue
The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…
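As a small illustration of the "minimal dataset" idea, one common approach is to paste the output of dput() for a few rows, so that others can rebuild exactly the same data frame. This is only a sketch using the built-in iris data; the object names sample_data, dput_text and rebuilt are made up for the example.

```r
# Take a handful of rows that is enough to show the problem
sample_data <- head(iris, 5)

# dput() prints R code that recreates the object; this text is what
# you would paste into a forum post
dput_text <- capture.output(dput(sample_data))
cat(dput_text, sep = "\n")

# A reader can rebuild the data frame by evaluating the pasted text
rebuilt <- eval(parse(text = paste(dput_text, collapse = "\n")))
all.equal(rebuilt, sample_data)
```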

Kind regards,
PJ

technocrat · October 9, 2019, 3:58am

Hi, and welcome to the community and, as well, to the wonderfully wacky world of logistic regression. I'm in the middle of unpacking, so I can just give you the view from 40,000 feet tonight.

The typical logistic regression model is in the form

glm(y ~ x_1 + ... + x_n, family = binomial)

There are four steps in evaluating a logistic model.

1. Selection of the parameters. There are several ways to do this. One is backward elimination: start with a full model containing all of the available independent variables and, for a given \alpha, successively discard the x term with the largest p-value greater than \alpha, refitting the model after each removal.

2. Calculation of odds ratios.

odr <- function(x) {
    exp(cbind(OR = coef(x), confint(x)))
}

This indicates whether a one-unit increase in x makes observing y more likely (OR > 1), less likely (OR < 1) or equally likely (OR = 1). If the two-sided confidence interval excludes 1, the effect is significant at the corresponding level.

3. Next comes a goodness of fit test, such as Hosmer-Lemeshow, which has a null hypothesis H_0 that the model fits the data; accordingly a high p-value means there is no evidence of a poor fit (a low p-value signals lack of fit). The hoslem.test function in the ResourceSelection package and the logitgof function in the generalhoslem package both produce the test statistic, along with tables of expected and observed frequencies.

4. If the stars align, the final step of model diagnostics may not be needed.
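A rough sketch of the odds-ratio and goodness-of-fit steps, on simulated data, might look like the following. Everything here is an assumption made for illustration: the data, the variable names x1 and x2, and the helper hl_test, which is a hand-rolled base R version of the Hosmer-Lemeshow statistic and not the function from either package. It also uses Wald intervals from confint.default() rather than the profile-likelihood intervals that confint() gives.

```r
# Simulated data (hypothetical): binary outcome, two numeric predictors
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                               # no true effect on y
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))
dat <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = dat, family = binomial())

# Odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
or_table

# Hosmer-Lemeshow statistic by hand: split observations into g groups
# by fitted probability and compare observed with expected event counts
hl_test <- function(model, g = 10) {
  p      <- fitted(model)
  obs    <- model$y
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  o1  <- tapply(obs, grp, sum)     # observed events per group
  e1  <- tapply(p, grp, sum)       # expected events per group
  n_g <- tapply(p, grp, length)    # group sizes
  stat <- sum((o1 - e1)^2 / (e1 * (1 - e1 / n_g)))
  df   <- length(o1) - 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

hl <- hl_test(fit)
hl   # a large p-value means no evidence of lack of fit
```

An interval in or_table that excludes 1 points to a term worth keeping. The Hosmer-Lemeshow result depends on the number of groups g, so it is best read as a rough check and not as a verdict.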

The standard text is Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression, 3rd Edition. 2013. New York, USA: John Wiley and Sons.

Max · October 10, 2019, 10:09am

Since wealthquintile is encoded as an ordered factor, you get a set of polynomial variables generated from that one column. I don't think that this is the best idea and tend to convert these variables to unordered factors (but that's just my preference).
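To make the difference concrete, here is a minimal sketch on made-up data, assuming R's default contrast settings. The column name x_unordered and the model names fit_ord and fit_unord are hypothetical. The ordered version gives polynomial terms (.L, .Q, .C), while the unordered version gives one dummy variable per non-reference level. Both are re-parameterisations of the same model, so the overall fit is identical.

```r
set.seed(2424)
dat <- data.frame(
  y = factor(rep(c("yes", "no"), 200)),
  x = ordered(sample(letters[1:4], 400, replace = TRUE))
)

# Drop the ordering: the levels stay the same, but the default
# contrasts switch from polynomial to treatment (dummy) coding
dat$x_unordered <- factor(dat$x, ordered = FALSE)

fit_ord   <- glm(y ~ x,           data = dat, family = binomial())
fit_unord <- glm(y ~ x_unordered, data = dat, family = binomial())

names(coef(fit_ord))    # polynomial terms
names(coef(fit_unord))  # level b, c, d each compared with level a

# Same model, different coding: the deviances agree
all.equal(deviance(fit_ord), deviance(fit_unord))
```

With the unordered coding, each coefficient is the log odds ratio of that level against the reference level, which is often easier to report.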

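Since you have multiple polynomial terms, the individual p-values do not answer the question on their own. You would have to conduct an overall test (an analysis of deviance) comparing the model with and without this predictor. Here is an example using a different data set: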

library(broom)

set.seed(2424)
dat <- data.frame(
  y = factor(rep(c("yes", "no"), 200)),
  x = ordered(sample(letters[1:4], 400, replace = TRUE))
)

lr_mod <- glm(y ~ x, data = dat, family = binomial())

# 4 level factor => 3 polynomial variables
tidy(lr_mod)
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)  0.00334     0.101    0.0332   0.973
#> 2 x.L         -0.148       0.206   -0.718    0.473
#> 3 x.Q          0.239       0.201    1.19     0.234
#> 4 x.C          0.0968      0.196    0.494    0.622

# 3 df test for the overall effect of x
anova(lr_mod, test = "LRT")
#> Analysis of Deviance Table
#> 
#> Model: binomial, link: logit
#> 
#> Response: y
#> 
#> Terms added sequentially (first to last)
#> 
#> 
#>      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
#> NULL                   399     554.52         
#> x     3   2.2333       396     552.28   0.5254

Created on 2019-10-10 by the reprex package (v0.3.0)
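The "with and without" comparison can also be written out explicitly, which may help when the predictor sits in a larger model. This is a sketch on the same simulated data. The names full_mod, null_mod, cmp, lr_stat, lr_df and p_val are made up here, and the by-hand calculation is only there to show where the p-value comes from.

```r
set.seed(2424)
dat <- data.frame(
  y = factor(rep(c("yes", "no"), 200)),
  x = ordered(sample(letters[1:4], 400, replace = TRUE))
)

full_mod <- glm(y ~ x, data = dat, family = binomial())  # with x
null_mod <- glm(y ~ 1, data = dat, family = binomial())  # without x

# Likelihood ratio test comparing the two nested models
cmp <- anova(null_mod, full_mod, test = "LRT")
cmp

# The same test by hand: the drop in deviance is compared with a
# chi-squared distribution on (number of levels - 1) degrees of freedom
lr_stat <- deviance(null_mod) - deviance(full_mod)
lr_df   <- df.residual(null_mod) - df.residual(full_mod)
p_val   <- pchisq(lr_stat, lr_df, lower.tail = FALSE)
```

In a multivariable model, drop1(model, test = "LRT") runs this kind of comparison for each term in turn, so a factor with several levels gets one overall p-value.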
