Sample weighting in `glm()` logit model

Hi all,

I'm trying to run a simple logit regression where I estimate the (binary) odds of cohabitation with family (cohabit) from a person's (binary) gender (is_female). The unweighted logit regression looks fine, but I run into problems when I add weights: I get an implausibly huge coefficient and a warning that "fitted probabilities numerically 0 or 1 occurred." I thought the problem might be caused by categories with zero observations, but I've produced a reproducible example below with nonzero counts in every category, and the issue still occurs.

The reason I'm weighting in the first place is that my data come from a survey. Each weight is an integer indicating the number of people that one respondent represents. So one potentially equivalent workaround would be to create an artificial data set in which each row is repeated the number of times given by the weight column. But given that my data are already several gigabytes and the weights are typically around 100, this would significantly bloat my computations.
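For concreteness, the row-expansion workaround I have in mind would look something like this (a toy sketch using `tidyr::uncount()`, which repeats each row the number of times given by a weights column - the toy tibble here is just for illustration, not my real data):

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the real survey rows
toy <- tibble(cohabit = c(TRUE, FALSE), weight = c(3, 2))

# Repeat each row `weight` times; the result has sum(toy$weight) rows,
# which is why this approach blows up on gigabyte-scale data
expanded <- uncount(toy, weights = weight)
nrow(expanded)
```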

Thanks in advance!

library(dplyr)

# Sample data
data <- structure(list(
  cohabit = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, 
              FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
              FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  is_female = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
                FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, 
                TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  weight = c(106, 157, 131, 114, 163, 191, 159, 167, 37, 62, 60, 63, 70, 78, 294, 
             253, 39, 79, 74, 68, 65, 68, 93, 102, 87, 252, 116, 382, 312, 48)),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -30L)
)

# Demonstrate that there are nonzero counts of observations in each unique
# combination of the `is_female` and `cohabit` binary variables
data |>
  group_by(is_female, cohabit) |>
  summarize(
    total_weight = sum(weight),
    count = n(),
    .groups = "drop"
  )

The output:

  is_female cohabit total_weight count
  <lgl>     <lgl>          <dbl> <int>
1 FALSE     FALSE           1353    10
2 FALSE     TRUE             412     3
3 TRUE      FALSE           1742    12
4 TRUE      TRUE             383     5

Now I'll show what happens when I try running the two types of regression.

# Run the unweighted logit regression - looks okay
glm(cohabit ~ is_female, family = "binomial", data = data)

The output:

Call:  glm(formula = cohabit ~ is_female, family = "binomial", data = data)

Coefficients:
  (Intercept)  is_femaleTRUE  
      -1.2040         0.3285  

Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
Null Deviance:	    34.79 
Residual Deviance: 34.64 	AIC: 38.64

Now I run the messed up weighted model.

# Run the weighted logit - something's wrong!
glm(cohabit ~ is_female, family = "binomial", weight = weight, data = data)
Call:  glm(formula = cohabit ~ is_female, family = "binomial", data = data, 
    weights = weight)

Coefficients:
  (Intercept)  is_femaleTRUE  
   -4.414e+14      4.414e+14  

Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
Null Deviance:	    3940 
Residual Deviance: 31700 	AIC: 31710
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 

The coefficient on is_female is enormous, the algorithm did not converge, and "fitted probabilities numerically 0 or 1 occurred" - which surprises me, given that the table above shows nonzero counts in every combination of is_female and cohabit.

If you have survey data, you should account not only for the weights but also for the design. I recommend using the survey package for this.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#> 
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart

# Sample data
data <- structure(list(
  cohabit = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, 
              FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
              FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
  is_female = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
                FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, 
                TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
  weight = c(106, 157, 131, 114, 163, 191, 159, 167, 37, 62, 60, 63, 70, 78, 294, 
             253, 39, 79, 74, 68, 65, 68, 93, 102, 87, 252, 116, 382, 312, 48)),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -30L)
)

# Create survey design object

desobj <- svydesign(data = data, ids = ~1, weights = ~weight)

svyglm(cohabit ~ is_female, design = desobj, family = quasibinomial())
#> Independent Sampling design (with replacement)
#> svydesign(data = data, ids = ~1, weights = ~weight)
#> 
#> Call:  svyglm(formula = cohabit ~ is_female, design = desobj, family = quasibinomial())
#> 
#> Coefficients:
#>   (Intercept)  is_femaleTRUE  
#>       -1.1891        -0.3257  
#> 
#> Degrees of Freedom: 29 Total (i.e. Null);  28 Residual
#> Null Deviance:       30.38 
#> Residual Deviance: 30.26     AIC: NA

Created on 2024-12-30 with reprex v2.1.1

In R's glm() function for logistic regression, sample weights can be applied using the weights argument. These weights adjust the influence of each observation on the model. For a logit model:

model <- glm(formula, family = binomial(link = "logit"), data = your_data, weights = your_weights)

Here, your_weights is a numeric vector of weights corresponding to each observation. Weighted models are particularly useful for addressing oversampling, stratified samples, or varying observation importance.
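For frequency weights like the ones in this thread, another standard option is to aggregate the data and fit the binomial model in events/trials form, passing a two-column matrix of successes and failures to glm(). This is a sketch that assumes the `data` tibble defined in the question above; it encodes the weights as counts, so no row replication is needed:

```r
library(dplyr)

# Collapse to one row per is_female level, with weighted counts
# of cohabiting ("yes") and non-cohabiting ("no") respondents
counts <- data |>
  group_by(is_female) |>
  summarize(
    yes = sum(weight[cohabit]),
    no  = sum(weight[!cohabit]),
    .groups = "drop"
  )

# Two-column response: column 1 = successes, column 2 = failures
glm(cbind(yes, no) ~ is_female, family = binomial, data = counts)
```

Note that, like glm() with weights, this treats the weights as actual observation counts, so the standard errors will be those of the inflated sample; for design-correct inference the survey-package approach above is still the right tool.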

Thanks. I think this is the best solution - I haven't used this package before, but I'll start learning.
