Hi all,
I'm trying to run a simple logit regression where I predict a binary outcome, cohabitation with family (`cohabit`), from a person's binary gender (`is_female`). The unweighted logit regression looks fine, but when I add weights I get an implausibly huge coefficient and a warning that "fitted probabilities numerically 0 or 1 occurred." I thought the problem might be caused by categories with zero observations, but the reproducible example below has nonzero counts in every category and the issue still occurs.
The reason I'm weighting in the first place is that my data come from a survey: each weight is an integer giving the number of people that one respondent represents. So one potentially equivalent workaround would be to build an artificial data set in which each row is repeated as many times as its `weight` says. But since my data are already several gigabytes and the weights are typically around 100, that would make my computations roughly a hundred times heavier.
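For clarity, the expansion I have in mind is something like the sketch below (using `tidyr::uncount()` on the sample `data` defined further down; any row-repetition approach would do):
library(tidyr)
# Repeat each row `weight` times and drop the weight column, then fit unweighted.
# With weights around 100, this makes the data roughly 100x larger.
expanded <- uncount(data, weight)
glm(cohabit ~ is_female, family = "binomial", data = expanded)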
Thanks in advance!
library(dplyr)
# Sample data
data <- structure(list(
cohabit = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
is_female = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE),
weight = c(106, 157, 131, 114, 163, 191, 159, 167, 37, 62, 60, 63, 70, 78, 294,
253, 39, 79, 74, 68, 65, 68, 93, 102, 87, 252, 116, 382, 312, 48)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -30L)
)
# Demonstrate that there are nonzero counts of observations in each unique
# combination of the `is_female` and `cohabit` binary variables
data |>
  group_by(is_female, cohabit) |>
  summarize(
    total_weight = sum(weight),
    count = n(),
    .groups = "drop"
  )
The output:
is_female cohabit total_weight count
<lgl> <lgl> <dbl> <int>
1 FALSE FALSE 1353 10
2 FALSE TRUE 412 3
3 TRUE FALSE 1742 12
4 TRUE TRUE 383 5
Now I'll show what happens when I try running the two types of regression.
# Run the unweighted logit regression - looks okay
glm(cohabit ~ is_female, family = "binomial", data = data)
The output:
Call: glm(formula = cohabit ~ is_female, family = "binomial", data = data)
Coefficients:
(Intercept) is_femaleTRUE
-1.2040 0.3285
Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
Null Deviance: 34.79
Residual Deviance: 34.64 AIC: 38.64
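These coefficients line up with the raw cell counts in the table above, which is what I'd expect (my own arithmetic, assuming I've got the algebra right):
# With a single binary predictor, the fitted coefficients should just be the cell-count log odds
log(3 / 10)                 # intercept: log odds of cohabiting among males, ~ -1.2040
log(5 / 12) - log(3 / 10)   # is_femaleTRUE: difference in log odds,         ~  0.3285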
Now I run the messed up weighted model.
# Run the weighted logit - something's wrong!
glm(cohabit ~ is_female, family = "binomial", weights = weight, data = data)
The output:
Call: glm(formula = cohabit ~ is_female, family = "binomial", data = data,
weights = weight)
Coefficients:
(Intercept) is_femaleTRUE
-4.414e+14 4.414e+14
Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
Null Deviance: 3940
Residual Deviance: 31700 AIC: 31710
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
The coefficient on `is_female` is enormous, the algorithm did not converge, and I get the warning "fitted probabilities numerically 0 or 1 occurred" - which surprises me, given that the table above shows nonzero counts (and nonzero total weight) in every combination of `is_female` and `cohabit`.
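If I understand frequency weights correctly, a model this simple should just reproduce the weighted cell odds from the table above, so (unless I've misunderstood something) I'd expect roughly:
# Expected weighted coefficients, computed by hand from the weighted table
log(412 / 1353)                     # intercept: weighted log odds for males,  ~ -1.19
log(383 / 1742) - log(412 / 1353)   # is_femaleTRUE: weighted log odds ratio,  ~ -0.33
Nothing remotely on the order of 1e14, so I don't see where the weighted fit is going wrong.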