First, nice job asking your question: you included the error, the command that threw it, and a snippet of your data.
Here is a reproducible example, though, which I think will help because you can run the code and play with it yourself:
n <- 10
p <- 3
set.seed(123)
df <- data.frame(matrix(sample(4, n * p, TRUE), nrow = n, dimnames = list(NULL, c("y", "x1", "x2"))))
df[sample(n, 5), "x2"] <- NA
df
#>    y x1 x2
#> 1  3  4 NA
#> 2  3  2  4
#> 3  3  2 NA
#> 4  2  1 NA
#> 5  3  2  1
#> 6  2  3 NA
#> 7  2  4  4
#> 8  2  1  2
#> 9  3  3 NA
#> 10 1  3  2
So, we've made some data with NA values. Let's see what happens when we try to fit models to the data using a variety of na.action choices.
It's worth noting that the default, which lm() takes from getOption("na.action"), is na.omit in a fresh R session, so you should have a defensible reason for choosing something else before you do.
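If you want to confirm which default your own session uses, you can look at the option that lm() falls back on when na.action is not supplied. This is only a small sketch: the value is session-dependent, since a startup file or an earlier options() call may have changed it, and the variable names are just for illustration.

```r
# Look up the session-wide default that lm() uses when na.action is not given.
default_action <- getOption("na.action")
print(default_action)

# Temporarily switch the default, then restore the previous settings.
old <- options(na.action = "na.exclude")
print(getOption("na.action"))
options(old)
```

Passing na.action directly to lm(), as in the models here, affects only that one call and is usually easier to reason about than changing the global option.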
(m0 <- lm(y ~ x1 + x2, df))
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = df)
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 2.5811 -0.3784 0.2027
(m1 <- lm(y ~ x1 + x2, df, na.action = "na.omit"))
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = df, na.action = "na.omit")
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 2.5811 -0.3784 0.2027
(m2 <- lm(y ~ x1 + x2, df, na.action = "na.exclude"))
#>
#> Call:
#> lm(formula = y ~ x1 + x2, data = df, na.action = "na.exclude")
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 2.5811 -0.3784 0.2027
You'll notice the three results are the same because, again, "na.omit" is the default and "na.exclude" fits the model on the same complete rows; it just does a better job of keeping track of what was dropped, as we'll see next.
fitted(m1)
#>        2        5        7        8       10
#> 2.635135 2.027027 1.878378 2.608108 1.851351
fitted(m2)
#>        1        2        3        4        5        6        7        8
#>       NA 2.635135       NA       NA 2.027027       NA 1.878378 2.608108
#>        9       10
#>       NA 1.851351
You can see here that "na.exclude" kept track of which observations were dropped and returns NA for their fitted values (and for their residuals as well).
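The practical payoff of that padding is that the fitted values and residuals line up with the rows of the original data, so they can be stored as new columns. Here is a minimal sketch on a small made-up data frame (d and the variable names are hypothetical, not from your data):

```r
# Small hypothetical data set with missing predictor values in rows 3 and 6.
d <- data.frame(y = c(1, 3, 2, 5, 4, 6),
                x = c(1, 2, NA, 4, 5, NA))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# na.omit returns one residual per row actually used in the fit.
n_omit <- length(residuals(fit_omit))
# na.exclude pads the result back to one value per row of the data.
n_excl <- length(residuals(fit_excl))

# The fit remembers which rows were dropped.
dropped <- as.integer(na.action(fit_excl))

# Because the padded vectors line up with the rows, they can be stored directly.
d$fitted <- fitted(fit_excl)
d$resid  <- residuals(fit_excl)
print(d)
```

With fit_omit the same two assignments would fail, because the vectors have fewer elements than d has rows.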
The remaining choices produce errors, for different reasons. The first, "na.fail", is hopefully not too hard to understand: it tells R to throw an error if the data contain any NA values.
m3 <- lm(y ~ x1 + x2, df, na.action = "na.fail")
#> Error in na.fail.default(structure(list(y = c(3L, 3L, 3L, 2L, 3L, 2L, : missing values in object
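If you do choose "na.fail" as a deliberate guard, one possible pattern is to catch the error and then look at where the missing values are. This is only a sketch on a made-up data frame; d, fit_or_message and n_missing are illustrative names.

```r
# Hypothetical data with one missing predictor value.
d <- data.frame(y = c(1, 3, 2, 5), x = c(1, NA, 3, 4))

# Attempt the fit; on failure keep the error message instead of stopping.
fit_or_message <- tryCatch(
  lm(y ~ x, data = d, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
print(fit_or_message)

# Count the missing values in each column to see what triggered the error.
n_missing <- colSums(is.na(d))
print(n_missing)
```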
Finally, choosing NULL, I believe, has the same effect as choosing "na.pass", which causes lm() to simply do what you asked, no more, no less: it does no pre-processing of the data and happily throws an error when the computations fail because of the NAs.
m4 <- lm(y ~ x1 + x2, df, na.action = NULL)
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'x'
m5 <- lm(y ~ x1 + x2, df, na.action = "na.pass")
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'x'
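To see what "no pre-processing" means in practice, you can build the model frame yourself, which is the step where lm() applies na.action. The sketch below uses a made-up data frame (d and the variable names are hypothetical) and compares "na.pass" with "na.omit":

```r
# Hypothetical data with one missing predictor value.
d <- data.frame(y = c(1, 3, 2, 5), x = c(1, NA, 3, 4))

# na.pass leaves the data untouched, so the NA is still there.
mf_pass <- model.frame(y ~ x, data = d, na.action = na.pass)
# na.omit drops the incomplete row before any fitting happens.
mf_omit <- model.frame(y ~ x, data = d, na.action = na.omit)

rows_pass <- nrow(mf_pass)
rows_omit <- nrow(mf_omit)
has_na    <- anyNA(mf_pass$x)

print(c(pass = rows_pass, omit = rows_omit))
print(has_na)
```

The NA that survives under "na.pass" is what the fitting routine then complains about in the errors above.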
Created on 2020-09-03 by the reprex package (v0.3.0)