mice.impute.norm.boot(y, ry, x, wy = NULL, ...) Help

This is what's called a function signature, and the key to interpreting it is first to recall school algebra and second to learn how to use the help(mice.impute.norm.boot) resource, which isn't necessarily as simple as reading the, uh, manual.

First: f(x) = y, generically. You have an object containing a collection of numbers, some of which are missing, indicated by NA. Let x stand for that object, such as a variable in the data frame you are working on. What you want is to replace the NA values with imputed values based on a multiple imputation method. You could just take the mean of the values, excluding NA values, and you should, to have something to compare the results of the multiple imputation method against.
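As a minimal sketch of that mean-fill baseline (the numbers here are made up for illustration and the names x_mean and x_filled are mine, not from any package):

```r
# Toy vector with missing values (made-up numbers)
x <- c(4.2, NA, 5.1, 6.3, NA, 5.0)

# Mean of the observed values only
x_mean <- mean(x, na.rm = TRUE)

# Copy x and fill each NA with that mean
x_filled <- x
x_filled[is.na(x_filled)] <- x_mean

x_filled
```

Every missing value gets the same number, which is why this is only a yardstick and not a substitute for a proper imputation method.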

The function object to do this we will call f, which uses x and, perhaps, other objects to construct y in one or more steps. It can be composite, like g(f(x)) in school algebra.

You've found one in the {mice} package. You had reasons to choose that one over any of the alternatives in the missing data task view. Among the available functions in {mice} you've chosen mice.impute.norm.boot() over all of mice.impute.cart(), mice.impute.lasso.logreg(), mice.impute.lasso.norm(), mice.impute.lasso.select.logreg(), mice.impute.lasso.select.norm(), mice.impute.lda(), mice.impute.logreg.boot(), mice.impute.logreg(), mice.impute.mean(), mice.impute.midastouch(), mice.impute.mnar.logreg(), mice.impute.mpmm(), mice.impute.norm.nob(), mice.impute.norm.predict(), mice.impute.norm(), mice.impute.pmm(), mice.impute.polr(), mice.impute.polyreg(), mice.impute.quadratic(), mice.impute.rf(), mice.impute.ri()

If you have a specific reason for the choice, good. But, if not, review van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67 to understand why that is the appropriate choice.

Assuming that the choice of mice.impute.norm.boot() has been made, several steps will already have been taken:

  1. Inspection of the missingness for the variable of interest
  2. Assessment of the reasonableness of the missing at random assumption
  3. An imputation model selected
  4. Other variables to use to model the variable of interest selected
  5. Whether transformations of other variables are to be used
  6. The order in which variables should be imputed
  7. Setting up the starting imputations and number of iterations
  8. Selecting the number of multiply imputed datasets to generate

The package authors emphasize the importance of these questions:

Please realize that these choices are always needed. The analysis in Section 2.4 imputed the nhanes data using just a minimum of specifications and relied on mice defaults. However, these default choices are not necessarily the best for your data. There is no magical setting that produces appropriate imputations in every problem. Real problems need tailoring. It is our hope that the software will invite you to go beyond the default settings.

At their heart, these steps are what to do rather than how to do it.
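For orientation only, here is a hedged sketch of how several of the eight choices listed above map onto arguments of mice(). It uses the nhanes data that ships with {mice}; the particular values (five imputations, five iterations, a monotone visit sequence, the seed) are arbitrary illustrations, not recommendations, and the requireNamespace() guard is only there so the snippet does nothing if {mice} isn't installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Step 1: inspect the missingness pattern
  pattern <- md.pattern(nhanes, plot = FALSE)

  # Step 4: start from the default predictor matrix;
  # set entries to 0 to drop a predictor for a given variable
  pred <- make.predictorMatrix(nhanes)

  # Step 3: imputation model; step 6: visit order;
  # step 7: iterations; step 8: number of imputed datasets
  imp <- mice(
    nhanes,
    method          = "norm.boot",
    predictorMatrix = pred,
    visitSequence   = "monotone",
    maxit           = 5,
    m               = 5,
    seed            = 123,
    printFlag       = FALSE
  )

  # The first of the five completed datasets
  completed <- complete(imp, 1)
}
```

Steps 2 and 5 (the missing at random assumption and transformations) don't reduce to a single argument; they are judgement calls made before this code is written.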

With that, let's look at the help() part.

mice.impute.norm.boot() takes at least three arguments: y, ry and x. It can also take wy, which is NULL by default.

To find out what to put, look in the section Arguments

  • y is easy enough: it's the vector with the missing values
  • ry is a logical vector indicating, for each value of y, whether it is present (TRUE) or missing (FALSE)
  • x is a numeric design matrix with length(y) rows with predictors for y. Matrix x may have no missing values.
  • ... are optional arguments

Let's take these in turn. Without a reprex (see the FAQ), I have to make stuff up. I'll be using mtcars$mpg as y, modified to create some missingness. You can just enter mtcars to see the starting point; it's a built-in dataset of numeric variables.

y <- mtcars$mpg
# make all the values under 11 zero and all the values over 23 NA
y <- ifelse(y < 11, 0, y)
y <- ifelse(y > 23, NA, y)
y
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3   NA 22.8 19.2 17.8 16.4 17.3 15.2  0.0
#> [16]  0.0 14.7   NA   NA   NA 21.5 15.5 15.2 13.3 19.2   NA   NA   NA 15.8 19.7
#> [31] 15.0 21.4

Created on 2023-03-05 with reprex v2.0.2

My fake y covers two types of missingness, the properly entered NA and the too-common substitution of the value of zero for a measurement that wasn't taken at all.

Let's construct ry

# y is now the fake data; turn it into a logical vector
# of non-missingness (TRUE = observed, FALSE = NA or zero)
ry <- !(is.na(y) | y == 0)

Now, I need to choose some variables with which to fill in the missing pieces in y

# y has length 32, so we need a design matrix with
# 32 rows and one column for each variable we want to use in
# filling in y's missing values
# we'll use five: disp, hp, drat, wt, qsec
x <- mtcars[, 3:7]
head(x)
#>                   disp  hp drat    wt  qsec
#> Mazda RX4          160 110 3.90 2.620 16.46
#> Mazda RX4 Wag      160 110 3.90 2.875 17.02
#> Datsun 710         108  93 3.85 2.320 18.61
#> Hornet 4 Drive     258 110 3.08 3.215 19.44
#> Hornet Sportabout  360 175 3.15 3.440 17.02
#> Valiant            225 105 2.76 3.460 20.22

Let's pause a second and note that I have skipped any attempt to go through the steps above to assess whether these are appropriate variables. Take this for form only, not substance.

# how'd we do?
# what the missing values actually were
mtcars[which(ry == FALSE),][1]
#>                      mpg
#> Merc 240D           24.4
#> Cadillac Fleetwood  10.4
#> Lincoln Continental 10.4
#> Fiat 128            32.4
#> Honda Civic         30.4
#> Toyota Corolla      33.9
#> Fiat X1-9           27.3
#> Porsche 914-2       26.0
#> Lotus Europa        30.4
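# what we imputed (a random draw, so values differ from run to run)
library(mice)
imputed <- mice.impute.norm.boot(y, ry, x)
imputed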
#>                         [,1]
#> Merc 240D           23.65021
#> Cadillac Fleetwood  16.21280
#> Lincoln Continental 15.37891
#> Fiat 128            25.53483
#> Honda Civic         28.73862
#> Toyota Corolla      25.94176
#> Fiat X1-9           25.41096
#> Porsche 914-2       23.26414
#> Lotus Europa        20.41665
# difference
mtcars[which(ry == FALSE),][1] - imputed
#>                            mpg
#> Merc 240D            0.7497891
#> Cadillac Fleetwood  -5.8127991
#> Lincoln Continental -4.9789128
#> Fiat 128             6.8651715
#> Honda Civic          1.6613750
#> Toyota Corolla       7.9582434
#> Fiat X1-9            1.8890447
#> Porsche 914-2        2.7358629
#> Lotus Europa         9.9833460

Sometimes ok, sometimes pretty off. But we only know because we were using known values that we only pretended were missing. In practice, we have no way to assess how good the variable selection actually is. At least without doing a lot more. In this case, my weakly informed guess is that the predictors need some weighting or other transformation to translate into good tools to impute the missing values.
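Because the true values are known in this toy setup, one rough way to put a number on "sometimes ok, sometimes pretty off" is to compare the draw against the mean-fill baseline mentioned at the start. This is a sketch only: the imputed values are copied from the output printed above rather than recomputed, and rmse() is a helper I define here, not a {mice} function.

```r
y_true <- mtcars$mpg
ry <- !(y_true < 11 | y_true > 23)   # same nine held-out cases as above

# Baseline: fill every held-out value with the mean of the observed ones
baseline <- mean(y_true[ry])

# The single norm.boot draw printed above (copied, not recomputed)
imputed <- c(23.65021, 16.21280, 15.37891, 25.53483, 28.73862,
             25.94176, 25.41096, 23.26414, 20.41665)

# Root mean squared error between truth and estimate
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

rmse_baseline <- rmse(y_true[!ry], baseline)
rmse_normboot <- rmse(y_true[!ry], imputed)
c(baseline = rmse_baseline, norm.boot = rmse_normboot)
```

A single draw is noisy by design, so a fairer comparison would repeat the imputation many times; and with real missing data there are no true values to compare against at all.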

Two lessons to take away:

  1. {mice} functions aren't mechanical; they require a good bit of judgement to use properly. Pick the simplest options first, then add complexity only for as long as you remain comfortable that you know what's happening.

  2. I used a fair number of subset operators, [ and ]. On a vector, an integer selects the index position. On a matrix or data frame it goes [row,column]. If only one index is given on a data frame, as in object[7], it selects the column; on a matrix it selects the seventh element instead. Notice how I used a vector of TRUE and FALSE values to subset out the portions of an object that met conditions.
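The subsetting rules in the second lesson can be tried out on small objects. The names v, df and m below are throwaway examples of mine, not objects used earlier.

```r
v <- c(10, 20, 30, 40)
v[3]          # integer index: the third element
v[v > 15]     # logical index: elements where the condition is TRUE

df <- mtcars[1:5, 1:4]   # small data frame: 5 rows, 4 columns
df[2, 3]      # [row, column]: a single cell
df[3]         # one index on a data frame: the third column, still a data frame

m <- as.matrix(df)
m[3]          # one index on a matrix: the third element, counting down the columns
```

The last two lines are the ones that catch people out: the same single index means a column on a data frame and a single element on a matrix.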