I am a beginner-level user. I am trying to impute missing data for numerical (population) variables in a dataframe. After careful consideration, I decided to impute the data with:
mice.impute.norm.boot
I've loaded the dataframe into R, but I don't understand how to use the command now.
mice.impute.norm.boot(y, ry, x, wy = NULL, ...)
Where do I get the values for y, ry, x, wy?
Sorry if that's a stupid question. I'm trying to prepare data for a bachelor thesis and I've really hit a dead end here.
This is what's called a function signature, and the key to interpreting it is, first, to recall school algebra and, second, to learn how to use the help(mice.impute.norm.boot) resource, which isn't necessarily as simple as reading the, uh, manual.
First: f(x) = y, generically. You have an object containing a collection of numbers, some of which are missing, indicated by NA. Let x stand for that object, such as a variable in the data frame you are working on. What you want is to replace the NA values with values produced by a multiple imputation method. You could also just take the mean of the non-NA values; do that anyway, so you have something to compare the multiple imputation results against.
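Just to make that baseline concrete, here's mean imputation in base R on a toy vector (the numbers are made up for illustration):

```r
# toy vector with two missing values
v <- c(21.0, NA, 22.8, NA, 18.7)

# replace every NA with the mean of the observed values
v_mean <- v
v_mean[is.na(v_mean)] <- mean(v, na.rm = TRUE)
v_mean
#> both NAs become 20.8333..., the mean of 21.0, 22.8, and 18.7
```

Crude, but it gives you a yardstick for judging whatever the fancier method produces.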
The function object to do this we will call f; it uses x and, perhaps, other objects to construct y in one or more steps. It can be composite, like g(f(x)) in school algebra.
You've found one in the {mice} package. You had reasons to choose that one over any of the alternatives in the missing data task view. Among the available functions in {mice} you've chosen mice.impute.norm.boot() over all of mice.impute.cart(), mice.impute.lasso.logreg(), mice.impute.lasso.norm(), mice.impute.lasso.select.logreg(), mice.impute.lasso.select.norm(), mice.impute.lda(), mice.impute.logreg.boot(), mice.impute.logreg(), mice.impute.mean(), mice.impute.midastouch(), mice.impute.mnar.logreg(), mice.impute.mpmm(), mice.impute.norm.nob(), mice.impute.norm.predict(), mice.impute.norm(), mice.impute.pmm(), mice.impute.polr(), mice.impute.polyreg(), mice.impute.quadratic(), mice.impute.rf(), mice.impute.ri()
Assuming that the choice of mice.impute.norm.boot() has been made, several steps will already have been taken:
Inspection of the missingness for the variable of interest
Assessment of the reasonableness of the missing at random assumption
An imputation model selected
Other variables to use to model the variable of interest selected
Whether transformations of other variables are to be used
The order in which variables should be imputed
Setting up the starting imputations and number of iterations
Selecting the number of multiply imputed datasets to generate
The package authors emphasize the importance of these questions:
Please realize that these choices are always needed. The analysis in Section 2.4 imputed the nhanes data using just a minimum of specifications and relied on mice defaults. However, these default choices are not necessarily the best for your data. There is no magical setting that produces appropriate imputations in every problem. Real problems need tailoring. It is our hope that the software will invite you to go beyond the default settings.
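The steps above map roughly onto arguments of the top-level mice() function. This is a sketch only, not a recommendation: nhanes ships with {mice}, and the particular settings here (the method, quickpred(), visitSequence, maxit, m, and the seed) are placeholders you would choose for your own data.

```r
library(mice)

md.pattern(nhanes)  # inspect the missingness pattern first

# each argument below corresponds to one of the steps listed above;
# the specific values are illustrative, not advice
imp <- mice(nhanes,
            method          = "norm.boot",        # imputation model
            predictorMatrix = quickpred(nhanes),  # which variables predict which
            visitSequence   = "monotone",         # order of imputation
            maxit           = 10,                 # number of iterations
            m               = 5,                  # number of imputed datasets
            seed            = 42,                 # arbitrary seed, reproducibility only
            printFlag       = FALSE)
imp$method  # which method ended up assigned to each variable
```

The rest of this answer drills down one level, calling the imputation function directly rather than through mice().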
At its heart, this list is about what to do rather than how to do it.
With that, let's look at the help() part.
mice.impute.norm.boot() takes at least three arguments: y, ry, and x. It can also take wy, but by default that is NULL.
To find out what to pass, look in the Arguments section of the help page:
y is easy enough: it's the vector with the missing values
ry is a logical vector indicating, for each element of y, whether it is present (TRUE) or missing (FALSE)
x is a numeric design matrix with length(y) rows containing predictors for y; x may have no missing values
... are optional arguments
Let's take these in turn. Without a reprex (see the FAQ), I have to make stuff up. I'll be using mtcars$mpg as y, modified to create some missingness. You can just enter mtcars to see the starting point; it's a built-in dataset of numeric variables.
y <- mtcars$mpg
# make all the values under 11 zero and all the values over 23 NA
y <- ifelse(y < 11, 0, y)
y <- ifelse(y > 23, NA, y)
y
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 NA 22.8 19.2 17.8 16.4 17.3 15.2 0.0
#> [16] 0.0 14.7 NA NA NA 21.5 15.5 15.2 13.3 19.2 NA NA NA 15.8 19.7
#> [31] 15.0 21.4
My fake y covers two types of missingness: the properly entered NA and the too-common substitution of zero for a measurement that wasn't taken at all.
Let's construct ry
# y is now the beginning fake data--turn it into a logical vector
# of non-missingness
ry <- ifelse(y == 0 | is.na(y), FALSE, TRUE)
ry
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#> [13] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
#> [25] TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
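As an aside, the same ry can be built more compactly, relying on the fact that FALSE & NA evaluates to FALSE in R:

```r
# rebuild the fake y from mtcars, as above
y <- mtcars$mpg
y <- ifelse(y < 11, 0, y)
y <- ifelse(y > 23, NA, y)

# for NA elements, !is.na(y) is FALSE, and FALSE & NA is FALSE,
# so missing values come out FALSE without an explicit is.na() branch
ry <- !is.na(y) & y != 0
sum(ry)   # 23 usable observations out of 32
sum(!ry)  # 9 values to impute
```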
Now I need to choose some variables with which to fill in the missing pieces in y.
# y has length 32, so we need a design matrix with 32 rows and
# as many columns as variables we want to use in filling in
# y's missing values
# we'll use five: disp, hp, drat, wt, qsec
x <- mtcars[,3:7]
head(x)
#> disp hp drat wt qsec
#> Mazda RX4 160 110 3.90 2.620 16.46
#> Mazda RX4 Wag 160 110 3.90 2.875 17.02
#> Datsun 710 108 93 3.85 2.320 18.61
#> Hornet 4 Drive 258 110 3.08 3.215 19.44
#> Hornet Sportabout 360 175 3.15 3.440 17.02
#> Valiant 225 105 2.76 3.460 20.22
Let's pause a second and note that I have skipped any attempt to go through the steps above to assess whether these are appropriate variables. Take this for form only, not substance.
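Here's the call itself, tying the pieces together. mice.impute.norm.boot() wants a numeric matrix for x, and because norm.boot fits its regression on a bootstrap resample, the exact values change run to run unless you set a seed (the seed below is arbitrary, so your imputed numbers will differ from the ones shown next):

```r
library(mice)

# rebuild the example objects
y  <- mtcars$mpg
y  <- ifelse(y < 11, 0, y)
y  <- ifelse(y > 23, NA, y)
ry <- ifelse(y == 0 | is.na(y), FALSE, TRUE)
x  <- as.matrix(mtcars[, 3:7])  # norm.boot expects a numeric matrix

set.seed(42)  # arbitrary seed; the bootstrap draw is random
imputed <- mice.impute.norm.boot(y, ry, x)
imputed  # one imputed value per FALSE in ry
```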
# how'd we do?
# what the missing values actually were
mtcars[which(ry == FALSE),][1]
#> mpg
#> Merc 240D 24.4
#> Cadillac Fleetwood 10.4
#> Lincoln Continental 10.4
#> Fiat 128 32.4
#> Honda Civic 30.4
#> Toyota Corolla 33.9
#> Fiat X1-9 27.3
#> Porsche 914-2 26.0
#> Lotus Europa 30.4
# what we imputed
imputed
#> [,1]
#> Merc 240D 23.65021
#> Cadillac Fleetwood 16.21280
#> Lincoln Continental 15.37891
#> Fiat 128 25.53483
#> Honda Civic 28.73862
#> Toyota Corolla 25.94176
#> Fiat X1-9 25.41096
#> Porsche 914-2 23.26414
#> Lotus Europa 20.41665
# difference
mtcars[which(ry == FALSE),][1] - imputed
#> mpg
#> Merc 240D 0.7497891
#> Cadillac Fleetwood -5.8127991
#> Lincoln Continental -4.9789128
#> Fiat 128 6.8651715
#> Honda Civic 1.6613750
#> Toyota Corolla 7.9582434
#> Fiat X1-9 1.8890447
#> Porsche 914-2 2.7358629
#> Lotus Europa 9.9833460
Sometimes ok, sometimes pretty far off. But we only know that because we used known values that we only pretended were missing. In practice, we have no way to assess how good the variable selection actually is, at least not without doing a lot more work. In this case, my weakly informed guess is that the predictors need some weighting or other transformation to become good tools for imputing the missing values.
Two lessons to take away:
{mice} functions aren't mechanical—they require a good bit of judgment to use properly. Pick the simplest options first, then keep going until you start to lose confidence that you know what's happening.
I used a fair number of subset operators, [ and ]. On a vector, an integer selects by index position. On a matrix or data frame it goes [row, column], but if only one index is given, as in object[7], it selects a column. Notice how I used a vector of TRUE and FALSE values to subset out the portions of a vector that met conditions.
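A quick base-R demonstration of those subsetting rules, with made-up data:

```r
v <- c(10, 20, 30, 40)
v[2]                            # integer index on a vector: 20
v[c(TRUE, FALSE, TRUE, FALSE)]  # logical index keeps TRUE positions: 10 30

d <- data.frame(a = 1:3, b = 4:6)
d[2, "b"]    # [row, column] on a data frame: 5
d[1]         # a single index selects a whole column (still a data frame)
d[d$a > 1, ] # logical row selection, all columns kept
```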