I understand that we can use the MICE package to deal with missing data. However, I want to know whether there is a maximum percentage of missing data beyond which imputation is no longer advisable. For example, it probably won't be a good idea to impute a variable with ~80% missing data.
I currently have a variable with 25% missing data (n=12; my overall sample size is 48). Is it okay to impute this variable or should I remove this from my analyses?
"it probably won't be a good idea to impute a variable with ~80% missing data" is not entirely accurate, as it depends on the nature of the missingness and on the information in the other, non-missing variables. Two things matter.
The first is the nature of missingness. Are the data missing at random, observed at random, or missing completely at random? There are many useful resources on the definitions of MAR, OAR, and MCAR.
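To make the distinction concrete, here is a minimal simulation sketch in plain Python (not the mice package). The variables x and y, the sample size, and the missingness probabilities are all made up for illustration. Both mechanisms remove about 25% of y, but only the one that depends on the observed x distorts the complete-case mean.

```python
import random
import statistics

random.seed(42)

n = 20_000
# x is always observed; y depends on x, so the two are correlated.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# MCAR: every y value has the same 25% chance of being missing.
y_mcar = [yi for yi in y if random.random() >= 0.25]

# MAR: y is missing far more often when the observed x is large
# (45% vs 5%, which also averages to roughly 25% missing overall).
y_mar = [yi for xi, yi in zip(x, y)
         if random.random() >= (0.45 if xi > 0 else 0.05)]

full_mean = statistics.fmean(y)
mcar_mean = statistics.fmean(y_mcar)
mar_mean = statistics.fmean(y_mar)

print(f"full data mean:          {full_mean:.3f}")
print(f"complete cases, MCAR:    {mcar_mean:.3f}")
print(f"complete cases, MAR:     {mar_mean:.3f}")
```

Under MCAR the complete cases are a random subsample, so their mean stays close to the full-data mean. Under MAR it drifts, which is where the information in x becomes useful.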
The second is how much information you have about the missing values. If there are close correlates of the variable with missing values, and the relationship between those correlates and the variable does not depend on the missingness, then imputing lets you keep everything you have observed and generate reasonable imputations for the rest. If the data are panel/longitudinal, both the cross-sectional and the time dimension can often be usefully deployed as informative for the missing values.
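A rough sketch of how a close correlate can be put to work, in plain Python rather than mice itself. It assumes toy data where x is fully observed and y is missing more often when x is large. Each round refits a simple regression on a bootstrap resample of the complete cases, fills the gaps with prediction plus noise, and the results are pooled with Rubin's rules. The helper fit_line and all names and numbers are hypothetical, and this is a simplification of what a real multiple-imputation routine does.

```python
import random
import statistics

random.seed(1)

def fit_line(xs, ys):
    """Least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

# Toy data: x fully observed, y missing more often when x is large (MAR).
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [2 + 1.5 * xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < (0.45 if xi > 0 else 0.05) else yi
     for xi, yi in zip(x, y_true)]

observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

m = 20  # number of completed datasets
estimates, variances = [], []
for _ in range(m):
    # Refit on a bootstrap resample so parameter uncertainty is carried along.
    boot = random.choices(observed, k=len(observed))
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Fill each gap with a prediction plus random noise.
    completed = [yi if yi is not None else a + b * xi + random.gauss(0, sd)
                 for xi, yi in zip(x, y)]
    estimates.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / n)

# Rubin's rules: pooled estimate plus within- and between-imputation variance.
pooled = statistics.fmean(estimates)
within = statistics.fmean(variances)
between = statistics.variance(estimates)
total_var = within + (1 + 1 / m) * between

complete_case = statistics.fmean([yi for _, yi in observed])
print(f"complete-case mean of y: {complete_case:.3f}")
print(f"pooled imputed mean:     {pooled:.3f} (se {total_var ** 0.5:.3f})")
```

In this setup the complete-case mean is pulled downward because large-x rows are under-represented, while the pooled estimate uses x to recover most of that. The between-imputation term is what keeps the standard error honest about the values that were never observed.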
Personally, I am more persuaded by the view that one should always impute data unless there are good reasons not to that relate to the above conditions. Omitting the variable throws away information, which I am not convinced one should ever do without good reasons.
@rwalker Hi, thanks for these valuable inputs. So, would you use imputed data throughout all the analyses including t-tests, chi-squared tests, etc?
I only see papers using imputed data when they are running regression analyses so I am unsure if it is necessary to also use those in other tests!
For univariate analysis, nothing prevents you from reporting results on both the complete and the incomplete data. This is probably the most transparent way of doing things. Strictly speaking, "multiple" in multiple imputation refers to generating several completed datasets; the chaining in MICE is what handles several incomplete variables, and with a single variable there is no real need for it.
You are right that it is typically deployed in regression-type analyses, because there the missing data cause us to lose observed information on the other covariates. And you were right to infer that this was the set of models I had in mind as the target. For analyses of one or two variables, it is a somewhat different question.
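As a small illustration of that loss, here is a sketch in plain Python using the numbers from the question (48 rows, 12 of them missing one variable). The column names age, score, and income are invented for the example.

```python
import random

random.seed(0)

n_rows = 48
# Hypothetical columns: 'age' and 'score' fully observed, 'income' has gaps.
rows = [{"age": random.randint(20, 70),
         "score": random.gauss(50, 10),
         "income": random.gauss(40, 8)} for _ in range(n_rows)]

# Blank out income in 12 randomly chosen rows (25% of 48).
for i in random.sample(range(n_rows), 12):
    rows[i]["income"] = None

# Listwise deletion: a row is kept only if every variable is observed.
complete = [r for r in rows if all(v is not None for v in r.values())]

dropped = n_rows - len(complete)
# Each dropped row also discards its observed 'age' and 'score' values.
discarded_observed_values = sum(
    1 for r in rows if r["income"] is None
    for k, v in r.items() if v is not None
)

print(f"rows available to a regression: {len(complete)}")
print(f"rows dropped:                   {dropped}")
print(f"observed values thrown away:    {discarded_observed_values}")
```

A regression on all three variables would run on 36 rows, and the 24 perfectly good age and score values in the dropped rows go unused. A t-test on income alone loses nothing beyond the 12 missing values themselves, which is why the question is different there.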
My final comment on this would be that it is a really good idea to read up on and understand the methods you intend to deploy; in this case, I would strongly encourage you to consider work by Schafer, King et al., and others on multiple imputation to bolster your understanding of the use cases and the tradeoffs.
Thank you @rwalker !
So, in summary, do we simply drop/ignore missing data when it comes to the univariate analyses? If not (given that MICE is not an option), what else can we do to handle the missing data in the univariate analyses?
"I would strongly encourage you to consider work by Schafer, King et al, and others on multiple imputation to bolster your understanding of the use-cases and the tradeoffs."
Joe Schafer's book on missing data, the Amelia package and associated papers by Gary King, James Honaker, and others, and Rubin's work are all worth considering. In a purely univariate analysis, there is no information to use to impute data outside of what is observed on that variable. This is not a simple topic and, as a rule, I refrain from oversimplifying problems, especially when I have limited or no understanding of the specifics, as is the case here.
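To see why there is nothing to gain in the purely univariate case, here is a tiny sketch in plain Python with made-up values (36 observed, 12 missing). Filling the gaps from the variable's own observed mean leaves the mean where it was and only makes the spread look smaller than it is.

```python
import statistics

# Hypothetical sample with 36 observed values and 12 missing ones.
observed = [float(v) for v in range(1, 37)]
n_missing = 12

# Mean imputation: the only information available is the variable itself.
mean_obs = statistics.fmean(observed)
filled = observed + [mean_obs] * n_missing

print(f"mean, observed only: {mean_obs:.2f}")
print(f"mean, after filling: {statistics.fmean(filled):.2f}")
print(f"sd, observed only:   {statistics.stdev(observed):.2f}")
print(f"sd, after filling:   {statistics.stdev(filled):.2f}")
```

The estimate of the mean is unchanged, while the standard deviation shrinks artificially, so any test built on the filled data would be overconfident without having learned anything new.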