Imputing Values with MICE

How to use MICE to impute values.

There are three techniques usually used to deal with missing values:

  • Ignore them and just use complete cases. This is acceptable if <5% of cases are incomplete and missingness is random (Azur, 2004)

  • Impute values with their mean, median, or mode. A mean imputation leaves the overall mean unchanged, but artificially reduces the variable’s variance (Alice, 2018).

  • Impute values with multivariate imputation by chained equations (MICE). This method creates multiple predictions of each missing value, allowing the researcher to account for uncertainty in the imputations.

This tutorial explains how to implement MICE using the mice R package.

How MICE works

MICE is a multiple imputation technique. It assumes missing values are “missing at random”, meaning that after controlling for all the variables in the data, any remaining missingingness is completely random. If the data fails this assumption, MICE produces biased estimates.

MICE works by fitting a series of regression models of each variable conditioned on the other variables. The “chained equations” aspect of MICE is four steps.

  1. Impute simple placeholder values for all missing data (e.g., using the variable mean).
  2. For each variable with missing values, remove its placeholder values, fit a regression model conditioned on the other data variables, then use the model to replace the missing values with the model predictions.
  3. Repeat step 2) for a number of cycles, usually 5, each cycle replacing the missing values with improved estimates. By the last cycle, the imputations should be stable.

The “multiple” aspect of MICE is repeating the procedure multiple times, usually 5, to produce a range of imputed values. From this set of values, the researcher can either choose one (the complete() function in mice defaults to choosing the first data set), or pooling the imputations.

Important Considerations

Only include variables that are either in your analysis model (including the response variable) or that are related to the variables that will receive imputations. E.g., if you are imputing height, then exclude subject surname from the MICE procedure. If your analysis model includes interaction terms, include them as well. The MICE model should be more general than your model.

If a variable can be constructed from other variables, then impute the other variable values instead of the summary variable. E.g., body mass index (BMI) is a summary of height and weight. If you have height and weight as variables, impute them then derive BMI rather than imputing BMI directly.

MICE in R

The following case study is based on the vignette’s on the mice package github site. nhanes is a 25x4 data set in the mice package with missing values in several cols.

library(tidyverse)
library(mice)

skimr::skim(nhanes)
Table 1: Data summary
Name nhanes
Number of rows 25
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 1.76 0.83 1.0 1.00 2.00 2.00 3.0 ▇▁▅▁▃
bmi 9 0.64 26.56 4.22 20.4 22.65 26.75 28.92 35.3 ▇▅▆▃▃
hyp 8 0.68 1.24 0.44 1.0 1.00 1.00 1.00 2.0 ▇▁▁▁▂
chl 10 0.60 191.40 45.22 113.0 185.00 187.00 212.00 284.0 ▃▁▇▃▁

You might start with an exploration of the missingness pattern. mice::md.pattern() helps you investigate the structure of missing values. The matrix is the missingness pattern (red = missing, blue = not missing). The left column of numbers is the count of instances of the pattern in the data. The right column of numbers is the number of variables in the pattern with missing values (i.e., the number of red boxes). The bottom row of numbers is the count of rows in the data with the number of missing values for the variable. nhanes has 13 rows with complete cases, three rows where only chl is null, etc. chl, hyp, and bmi are simultaneously null 7 times. In all, chl is null 10 times.

mice::md.pattern(nhanes) %>% invisible()

You can simply pass the data frame into mice() to produce the imputations, then complete() to return a new data frame with the missing values replaced by the imputations. mice::mice() returns an object of type multiply imputed data set (mids) that includes the original data, the imputed values, and other metadata. The mice() defaults are to run 5 iterations (maxit = 5) per imputation and m = 5 imputations per missing value. The imputed values are stored as vectors in a list object, imp$imp. Function complete() returns a new data frame with missing values replaced by their imputed value.

imp <- mice::mice(nhanes, print = FALSE, m = 1)

nhanes_2 <- complete(imp)

That’s technically all you have to do, dat %>% mice() %>% complete(). But there are a few things more you should know.

Fine Tune your Analysis

First, you may not want to use all the variables in your data frame to estimate the missing values. E.g., you may have an id column or a character descriptor. You can specify the vars to use with the predctorMatrix parameter and quickpred() utility function. The quickpred() utility function creates a matrix of variables where 1 means it is used in the prediction of the other variable. Below, bmi and hyp are excluded from the list of variables contributing to imputations (although they receive imputed values).

Also, you can set a seed value to guarantee reproducibility.

# What it looks like
quickpred(nhanes, exclude = c("bmi", "hyp"))
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   0   1
## hyp   1   0   0   1
## chl   1   0   0   0
# In use
imp <- nhanes %>% 
  mice(
    predictorMatrix = quickpred(nhanes, exclude = c("bmi", "hyp")), 
    print = FALSE, 
    seed = 123
  )

You can also evaluate your imputations by inspecting the iteration process to see if the imputed values converged to stable values. The plot() of the mids object shows the mean (left col) and standard deviation (right) col of each variable (rows) imputation iteration. A separate line plot is produced for each imputation. Below you see the 5 iterations (default) along the x-axis and the 5 imputations (default) as colors. The mean bmi ranged from 25-29 in the first iteration and ended at 25-28 by the fifth iteration.

plot(imp)

You can produce a density plot to compare the imputation distribution (red) to the non-imputation distribution. They should overlap.

densityplot(imp)

Parting Thoughts

Even though you produce 5 imputations by default, complete() only pulls the first one for you by default. Your can use the others for diagnostics like the plot() above.

The regression model applied to each variable can be independent (OLS for numeric, logistic for binary, Poisson for count, etc.), or even a non-regression model such a classification tree. The mids object returned by mice() tells you which it chose. You can override with the method or defaultMethod parameters.

If you just need a mean or median imputation, mice() can do that for you. Set maxiter = 1 and m = 1 and method = "mean".

imp <- nhanes %>% 
  mice(
    m = 1,
    method = "mean",
    maxit = 1,
    print = FALSE
  )

Examples of MICE in action:

References

Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011 Mar;20(1):40-9. doi: 10.1002/mpr.329. PMID: 21499542; PMCID: PMC3074241. html.

Alice, Michy. Imputing Missing Data with R; MICE package. Published Oct 4, 2015, Updated May 14, 2018. html.

mice R package on GitHub with links to vignettes. github

van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03. jstatsoft.

 Share!

 
comments powered by Disqus