To the Editor:
The impact of missing data on the validity of results has often been overlooked in the medical literature.1 Missing data are usually classified as ignorable (including missing completely at random [MCAR] and missing at random [MAR]), and nonignorable (missing not-at-random [MNAR]).2 It is not possible to distinguish between ignorable and nonignorable missing data using observed data.
For analysis of ignorable missing data, multiple imputation, a relatively flexible and general-purpose approach, is available in standard statistical software2 and is increasingly used. In contrast, the MNAR hypothesis is rarely explored, although sensitivity analyses using various models for the nonresponse mechanism have been recommended.2 The mixture modeling principle, which assumes that the variables of interest have different distributions depending on whether their values are missing or observed, is an attractive tool.3 Examining how variations in the imputation model affect the overall results helps to assess their robustness.
One reason for not systematically performing sensitivity analysis could be the lack of simple tools for implementation. We propose a package with an adaptation of the “mice” function from R® software to easily perform sensitivity analysis under various scenarios of nonresponse mechanisms.4,5
The MICE algorithm allows multiple imputation for data sets with mixed types of variables (continuous, binary, categorical). We propose a strategy in 3 steps:
- Fit an imputation model under the ignorable-missing-data hypothesis by calling the function “mice.”
- Modify the imputation model by specifying supplementary parameters θ as arguments to the “sens.mice” function. For binary or categorical variables, θ is the odds ratio comparing the odds of the modality of interest among subjects with a missing value with the odds among subjects without a missing value. For continuous variables, θ is the difference in expected values between these two groups. For standardized variables, this difference can be expressed as a coefficient of variation.
- Impute the missing data using the function sens.mice, resulting in a “mids” object which contains the newly imputed data sets.
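The role of θ in the second step can be illustrated with a short sketch. The Python code below is not the SensMice implementation and does not reproduce its interface; the function names and the normal model for continuous variables are assumptions made for illustration only. It shows how an imputation probability or mean fitted under the ignorable hypothesis might be shifted by θ before a value is drawn for a nonresponder.

```python
import random


def shift_probability(p_mar, odds_ratio):
    """Multiply the odds implied by p_mar by odds_ratio and return the new probability.

    p_mar is the probability of the modality of interest predicted under the
    ignorable (MAR) imputation model; odds_ratio plays the role of theta.
    """
    if not 0.0 < p_mar < 1.0:
        raise ValueError("p_mar must lie strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    odds = p_mar / (1.0 - p_mar) * odds_ratio
    return odds / (1.0 + odds)


def impute_binary(p_mar, odds_ratio, rng):
    """Draw a 0/1 value for a nonresponder using the shifted probability."""
    return 1 if rng.random() < shift_probability(p_mar, odds_ratio) else 0


def impute_continuous(mean_mar, sd, delta, rng):
    """Draw a continuous value whose expected value is shifted by delta.

    The normal distribution is an assumption of this sketch, not of the package.
    """
    return rng.gauss(mean_mar + delta, sd)


if __name__ == "__main__":
    rng = random.Random(2010)
    # Under MAR the model predicts a 30% chance of the modality of interest;
    # theta = 2 doubles the odds for subjects with a missing value.
    print(shift_probability(0.30, 2.0))
    print(impute_binary(0.30, 2.0, rng))
    print(impute_continuous(0.0, 1.0, 0.5, rng))
```

Setting the odds ratio to 1 or the difference to 0 leaves the ignorable-model imputation unchanged, which gives a convenient check that a scenario reduces to the MAR analysis.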
In the absence of historical data on the missing-data mechanism, we suggest looking for parameter values that modify the overall conclusions, ie, values that change the test results or move a point estimate outside the original confidence interval. These values should be consistent with the targeted exposure effect and should correspond to reasonable hypotheses supported by epidemiologic evidence.
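This search can be sketched as a scan over a grid of candidate parameter values. The Python code below is a minimal illustration under stated assumptions: `pooled_estimate` is a hypothetical callable standing in for "impute under θ, fit the analysis model, and pool the results", and the toy linear relation in the usage example is invented purely to make the sketch runnable.

```python
def tipping_point(pooled_estimate, thetas, ci_low, ci_high):
    """Return the first theta whose pooled estimate leaves [ci_low, ci_high].

    pooled_estimate: hypothetical callable mapping theta to a point estimate.
    thetas: candidate values, scanned in the order given.
    Returns None if no candidate moves the estimate outside the interval.
    """
    for theta in thetas:
        estimate = pooled_estimate(theta)
        if not (ci_low <= estimate <= ci_high):
            return theta
    return None


if __name__ == "__main__":
    # Toy stand-in: the estimate decreases linearly as theta grows.
    def toy_estimate(theta):
        return 2.0 - 0.5 * theta

    grid = [0.0, 0.5, 1.0, 1.5, 2.0]
    print(tipping_point(toy_estimate, grid, ci_low=1.2, ci_high=2.5))
```

Whether the value returned is plausible still has to be judged against subject-matter knowledge, as described above; the scan only locates where the conclusions would change.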
As an illustrative example, we apply a sensitivity analysis supposing MNAR data to the data set “CHAIN”6 included in the package “mi.”7 We present results from 4 imputation models that relate poor mental health (measured as a binary variable) to self-reported viral load (taken as either a binary or a three-level variable), adjusted for 5 covariates. We assumed that nonresponders were more likely to have a high viral load than responders. The magnitude of the assumed variations was in the range of the odds ratio between viral load and mental health observed among complete cases. We compared the results with the complete-case analysis and with the multiple imputation analysis assuming MAR data (Table). The resulting odds ratio was decreased but still substantial. Estimates were robust to the explored MNAR scenarios: conclusions were unchanged compared with multiple imputation under MAR, and the magnitude of the effects was smaller than the assumed variations.
The “SensMice” package is provided as an eAppendix (http://links.lww.com/EDE/A455).
Both the package and the reference manual are freely available at: http://lertim.fr/Members/rgiorgi/DossierPublic/fonctions-r-s/.
A sensitivity analysis under different scenarios of nonresponse mechanism can be easily performed using the proposed package. It is of particular interest when the mechanism is strongly suspected to be nonignorable, eg, for self-reported characteristics such as psychologic disorders, quality of life, or income.8
We thank David Hajage and Yann De Rycke for their helpful contributions and Michel Chavance for stimulating discussions.
LERTIM, EA 3283
1. Klebanoff MA, Cole SR. Use of multiple imputation in the epidemiologic literature. Am J Epidemiol.
2. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons; 1987.
3. Kmetic A, Joseph L, Berger C, Tenenhouse A. Multiple imputation to account for missing data in a survey: estimating the prevalence of osteoporosis. Epidemiology.
4. R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. Available at: http://www.R-project.org
5. Van Buuren S, Groothuis-Oudshoorn K. MICE: Multivariate Imputation by Chained Equations library. R package version 2.3.; 2010.
7. Gelman A, Hill J, Su Y-S, Yajima M, Pittau MG. MI: Missing Data Imputation and Model Checking. R package version 0.09-11; 2010.
8. Bono C, Ried LD, Kimberlin C, Vogel B. Missing data on the Center for Epidemiologic Studies Depression Scale: a comparison of 4 imputation techniques. Res Social Adm Pharm.