Handling of Missing Data

Budhiraja, Pooja MD1; Kaplan, Bruce MD2; Mustafa, Reem A. MD, MPH, PhD1

doi: 10.1097/TP.0000000000002865

Missing data can introduce bias due to systematic differences between the observed and unobserved data, which can compromise the strength of evidence. Participants may withdraw from trials because of adverse events, lack of efficacy, poor tolerability, or disease severity, any of which can bias the findings. Biased results from missing data can mislead physicians and guideline developers. In this commentary, we focus on the implications of missing outcome data (MOD).

The presence of MOD is a frequent problem that affects our certainty in the results. The median loss to follow-up in published trials of antithrombotic agents has been reported as 0.9%.1 However, after reviewing documents available to the Food and Drug Administration, Marciniak et al1 reported that the loss to follow-up rate was actually 13%. This suggests that MOD might be more common than reported. A systematic review of randomized controlled trials (RCTs) published in the top 5 general medical journals reported that one-third of the trials lost statistical significance under different plausible assumptions about the MOD.2

The majority of transplant trials focus on comparing the effects of different interventions, including immunosuppressants, on acute rejection, graft function, and survival. These drugs have a narrow therapeutic index, and toxicity could lead participants to drop out. Additionally, patient-important outcomes such as patient survival and graft survival require long follow-up, which can also increase the chance of MOD.

The Belatacept trial, one of the important long-term studies in the transplantation literature, assigned 666 subjects to Belatacept or cyclosporine, of whom 45% discontinued the medication, including 7% of subjects who died.3 The trial reported a 43% reduction in the risk of death or graft loss, along with a higher risk of posttransplant lymphoproliferative disorder, with Belatacept. The high dropout rate hinders accurate conclusions about the safety of Belatacept. The mean estimated glomerular filtration rate (eGFR) increased over the 7-year period with Belatacept but declined with cyclosporine, despite higher rates of severe acute rejection with Belatacept. One plausible explanation for the improvement in eGFR, other than the calcineurin inhibitor-sparing effect, is that participants who dropped out might have had worse eGFR, so that the average eGFR increased once they were removed from the analysis. Similarly, other calcineurin inhibitor-sparing trials comparing Sirolimus with cyclosporine have reported improvement in eGFR with Sirolimus, but that group had a high dropout rate (20%–40%), mainly due to adverse effects.4,5
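The selective-dropout explanation can be illustrated with a small simulation. The sketch below is not based on any trial data: the eGFR distribution, the 45 mL/min threshold, and the dropout probabilities are all hypothetical values chosen for illustration. The true mean eGFR does not change over time, yet the mean among completers is higher because participants with low eGFR leave more often.

```python
import random
import statistics

def simulate_selective_dropout(n=5000, seed=0):
    """Single arm with no true change in eGFR, but more dropout at low eGFR.

    All numbers are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    baseline = [rng.gauss(60.0, 15.0) for _ in range(n)]
    # No true treatment effect: final value = baseline + measurement noise.
    final = [b + rng.gauss(0.0, 5.0) for b in baseline]
    # Assumed dropout rule: 60% dropout if final eGFR < 45, otherwise 10%.
    completers = [f for f in final
                  if rng.random() > (0.6 if f < 45.0 else 0.1)]
    return (statistics.mean(baseline),
            statistics.mean(final),
            statistics.mean(completers),
            len(completers))

base_mean, true_final_mean, completer_mean, n_completers = simulate_selective_dropout()
print(f"baseline mean eGFR:          {base_mean:.1f}")
print(f"true final mean (everyone):  {true_final_mean:.1f}")
print(f"final mean (completers):     {completer_mean:.1f} (n={n_completers})")
```

An analysis restricted to completers would report an apparent rise in eGFR even though no participant improved on average.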

RCTs require significant resources, and their results are the least prone to bias when conducted well. However, because MOD occurs frequently, the National Research Council report provides guidance on designing and conducting RCTs to avoid MOD, along with suggestions for handling it.6,7 Similarly, the Consolidated Standards of Reporting Trials provides authors with a checklist of items that should be included in reports of RCTs and a diagram for documenting the flow of participants through a trial, along with reasons for dropout, to ensure transparent and complete reporting.8

When designing a trial, one should consider the possible reasons for MOD, have strategies to avoid it, and have a plan for adjusting for it if it happens. Efforts should focus on retaining subjects in trials, obtaining their consent for collection of data on reasons for dropout, and monitoring subsequent treatment and side effects even after dropout. Increasing the sample size or conducting an intention-to-treat analysis does not adjust for the reasons for MOD. Evidently, dropout because a participant moved to another city has a different bearing on the results than dropout due to side effects. If MOD occurs despite all efforts, statistical methods are available to adjust for it. Using them requires an understanding of the mechanism of MOD, which can be missing completely at random (MCAR), missing at random (MAR), or not missing at random. Table 1 summarizes definitions and examples of these mechanisms.9

Table 1. Definitions, examples, and ways of handling missing data based on the mechanisms
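The practical difference between the three mechanisms can be seen in a short simulation. The following is a minimal sketch under assumed values: the covariate `x` (for example, 1 = severe disease), the outcome model, and the missingness probabilities are hypothetical. It compares the complete-case mean with the true mean when missingness is unrelated to the data (MCAR), depends only on the observed covariate (MAR), or depends on the unobserved outcome itself (not missing at random).

```python
import random
import statistics

def complete_case_means(n=20000, seed=1):
    """Complete-case mean of an outcome under three missingness mechanisms.

    x is a fully observed binary covariate and y is the outcome.
    All parameters are hypothetical.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 50.0 - 10.0 * x + rng.gauss(0.0, 10.0)
        data.append((x, y))

    def observed_mean(p_missing):
        # Keep each outcome with probability 1 - p_missing(x, y).
        kept = [y for x, y in data if rng.random() >= p_missing(x, y)]
        return statistics.mean(kept)

    return {
        "true": statistics.mean(y for _, y in data),
        # MCAR: missingness is unrelated to x or y.
        "MCAR": observed_mean(lambda x, y: 0.3),
        # MAR: missingness depends only on the observed covariate x.
        "MAR": observed_mean(lambda x, y: 0.5 if x == 1 else 0.1),
        # Not missing at random: missingness depends on the outcome itself.
        "NMAR": observed_mean(lambda x, y: 0.5 if y < 45.0 else 0.1),
    }

means = complete_case_means()
for label, value in means.items():
    print(f"{label:>5}: {value:.2f}")
```

Under MCAR the complete-case mean stays close to the truth. Under MAR it is biased, but the bias can be removed by conditioning on `x`. When the data are not missing at random, the observed data alone cannot correct the bias.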


There are different approaches for handling MOD, which vary in their assumptions about the MOD mechanism. The common approaches include (1) complete case analysis; (2) imputation (single or multiple); and (3) other models, such as inverse probability weighting; the doubly robust method, which combines inverse probability weighting with a predictive imputation model; likelihood-based methods; and the Cox proportional hazards model for survival data.7
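As an illustration of inverse probability weighting, the sketch below weights each participant with an observed outcome by the inverse of the estimated probability of being observed. It assumes a single categorical covariate `x` and estimates that probability as the observed fraction within each stratum. A real analysis would usually fit a response model such as logistic regression. The function name and the data are hypothetical.

```python
def ipw_mean(records):
    """Inverse-probability-weighted mean of an outcome.

    records: list of (x, y) pairs, with y = None when the outcome is missing.
    Each observed outcome gets weight 1 / P(observed | x), where the
    probability is estimated by the observed fraction within the stratum x.
    """
    totals, seen = {}, {}
    for x, y in records:
        totals[x] = totals.get(x, 0) + 1
        if y is not None:
            seen[x] = seen.get(x, 0) + 1

    weighted_sum = 0.0
    weight_total = 0.0
    for x, y in records:
        if y is not None:
            weight = totals[x] / seen[x]  # 1 / estimated P(observed | x)
            weighted_sum += weight * y
            weight_total += weight
    return weighted_sum / weight_total

# Hypothetical data: stratum 1 has worse outcomes and more missing values.
records = [(0, 50.0), (0, 52.0), (0, 48.0), (0, 50.0), (0, None),
           (1, 40.0), (1, 38.0), (1, None), (1, None), (1, None)]

observed = [y for _, y in records if y is not None]
complete_case = sum(observed) / len(observed)
weighted = ipw_mean(records)
print(f"complete-case mean: {complete_case:.2f}")
print(f"IPW mean:           {weighted:.2f}")
```

In this toy dataset the complete-case mean is about 46.33, whereas the weighted mean is 44.5. Both strata have five participants, so the weights restore their equal shares even though stratum 1 lost more outcomes.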

Complete case analysis assumes MCAR and includes only those with observed outcome data in the analysis. This discards potentially valuable information about the incomplete cases, and the MCAR assumption may be hard to justify. Investigators can instead impute values for MOD rather than ignoring them. Single imputation (SI) assigns a single imputed value to each missing value. SI can be (1) regression imputation, which uses regression analysis to impute values from the available data; (2) the hot deck method, which imputes values from similar cases; and (3) last observation carried forward (LOCF) for repeated measures.7 LOCF assumes that those who withdrew did not change after their last follow-up. This is not scientifically justified, as early dropout can leave the groups imbalanced and LOCF ignores the time-dependent response to treatment. For example, suppose a trial assesses the effect of a new drug on kidney function and GFR is checked at day 0 and yearly for 3 years. If a subject drops out after the year 2 visit, then the GFR from year 2 will be imputed for year 3. This is problematic: if the drug is associated with improvement in GFR over time, the results may underestimate its effect. Additionally, LOCF can miss important side effects if they occur after the dropout. SI should not be used unless it is based on a scientifically justified assumption; otherwise, the imputed data can be biased, depending on the assumption, and can overstate statistical precision. SI has occasionally been used as a conservative approach, for example, by assuming the worst-case scenario in the treatment group. Multiple imputation (MI) is the favored method because it takes into account the uncertainty associated with missing values.
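The GFR example can be written out as a few lines of code. This is a minimal sketch of LOCF for one participant's repeated measures. The helper name `locf` and the GFR values are hypothetical.

```python
def locf(values):
    """Last observation carried forward.

    Replaces each missing value (None) with the most recent earlier
    observed value. A leading None stays None because nothing precedes it.
    """
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical GFR at day 0, year 1, year 2, and year 3 (missing after dropout).
gfr = [45.0, 50.0, 55.0, None]
gfr_locf = locf(gfr)
print(gfr_locf)
```

Here the year 3 value is filled in as 55.0. If the drug had continued to improve GFR at the same rate, the unobserved year 3 value would have been closer to 60, so LOCF would understate the improvement.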

MI is a commonly used approach for handling MOD. An MI model generates S plausible values for each missing value, thereby producing S completed datasets that incorporate the uncertainty about the MOD.10 It uses an analytical model to predict each missing value from the baseline, covariate, and available outcome data. The S datasets are then analyzed separately, and the results are combined into a single inference. Imputation is generally conducted across the whole cohort and assumes that the MOD is missing at random. However, imputation may also be carried out separately for each randomized group. This provides a less biased average treatment effect and is useful when the groups are expected to differ, for example, in disease severity that may have affected the MOD.11 MI performs well for both univariate outcomes (binary or continuous) and multivariate outcomes (longitudinal data with repeated measures of the outcome within individuals). The 2 common MI models for univariate outcomes are fully conditional specification and multivariate normal imputation.12 Fully conditional specification uses a univariate imputation model for each incomplete variable, applying logistic regression to impute incomplete binary variables and linear regression for continuous variables. Multivariate normal imputation assumes that all variables in the imputation model follow a multivariate normal distribution. For multivariate outcomes, imputation models such as mixed effects models and hierarchical linear models can be used.13 Likelihood-based estimation of a linear mixed model is a popular alternative to MI for handling MOD in a multivariate outcome.
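The generate, analyze, and combine steps can be sketched in code. The example below is an illustration under stated assumptions, not the procedure of any cited paper. It imputes a continuous outcome from one baseline covariate by linear regression, approximates parameter uncertainty by bootstrapping the observed cases (a simplification of a fully Bayesian imputation model), and pools the S estimates with Rubin's rules, in which the total variance is the within-imputation variance plus (1 + 1/S) times the between-imputation variance. All names and simulated values are hypothetical.

```python
import random
import statistics

def fit_line(pairs):
    """Least-squares fit of y = a + b*x; returns (a, b, residual SD)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    return a, b, (rss / (len(pairs) - 2)) ** 0.5

def multiply_impute_mean(records, n_imputations=20, seed=0):
    """Pooled mean outcome and its standard error from S imputed datasets."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in records if y is not None]
    estimates, variances = [], []
    for _ in range(n_imputations):
        # Refit on a bootstrap sample so imputations vary with model uncertainty.
        boot = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line(boot)
        # Impute prediction plus random residual noise for each missing outcome.
        completed = [y if y is not None else a + b * x + rng.gauss(0.0, sd)
                     for x, y in records]
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    # Rubin's rules: pooled estimate and total variance.
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / n_imputations) * between
    return pooled, total_variance ** 0.5

# Hypothetical cohort: outcome depends on baseline x; low x means more dropout (MAR).
rng = random.Random(42)
full = []
for _ in range(2000):
    x = rng.gauss(60.0, 15.0)
    full.append((x, 10.0 + 0.8 * x + rng.gauss(0.0, 5.0)))
records = [(x, None if rng.random() < (0.6 if x < 50.0 else 0.1) else y)
           for x, y in full]

true_mean = statistics.mean(y for _, y in full)
cc_mean = statistics.mean(y for _, y in records if y is not None)
mi_mean, mi_se = multiply_impute_mean(records)
print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
print(f"MI pooled mean:     {mi_mean:.2f} (SE {mi_se:.2f})")
```

Because missingness here depends only on the observed baseline value, the pooled MI estimate should sit close to the full-data mean while the complete-case mean is shifted upward. In practice, established software implementations of MI are preferable to hand-written code.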

The ability to incorporate covariates and auxiliary variables is considered one of the key features of MI. However, missing covariate values are common, especially among dropouts. Covariates can themselves be imputed, but under these conditions some of the value of MI may be lost, and other methods of analysis may be preferable.

When MOD is extensive, sensitivity analysis is recommended to assess the robustness of the inference to different assumptions. Authors should be transparent about the assumptions made in handling MOD. These assumptions are subjective, and no statistical method of correction can guarantee an unbiased estimate of the true effect.
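One simple form of sensitivity analysis for a binary outcome is to recompute the treatment effect under several explicit assumptions about the participants with missing outcomes. The sketch below uses hypothetical counts and scenario names. It reports the risk difference (treatment minus control) for an unfavorable event under a complete-case analysis and under three assumptions about the missing participants.

```python
def risk_difference(events_t, n_t, events_c, n_c):
    """Risk of the event in the treatment arm minus risk in the control arm."""
    return events_t / n_t - events_c / n_c

def sensitivity_scenarios(events_t, obs_t, miss_t, events_c, obs_c, miss_c):
    """Risk differences under different assumptions about missing outcomes.

    events_*: observed events; obs_*: participants with observed outcomes;
    miss_*: participants with missing outcomes (t = treatment, c = control).
    """
    n_t, n_c = obs_t + miss_t, obs_c + miss_c
    return {
        # Ignore participants with missing outcomes.
        "complete_case": risk_difference(events_t, obs_t, events_c, obs_c),
        # Assume no participant with a missing outcome had the event.
        "no_missing_had_event": risk_difference(events_t, n_t, events_c, n_c),
        # Worst case for treatment: only missing treated participants had the event.
        "worst_case_for_treatment": risk_difference(events_t + miss_t, n_t,
                                                    events_c, n_c),
        # Best case for treatment: only missing control participants had the event.
        "best_case_for_treatment": risk_difference(events_t, n_t,
                                                   events_c + miss_c, n_c),
    }

# Hypothetical trial: 100 randomized per arm, 20 per arm with missing outcomes.
scenarios = sensitivity_scenarios(events_t=16, obs_t=80, miss_t=20,
                                  events_c=24, obs_c=80, miss_c=20)
for name, value in scenarios.items():
    print(f"{name:>26}: {value:+.2f}")
```

With these counts the risk difference ranges from -0.28 in the best case to +0.12 in the worst case, compared with -0.10 in the complete-case analysis. The apparent benefit is therefore not robust to the assumptions made about the missing participants.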

The influence of MOD should be kept in mind while designing, conducting, and analyzing trials. While MOD may not be completely avoidable, documenting the reasons for missingness and following up dropouts can help investigators understand MOD mechanisms, formulate assumptions, and conduct valid analyses. Clinicians and decision makers should be aware of the impact of MOD on the evidence before incorporating it into practice.


1. Marciniak TA, Cherepanov V, Golukhova E, et al. Drug discontinuation and follow-up rates in oral antithrombotic trials. JAMA Intern Med. 2016; 176:257–259
2. Akl EA, Briel M, You JJ, et al. Potential impact on estimated treatment effects of information lost to follow-up in randomised controlled trials (LOST-IT): systematic review. BMJ. 2012; 344:e2809
3. Vincenti F, Rostaing L, Grinyo J, et al. Belatacept and long-term outcomes in kidney transplantation. N Engl J Med. 2016; 374:333–343
4. Gatault P, Bertrand D, Büchler M, et al. Eight-year results of the SPIESSER study, a randomized trial comparing de novo sirolimus and cyclosporine in renal transplantation. Transpl Int. 2016; 29:41–50
5. Büchler M, Caillard S, Barbier S, et al; SPIESSER Group. Sirolimus versus cyclosporine in kidney recipients receiving thymoglobulin, mycophenolate mofetil and a 6-month course of steroids. Am J Transplant. 2007; 7:2522–2531
6. Little RJ, D’Agostino R, Cohen ML, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012; 367:1355–1360
7. The Prevention and Treatment of Missing Data in Clinical Trials. 2010. Washington (DC)
8. Moher D, Hopewell S, Schulz KF, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ. 2010; 340:c869
9. Rubin DB. Inference and missing data. Biometrika. 1976; 63:581–590
10. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007; 16:219–242
11. Sullivan TR, White IR, Salter AB, et al. Should multiple imputation be the method of choice for handling missing data in randomized trials? Stat Methods Med Res. 2018; 27:2610–2626
12. Sullivan TR, Lee KJ, Ryan P, et al. Multiple imputation for handling missing outcome data when estimating the relative risk. BMC Med Res Methodol. 2017; 17:134
13. Lang KM, Little TD. Principled missing data treatments. Prev Sci. 2018; 19:284–294
Copyright © 2020 Wolters Kluwer Health, Inc. All rights reserved.