Inverse Probability of Treatment Weighting and Confounder Missingness in Electronic Health Record-based Analyses: A Comparison of Approaches Using Plasmode Simulation

Vader, Daniel T.a; Mamtani, Ronacb; Li, Yuna; Griffith, Sandra D.c; Calip, Gregory S.c; Hubbard, Rebecca A.a

Epidemiology 34(4):p 520-530, July 2023. | DOI: 10.1097/EDE.0000000000001618


Background
Electronic health record (EHR) data represent a critical resource for comparative effectiveness research, allowing investigators to study intervention effects in real-world settings with large patient samples. However, high levels of missingness in confounder variables are common, challenging the perceived validity of EHR-based investigations.

Methods
We investigated performance of multiple imputation and propensity score (PS) calibration when conducting inverse probability of treatment weights (IPTW)-based comparative effectiveness research using EHR data with missingness in confounder variables and outcome misclassification. Our motivating example compared effectiveness of immunotherapy versus chemotherapy treatment of advanced bladder cancer with missingness in a key prognostic variable. We captured complexity in EHR data structures using a plasmode simulation approach to spike investigator-defined effects into resamples of a cohort of 4361 patients from a nationwide deidentified EHR-derived database. We characterized statistical properties of IPTW hazard ratio estimates when using multiple imputation or PS calibration missingness approaches.

Results
Multiple imputation and PS calibration performed similarly, maintaining ≤0.05 absolute bias in the marginal hazard ratio even when ≥50% of subjects had missing at random or missing not at random confounder data. Multiple imputation required greater computational resources, taking nearly 40 times as long as PS calibration to complete. Outcome misclassification minimally increased bias of both methods.

Conclusions
Our results support multiple imputation and PS calibration approaches to missing completely at random or missing at random confounder data in EHR-based IPTW comparative effectiveness analyses, even with missingness ≥50%. PS calibration represents a computationally efficient alternative to multiple imputation.

Electronic health record (EHR) data represent a critical resource for 21st century comparative effectiveness research, allowing investigators to study intervention effects in real-world settings with large patient samples. The volume of EHR data rapidly increased over the past 2 decades as the proportion of acute care hospitals with basic EHR systems in the US rose from 9% in 2008 to 94% in 2017.1,2 In 2016, the Cures Act mandated improvements to the accessibility and availability of EHR data, further elevating its importance to comparative effectiveness research.3,4

Although advantages of EHR data include quantity and availability, quality presents a persistent challenge. High levels of missingness in EHR data are common, and the proportion missing for individual variables can exceed 50% or even 90%.5 Although exposure information (such as prescription medication data) and outcomes (such as mortality) are reasonably well captured in EHR data, missingness in confounder variables (including symptoms, disease severity, and social determinants of health) is common and can lead to biased results when the analytic approach does not adequately handle this missingness.5–8

Consider a study investigating the comparative effects of 2 cancer therapeutics–immunotherapy and carboplatin-based chemotherapy–on overall survival using EHR-derived data.9 In such an analysis, Eastern Cooperative Oncology Group performance status (ECOG PS), which summarizes patient level of functioning,10 is an important baseline confounder as it affects both treatment choice and survival. However, ECOG PS missingness in EHR-derived data is high, sometimes exceeding 50%.11 Analyses that inadequately account for missing data may lead investigators to incorrectly conclude that one treatment approach is inferior or equivalent to another, adversely affecting policy, practice, and patient outcomes in the future.

In comparative effectiveness research, confounder control using propensity score-based inverse probability of treatment weights (IPTWs) is appealing because it targets relationships between confounders and treatment (which may be easier to capture than relationships with outcome12), and directly estimates marginal treatment effect (which can be easier to interpret and communicate than conditional treatment effect).13,14 However, there is limited guidance on which missing data approaches perform adequately (i.e., with minimal bias) and under what conditions when conducting comparative effectiveness research using IPTWs. Perhaps due to this lack of guidance, most investigators either ignore missing data or use crude missing data methods. Among comparative effectiveness articles published between 2016 and 2017 that used propensity score methods and observational data, 55% did not describe a missing data approach and 29% used complete case analysis.15

Multiple imputation (MI) is second only to complete case in popularity in propensity score-based comparative effectiveness research.15 However, investigators may be unfamiliar with the unique considerations of incorporating multiple imputation into IPTW-based propensity score analyses,16–19 even if they are familiar with multiple imputation in other contexts. Propensity score (PS) calibration20–22 is a seldom-employed and less computationally demanding alternative to multiple imputation. PS calibration uses regression calibration23–25 to account for the impact of missing confounder data on estimates of treatment propensity, using patients with complete confounder data as a validation sample for patients with missing confounder data. PS calibration could be a useful alternative when very large patient samples and large numbers of confounders with missing data make multiple imputation computationally infeasible, but it is unclear how well either performs in IPTW-based comparative effectiveness research using EHR data. Moreover, researchers may be hesitant to trust any missing data approach when applied to EHR data with high levels of missingness in important confounders (such as ECOG PS in advanced cancer treatment research).

Prior research has not directly compared the performance of multiple imputation and PS calibration in IPTW-based comparative effectiveness analyses or considered the complex dependencies between variables in EHR data. The primary objective of our study was to compare the performance of complete case, multiple imputation, and PS calibration in the setting of an EHR-based comparative effectiveness analysis. Given that outcome variables in EHR data may be subject to misclassification, we aimed to determine sensitivity of missing data approach performance to levels of mortality outcome misclassification anticipated to occur based on prior validation studies conducted in real-world data.26 The setting features use of an IPTW-based analysis to estimate population level treatment effects, high levels of missing confounder data, and complex missing data patterns. We ground our evaluation of missingness methods in a real-world example in which we wish to estimate the effect of treatment with immunotherapy on survival among patients with advanced bladder cancer. This comparison is motivated by prior research that quantified differences in survival between immunotherapy and chemotherapy treated patients using oncology EHR-derived data featuring high levels of missingness in confounder variables.9


Methods
We simulated data via a plasmode approach,27 which involves resampling from a real-world dataset and spiking in an investigator-defined relation between treatment and outcome. By combining real covariate data with an investigator-defined outcome data generating process, plasmodes maintain some of the complex relationships that exist in the real-world dataset while simultaneously allowing for a user-specified exposure–outcome causal effect.28 Compared with fully synthetic data, this hybrid approach allows comparison of estimates from alternative statistical approaches in data contexts that may better reflect real-world data structures and in which the true causal effect is known, facilitating estimation of bias.

This study was determined to be exempt from review by the University of Pennsylvania Institutional Review Board.

Advanced Bladder Cancer EHR Data

The real-world data foundation for our simulations was the nationwide (US-based) Flatiron Health EHR-derived advanced bladder cancer deidentified database. The Flatiron Health database is a longitudinal database, composed of deidentified patient-level structured and unstructured data, curated via technology-enabled abstraction.29,30 Patients in our cohort began first-line treatment for advanced bladder cancer between January 2011 and June 2020, and the deidentified data originated from ~280 cancer clinics (~800 sites of care) during this period.

The outcome of interest was overall survival, defined as the number of months between first-line treatment initiation, determined from structured EHR data plus abstraction, and death, measured using a composite mortality end point.26 We censored patients at the date of last structured activity recorded in the database. The exposure of interest was treatment with a regimen including an immune checkpoint inhibitor (immunotherapy) versus treatment with a regimen not including a checkpoint inhibitor (chemotherapy). Covariables included the following: ECOG PS, race–ethnicity (non-Hispanic Black, Hispanic, non-Hispanic White, or other), gender (male or female), smoking status (ever vs. never smoker), surgical status (any cystectomy/other relevant surgery or none), and age.

From these potential confounders, our simulations focused primarily on missing ECOG PS. ECOG PS uses a scale from 0 to 5 to summarize a patient’s level of functioning (ability to care for themselves, daily activity, and physical ability). Higher scores are associated with decreased survival and increased probability of treatment with immunotherapy.31 Advanced cancer care patients commonly have missing ECOG PS and the precise causes of missingness in ECOG PS are unknown. In our simulations and analysis, we categorized ECOG PS as 0, 1, or ≥2.

Simulated Data

In summary, generating simulated data sets involved the following: (1) modeling outcome and censoring processes in the originating data; (2) altering the outcome model using an investigator-defined treatment effect; (3) simulating covariable data by sampling with replacement from the observed data; and (4) simulating outcome data based on the sampled covariable data, investigator-specified outcome model, and estimated censoring model. eFigures 1 and 2 depict this process.

In step 1, we modeled outcome and censoring in the originating complete case data using Cox regression, including treatment and all covariables as predictors. In step 2, we began by replacing the treatment effect coefficient in the fitted mortality model with coefficients corresponding to a null (ln[1.0]), weak (ln[0.9]), or strong (ln[0.5]) protective effect (eFigure 1). Next, using the censoring model and altered mortality model, we generated subject-specific Breslow estimates32 of mortality-free and censoring-free survival. These estimates were altered to maintain the event rate observed in the originating data (Supplemental Section 1.1).

In step 3, we generated 1000 data sets of sample size 4,361 (the sample size of the observed data) for each investigator-defined treatment effect strength by sampling with replacement from the observed complete case data. Next, in step 4, we simulated time-to-death based on the subject-specific event-free survival function using the inverse transform method.33 The same process was repeated for time-to-censoring. We used the shorter of time-to-death or time-to-censoring to assign event status and time for each subject.
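The outcome-generation step (step 4) can be sketched as follows. This is a minimal Python/NumPy illustration (the study itself was conducted in R): it substitutes a constant exponential baseline hazard for the paper's subject-specific Breslow estimates, and all parameter values are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_cox_times(X, beta, base_hazard, rng):
    """Inverse-transform sampling of event times under a Cox model with a
    constant (exponential) baseline hazard, a simplification of the paper's
    Breslow-based survival estimates. Since S(t | x) = exp(-h0 * t * exp(x @ beta)),
    solving S(t) = U for uniform U gives t = -log(U) / (h0 * exp(x @ beta))."""
    u = rng.uniform(size=len(X))
    return -np.log(u) / (base_hazard * np.exp(X @ beta))

# Spike in a protective treatment effect of ln(0.5), as in the "strong" scenario
n = 10_000
treat = rng.integers(0, 2, size=n)
X = treat.reshape(-1, 1).astype(float)
beta = np.array([np.log(0.5)])

death_t = simulate_cox_times(X, beta, base_hazard=0.05, rng=rng)
censor_t = simulate_cox_times(X, np.zeros(1), base_hazard=0.03, rng=rng)

# Observed follow-up is the shorter of time-to-death and time-to-censoring
time = np.minimum(death_t, censor_t)
event = (death_t <= censor_t).astype(int)
```

In the paper, sampling covariate rows with replacement from the complete-case cohort (step 3) would precede this step, so that `X` carries the real-data covariate structure rather than a synthetic treatment indicator.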

Generating Missingness

In each plasmode dataset, we simulated missingness in confounder data using the amputation approach for generating multivariate missingness described by Schouten and colleagues.34 The steps were as follows: (1) choose missing data mechanism and marginal probability of missingness; (2) randomly assign all patients to one of the 3 subsets, where each subset represents a different potential missing data pattern; (3) calculate and assign patient-specific probabilities of having missing data based on covariables and assigned missing data pattern subset; and (4) select patients for data removal and remove data. This process is described in detail in Supplemental Section 1.3 and eFigure 3.

We examined 3 marginal missing data probabilities (10%, 50%, and 70%) where data were missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). In step 2, we assigned 80% of patients to subset A, 10% to subset B, and 10% to subset C. Each of these subsets represented a different candidate missing data pattern, meaning that patients would have missing data for a predefined set of confounder variables if selected to have missing data later in step 4. Patients would have missing ECOG PS in subset A; race–ethnicity, smoking status, and surgical status in subset B; or ECOG PS, race–ethnicity, smoking status, and surgical status in subset C. In step 3, we assigned patient-specific missing data probabilities by assigning weights using linear weighted sum scores models and standardizing those weights using a cumulative logistic distribution function (shifted so that the marginal probability of missingness matched the value selected in step 1).34 Predictors in these models included the following: treatment, gender, race–ethnicity, ECOG PS, smoking status, surgery, age, outcome, and/or follow-up time.

We varied the model coefficients to match the desired missing data mechanism within each subset (eTable 1). MAR missingness depended on treatment and outcome, but only depended on confounders when those confounders were observed (not missing) for a given participant. MNAR missingness additionally depended on confounders even when they were not observed (missing) for a given participant. In both MAR and MNAR models, ECOG PS = 1 or 2 had 2 times and ECOG PS = 3 or 4 had 4 times the strength of all other coefficients to increase the influence of this key confounder on missingness.
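The weighted-sum-score amputation for a single pattern can be sketched as below. This is a NumPy illustration (the paper worked in R with the mice implementation of the Schouten approach) with one MAR-style score rather than 3 pattern subsets; the score coefficients are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def ampute(score, p_miss, rng):
    """One pattern of Schouten-style amputation: standardize a weighted sum
    score, pass it through a logistic function, and shift the logistic
    intercept (by bisection) until the marginal probability of missingness
    equals p_miss. Returns a boolean mask of records to set missing."""
    z = (score - score.mean()) / score.std()
    lo, hi = -20.0, 20.0
    for _ in range(60):  # bisection on the intercept shift
        mid = (lo + hi) / 2
        if (1 / (1 + np.exp(-(z + mid)))).mean() < p_miss:
            lo = mid
        else:
            hi = mid
    prob = 1 / (1 + np.exp(-(z + lo)))
    return rng.uniform(size=len(score)) < prob

# MAR-style score: missingness in ECOG PS depends on observed treatment and
# follow-up time (coefficients below are illustrative, not the paper's)
n = 50_000
treat = rng.integers(0, 2, n)
months = rng.exponential(12, n)
mask = ampute(1.5 * treat - 0.1 * months, p_miss=0.5, rng=rng)
```

Patients flagged by `mask` would then have ECOG PS deleted; an MNAR variant would add the (to-be-deleted) confounder itself to the score.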

Generating Outcome Misclassification
Because this study examines a time-to-event outcome, simulating outcome misclassification necessitated modifying both event status and time. Generally, we assumed that false negatives (incorrectly censored observations) would likely be assigned a shorter time-to-censoring than their true time-to-outcome.

Prior work examining the validity of the Flatiron Health bladder cancer mortality variable estimated sensitivity (probability of correctly identifying death among those who had died) at 90.2% and specificity (probability of correctly identifying survival among those who had survived) at 96.6%.26 Using these statistics, we simulated observed outcome status for true events from a Bernoulli distribution with probability given by sensitivity (90.2%) and for true-censored observations with probability given by 1 minus specificity (3.4%). Among patients incorrectly classified as alive, we reduced follow-up time relative to the originally recorded time-to-death by 0, 1, 2, or 3 months (Supplemental Section 1.4). We did not alter follow-up time for false positives, setting event time equal to the originally recorded censoring time.

We simulated both independent misclassification and dependent misclassification within all of the previously described missing data scenarios (eFigure 4). Under independent misclassification, we used the same sensitivity (90.2%) and specificity (96.6%) for all patients. Under dependent misclassification, we maintained a marginal sensitivity of 90.2% but attenuated conditional sensitivity with increasing ECOG values. Patients with ECOG PS = 1 had 0.9 times and patients with ECOG PS ≥2 had 0.8 times the sensitivity of patients with ECOG PS = 0. This reflects a context in which patients with poorer functional status are more likely to have incomplete mortality data captured by the healthcare system. Specificity remained unaltered.
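The independent misclassification step can be sketched as follows (Python rather than the paper's R). One assumption to flag: the 0 to 3 month follow-up reduction for false negatives is drawn uniformly at random here, whereas the paper specifies its scheme in Supplemental Section 1.4.

```python
import numpy as np

rng = np.random.default_rng(7)

def misclassify(event, time, sens, spec, rng):
    """Apply independent outcome misclassification: true deaths are observed
    as deaths with probability `sens`; true-censored records are observed as
    deaths with probability 1 - spec. False negatives (deaths observed as
    censored) have follow-up shortened by 0-3 months."""
    obs_event = np.where(
        event == 1,
        rng.uniform(size=len(event)) < sens,      # kept as death with prob = sensitivity
        rng.uniform(size=len(event)) < 1 - spec,  # false positive with prob = 1 - specificity
    ).astype(int)
    obs_time = time.copy()
    false_neg = (event == 1) & (obs_event == 0)
    obs_time[false_neg] = np.maximum(
        time[false_neg] - rng.integers(0, 4, size=false_neg.sum()), 0
    )
    return obs_event, obs_time

n = 100_000
event = rng.integers(0, 2, n)          # true event status
time = rng.exponential(12, n)          # true follow-up in months
obs_event, obs_time = misclassify(event, time, sens=0.902, spec=0.966, rng=rng)
```

The dependent variant would replace the scalar `sens` with a per-patient vector scaled down for higher ECOG PS categories.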

Statistical Analysis

We estimated the marginal association between immunotherapy and overall survival using IPTW and one of the 3 missing data approaches: complete-case analysis, multiple imputation, or propensity score calibration (PSC).

Complete Case Analysis

Complete case analysis excluded all subjects with any missing data. We used multivariable logistic regression to estimate immunotherapy treatment propensity scores, calculating IPTWs and stabilizing the IPTWs by the marginal probability of treatment.35,36 Variables included in the propensity score model were gender, race–ethnicity, ECOG PS, smoking status, surgery, and age. All variables used their previously described categorizations. We estimated the association between immunotherapy and overall survival using weighted Cox proportional hazards regression (with IPTWs included as weights).
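The stabilized-weight computation described above can be sketched as follows. This NumPy illustration (the analyses themselves were in R) takes fitted propensity scores as given rather than fitting the multivariable logistic regression, and the simulated scores are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def stabilized_iptw(treat, ps):
    """Stabilized inverse probability of treatment weights: the marginal
    treatment probability over the estimated propensity score for treated
    patients, and (1 - marginal) / (1 - ps) for untreated patients.
    Stabilization keeps the mean weight near 1."""
    p_treat = treat.mean()
    return np.where(treat == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# Illustrative propensity scores; in the analysis these come from a
# multivariable logistic regression of treatment on the confounders
n = 20_000
ps = rng.uniform(0.1, 0.9, n)
treat = (rng.uniform(size=n) < ps).astype(int)
w = stabilized_iptw(treat, ps)
```

The weights `w` would then enter a weighted Cox proportional hazards model as case weights to estimate the marginal hazard ratio.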

Multiple Imputation

We imputed missing data using multiple imputation via chained equations with 10 imputations.37 Continuous variables were imputed using predictive mean matching, dichotomous variables with logistic regression, and ordinal variables with proportional odds models. In addition to the primary covariables, the Nelson-Aalen estimator of the cumulative mortality hazard rate and an event indicator were included as predictors in the imputation models.38 Consistent with the within pooling approach to multiple imputation for IPTW models, we estimated stabilized IPTWs for each imputed data set as described earlier and subsequently included them in weighted Cox outcome models.17,18 We did not use the alternative across approach because it has been demonstrated to produce biased results with poor confidence interval coverage.16–18 Finally, we pooled treatment effect estimates and their standard errors across imputations using Rubin’s rules.39
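The final pooling step by Rubin's rules can be sketched as below (NumPy; the estimates are hypothetical log hazard ratios, not results from the paper).

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool point estimates (here, log hazard ratios) across M imputations
    with Rubin's rules: pooled estimate is the mean; total variance is
    within-imputation variance W plus (1 + 1/M) times between-imputation
    variance B."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # pooled log hazard ratio
    w = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)         # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    return q_bar, t

# Hypothetical log hazard ratios and squared SEs from 10 imputations
est = [0.18, 0.20, 0.17, 0.22, 0.19, 0.21, 0.18, 0.20, 0.19, 0.17]
var = [0.004] * 10
q, t = rubin_pool(est, var)
```

Under the within approach, each imputed data set gets its own stabilized IPTWs and weighted Cox fit before this pooling is applied to the resulting coefficients.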

Propensity Score Calibration

The PS calibration approach consisted of 5 steps: (1) calculate error-prone propensity scores, (2) calculate gold-standard propensity scores, (3) calculate calibrated error-prone propensity scores, (4) calculate stabilized IPTWs, and (5) fit the outcome model.20,40 Using multivariable regression, we fit the error-prone model using the entire patient sample but only included confounders with complete data (gender, surgery, and age) as predictors. We fit the gold-standard propensity score model using the full set of confounder variables but only within a validation subset of patients who had complete data across all confounders. We fit the calibrated error-prone model by regressing the gold-standard propensity scores on the error-prone scores and treatment. Stabilized IPTWs were based on the gold-standard scores for patients with complete data and calibrated error-prone scores for patients with incomplete data. As above, we used a weighted Cox proportional hazards model to estimate the marginal association between immunotherapy and overall survival.
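Steps 3 and 4 of the calibration can be sketched with simulated scores (NumPy; in the actual analysis the error-prone and gold-standard propensity scores come from the two logistic regressions described above, and every value here is illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-ins: error-prone PS available for everyone, gold-standard
# PS valid only within the complete-case validation subset
n = 30_000
complete = rng.uniform(size=n) < 0.5                              # complete-case indicator
treat = rng.integers(0, 2, n)
ps_ep = np.clip(rng.normal(0.4 + 0.2 * treat, 0.1), 0.01, 0.99)   # error-prone PS
ps_gs = np.clip(ps_ep + rng.normal(0, 0.05, n), 0.01, 0.99)       # gold-standard PS

# Step 3: regress the gold-standard PS on the error-prone PS and treatment
# within the validation (complete-case) subset, via least squares
X = np.column_stack([np.ones(n), ps_ep, treat])
coef, *_ = np.linalg.lstsq(X[complete], ps_gs[complete], rcond=None)

# Step 4 input: gold-standard PS where observed, calibrated PS elsewhere
ps_cal = np.where(complete, ps_gs, np.clip(X @ coef, 0.01, 0.99))
```

Stabilized IPTWs computed from `ps_cal` would then feed the weighted Cox outcome model, exactly as in the complete case analysis.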

Performance Measures

We described the performance of missing data approaches using bias, mean squared error (MSE), and 95% confidence interval coverage probabilities. Because the simulation model used conditional treatment effects whereas the IPTW Cox models estimated marginal effects, we derived the true marginal effect by simulating a plasmode data set of sample size 100,000 with no missing data, calculating inverse probability of treatment weights using boosted classification and regression trees,41 and using the stabilized weights in a Cox model. We chose boosted classification and regression trees for their ability to identify and accommodate complex relationships between predictors and treatment; due to their computational cost, they were not feasible for use in our primary analyses. Estimated true marginal hazard ratios were 0.53, 0.93, and 1.02. Bias, MSE, and confidence interval coverage probabilities were calculated relative to these true marginal hazard ratios.
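The three performance measures can be computed as in the NumPy sketch below, applied to hypothetical simulation output; the true hazard ratio of 0.93 matches the weak-effect scenario, but the simulated estimates are illustrative.

```python
import numpy as np

def performance(est_log_hr, se, true_hr):
    """Bias, MSE, and 95% CI coverage of hazard ratio estimates across
    simulation iterations; estimates and SEs are on the log scale."""
    hr = np.exp(est_log_hr)
    bias = hr.mean() - true_hr
    mse = ((hr - true_hr) ** 2).mean()
    lo = np.exp(est_log_hr - 1.96 * se)      # Wald CI on the log scale,
    hi = np.exp(est_log_hr + 1.96 * se)      # exponentiated to the HR scale
    coverage = ((lo <= true_hr) & (true_hr <= hi)).mean()
    return bias, mse, coverage

rng = np.random.default_rng(5)
true_hr = 0.93
est = np.log(true_hr) + rng.normal(0, 0.05, 1000)   # 1000 unbiased iterations
bias, mse, cov = performance(est, np.full(1000, 0.05), true_hr)
```

With unbiased estimates and correct standard errors, coverage should sit near the nominal 0.95, which is the benchmark used throughout the Results.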

In addition, we report parameter estimates and model-based standard errors averaged across the 1000 simulation iterations and the empirical standard error calculated as the SD of the estimated effects across the simulation iterations.

Finally, we compared the processing time required by each method using a plasmode-simulated data set with 100,000 subjects on a Windows 10 virtual desktop with 8GB RAM and a 2.6GHz Intel(R) XEON(R) CPU E5-2690 v4. All processes were run using a single thread.

Statistical Programs and Packages

All analyses were conducted in R version 4.0.5 (R Foundation for Statistical Computing, Vienna, Austria). Outcome simulation used the Plasmode package. Data removal and multiple imputation were conducted using the mice package. We calculated boosted classification and regression tree propensity scores using the twang package version 2.3.42 All simulation and analytic code are available at:

Results
Originating Data

In total, 7309 patients in the Flatiron Health database initiated first-line treatment for advanced bladder cancer between January 2011 and October 2021 (Table 1). Twenty-five percent of patients received first-line immunotherapy. Patients were predominantly White (76%), male (74%), and reported having ever smoked (73.3%). As expected, ECOG PS had the highest proportion of missing data (35%). Other variables with missing data included the following: race–ethnicity (8%), smoking status (1%), and gender (0%, 2 patients). After excluding the 2948 patients with missing data (40%), the final input data for the simulations included 4361 patients. Patients treated with immunotherapy experienced poorer overall survival than patients treated with nonimmunotherapy (Figure 1).

TABLE 1. - Summary of the Electronic Health Record-derived Bladder Cancer Cohort Before and After Excluding Observations with Missing Data
Variable Category Full Data Complete Cases
Total 7,309 4,361
Therapy Immunotherapy 1,820 (25%) 1,221 (28%)
Other 5,489 (75%) 3,140 (72%)
Gender Male 5,374 (74%) 3,164 (73%)
Female 1,933 (27%) 1,197 (27%)
Missing 2
Race/ethnicity Black 315 (5%) 204 (5%)
Hispanic 281 (4%) 179 (4%)
White 5,076 (76%) 3,259 (75%)
Other 1,052 (16%) 719 (17%)
Missing 585
ECOG PS 0 1,666 (35%) 1,547 (36%)
1 2,099 (44%) 1,909 (44%)
2–4 1,006 (21%) 905 (21%)
Missing 2,538
Surgery No 3,707 (51%) 2,148 (49%)
Yes 3,602 (49%) 2,213 (51%)
Ever smoker No 1,930 (27%) 1,160 (27%)
Yes 5,288 (73%) 3,201 (73%)
Missing 91
Age Median (IQR) 73 (66–78) 73 (66–78)
Censoring month Median (IQR) 10 (5–20) 10 (5–20)
IQR, interquartile range.

Baseline survival observed in original full data set by treatment group.


Bias in estimated hazard ratios, MSEs, and 95% confidence interval coverage probabilities for each missing data approach, true effect strength, missing data proportion, and missing data mechanism are presented in Figure 2 (bias), Figure 3 (mean squared error), Figure 4 (coverage), and eTable 2 (numeric results).

Bias in marginal hazard ratio estimates by missing data approach, missing data mechanism, simulated effect strength, and proportion of observations with missing data. Within each box, the dot represents estimated bias. The 2.5th, 25th, 50th, 75th, and 97.5th percentiles of the empirical distribution of the difference between true and estimated hazard ratios are given by the lower bound of the bar, lower bound of the box, central line, upper bound of the box, and upper bound of the bar, respectively.
Mean squared error by missing data approach, missing data mechanism, simulated effect strength, and proportion of observations with missing data.
95% confidence interval coverage probabilities by missing data approach, missing data mechanism, simulated effect strength, and proportion of observations with missing data.

When data were MCAR, all methods had low absolute bias (range, 0.01–0.05) with generally near-nominal (0.95) coverage probabilities (range, 0.88–0.96) across simulation scenarios. However, the MSE for complete case (range, 0.0016–0.0129) was consistently higher than for multiple imputation (range, 0.0014–0.0062) or PS calibration (range, 0.0014–0.0053).

When data were MAR, multiple imputation and PS calibration consistently outperformed complete case on all metrics, regardless of true effect strength or proportion missing. Absolute bias when using multiple imputation (range, <0.00–0.05) and PS calibration (range, 0.01–0.03) was lower than when using complete case (range, 0.03–0.15), with differences increasing when data were 50% or 70% missing. MSEs for multiple imputation (range, 0.0013–0.0065) and PS calibration (range, 0.0014–0.0058) were lower than for complete case (range, 0.0024–0.0391) across scenarios. Confidence interval coverage probabilities were consistently below nominal for complete case (range, 0.74–0.89) and sometimes below nominal for multiple imputation (range, 0.87–0.96) and PS calibration (range, 0.92–0.96).

When data were MNAR, multiple imputation and PS calibration again outperformed complete case on all metrics across simulation scenarios. Absolute bias reached a maximum of 0.05 with multiple imputation and PS calibration compared with a maximum of 0.10 with complete case. Again, MSE remained higher for complete case (range, 0.0021–0.0277) than multiple imputation (range, 0.0014–0.0063) or PS calibration (range, 0.0013–0.0073). Confidence interval coverage for PS calibration (range, 0.86–0.96) and multiple imputation (range, 0.88–0.96) dropped slightly below nominal at 50% and 70% missingness, and coverage for complete case (range, 0.85–0.93) remained consistently below nominal.

Empirical standard errors and model-based standard errors approximated one another across missing data approaches and simulation scenarios (eTables 3, 5, and 7).

Outcome Misclassification

Under independent outcome misclassification, effect estimates generally shifted toward the null (eFigure 5 and eTable 4), and we observed similar shifts under misclassification dependent on ECOG PS (eFigure 7 and eTable 6). Under MAR and MNAR, 95% confidence interval coverage probabilities improved for complete case (range, 0.83–0.96) but deteriorated for multiple imputation and PS calibration (range, 0.65–0.96) compared with simulations without misclassification (eFigures 6 and 8). These patterns of improvement and deterioration were expected with an overall shift toward the null given that prior estimates for multiple imputation and PS calibration exhibited minimal bias, while estimates for complete case tended to overestimate effect strength.

Processing Time

Multiple imputation required more computational resources than complete case or PS calibration when run on large datasets (N = 100,000) (Table 2). Considering all missingness mechanisms and levels of missingness, multiple imputation took between 8 minutes 47 seconds and 10 minutes 7 seconds to complete. Complete case analysis and propensity score calibration took a maximum of 16 seconds.

TABLE 2. - Time to Run Models Using a Plasmode-simulated Data Set with 100,000 Subjects on a Windows 10 Virtual Desktop with 8GB RAM and a 2.6GHz Intel(R) XEON(R) CPU E5-2690 v4
Runtime (seconds)
Simulation Hazard Ratio Proportion Missing Complete Case Multiple Imputation Propensity Score Calibration
1 0.1 12 598 16
1 0.5 4 570 15
1 0.7 1 534 14
0.9 0.1 16 607 14
0.9 0.5 6 564 14
0.9 0.7 3 536 13
0.5 0.1 11 604 14
0.5 0.5 3 562 13
0.5 0.7 1 527 13
All processes run with a single thread.

Effect Estimates in Originating Data

eTable 8 provides treatment hazard ratios generated by each of the missing data approaches when applied to the originating (nonsimulated) data. Estimates of the hazard ratio when using complete case analysis (1.18; 95% CI = 1.05, 1.32), multiple imputation (1.22; 95% CI = 1.11, 1.34), and PS calibration (1.20; 95% CI = 1.09, 1.31) were similar, but complete case had a wider confidence interval. Dissimilarity between complete case and other approaches may have been greater in our simulations than in the originating data due to the association between outcome and treatment with missingness induced by the simulation approach.

Discussion
Missingness in confounder data presents a challenge to the validity of EHR-based comparative effectiveness analyses. To address this challenge, our study investigated the performance of multiple imputation and PS calibration approaches to accounting for this missingness when using IPTW to estimate marginal effects. In the presence of MAR or MNAR confounder data in a setting resembling an EHR-based investigation of treatment for bladder cancer, both multiple imputation and PS calibration exhibited minimal bias, even when the proportion of patients with missing data was ≥50%. These findings provide evidence that both methods are suitable for use in EHR-based IPTW comparative effectiveness research with high proportions of missing data in key confounder variables, similar to the setting used to generate our plasmode-simulated data. Given the comparable performance of multiple imputation and PS calibration in our simulations and the considerably greater computational resources required for multiple imputation, our findings also support use of PS calibration when large data sets and large numbers of confounders with missing data make multiple imputation infeasible.

A complete case IPTW approach to dealing with MAR or MNAR confounder data induces selection bias by systematically excluding patients based on the value of variables (confounders) that are associated both with treatment and outcome.18,39 Multiple imputation and PS calibration approaches avoid selection bias but maintain the fundamental IPTW assumption that treatment assignment is independent of baseline covariables after weighting.43,44 With this baseline independence in mind, other key assumptions to the performance of multiple imputation and PS calibration in our simulations include the following: (1) a correctly specified propensity score model, (2) no unmeasured confounding, and (3) no treatment effect modification. Given these key assumptions, we expect multiple imputation to perform well if imputation models accurately capture the distribution of the missing data. We expect PS calibration to perform well if the error-prone propensity score is a surrogate for the gold-standard propensity score.20

Our results showing minimal bias of multiple imputation and PS calibration under MNAR missingness in confounders are counterintuitive and may depend on our assumptions of no effect modification, no unmeasured confounding, and no missing treatment or outcome.45 Furthermore, observed performance under MNAR may be dependent on the strength of the confounder and the strength of the MNAR mechanism. Our simulations examined a moderately strong confounder with a strong MNAR mechanism. In the absence of missing data, absolute bias in unadjusted models ranged from 0.08 to 0.14 (eTable 2). The parameter for the ECOG PS effect in the outcome simulation model was equivalent to hazard ratios of 1.38 and 2.61 for ECOG PS = 1 or 2 and ECOG PS = 3 or 4 versus 0, respectively (eFigure 1). Additionally, ECOG PS = 1 or 2 had twice the weight and ECOG PS = 3 or 4 had 4 times the weight of other predictors in the MNAR model. Although we believe these are reasonable parameter values for an MNAR confounder in EHR data, future studies that examine a range of confounder and MNAR mechanism strengths may be better suited to evaluate sensitivity of missing data methods to bias when there is MNAR missingness in confounder data.

MI approaches to missing data in IPTW-based comparative effectiveness research are simultaneously well studied and underutilized. Three articles published between 2019 and 2021 alone focused on the relative performance of across versus within approaches to MI in IPTW-based analyses.16–19 Nonetheless, applied research lags behind methodologic research: a systematic literature review of propensity score-based comparative effectiveness research found that 55% of articles published between 2016 and 2017 did not state what missing data approach was used, 29% used complete-case analysis, and 16% used multiple imputation.15 One aim of our study was to bridge this gap between theory and application by examining and discussing the performance of MI and PSC using simulations grounded in real-world data. Indeed, IPTW analyses that applied MI or PSC to address missingness in our simulations exhibited minimal bias even at 70% missingness, directly countering the conventional wisdom that EHR-based studies provide unreliable results because of the large proportion of missingness in confounder data.


A weakness of our study is that simulated data were sampled excluding all patients from the original dataset with missing data. This meant that while some intervariable relationships were preserved, others may have been induced via selection bias. Furthermore (and more broadly), nongeneralizability is a limitation of all simulation-based investigations. Results of our simulations, which were based on a specific real-world dataset, may not be generalizable to other real-world datasets. We believe the plasmode simulation approach is more broadly generalizable than a purely synthetic approach, but we recommend that investigators carefully consider the context of their own data (with particular attention to strongly MNAR variables) when applying these methods.

Although PS calibration performed well in our simulations, there are situations where we expect it to perform poorly. First, the method assumes error-prone propensity scores are surrogates for gold-standard propensity scores. If this surrogacy assumption is violated, we can no longer expect weights derived from the calibrated error-prone propensity scores to adequately balance the characteristics of the treated and untreated patients. If many important treatment predictors have missing data, they drop out of the error-prone model (which only includes predictors with complete data), reducing its complexity and increasing the chances of violating the surrogacy assumption. We also assume that the gold standard propensity score model, which is fit using only complete cases, generalizes to the rest of the data set. In practice, one cannot validate either the surrogacy or generalizability assumptions, leaving residual uncertainty about the validity of results that employ the PS calibration approach in applied analyses.


In the presence of MCAR, MAR, or MNAR confounder data resembling a real-world study of treatment for bladder cancer, IPTW analysis with multiple imputation or propensity score calibration produced minimally biased treatment effect estimates. We recommend multiple imputation and propensity score calibration as approaches to dealing with missingness in MCAR or MAR confounder data when conducting an EHR-based IPTW comparative effectiveness analysis similar to the motivating study investigated here, even when the proportion of missing data is high. In large EHR-derived datasets, propensity score calibration may be preferred due to its greater computational efficiency.


1. Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of Electronic Health Record Systems among U.S. Non-Federal Acute Care Hospitals: 2008-2015. Office of the National Coordinator for Health Information Technology; 2016.
2. Parasrampuria S, Henry J. Hospitals Use of Electronic Health Records Data, 2015-2017. Vol 46. Office of the National Coordinator for Health Information Technology; 2019.
3. 114th Congress. 21st Century Cures Act. Public Law 114–255; 2016.
4. Lye CT, Forman HP, Daniel JG, Krumholz HM. The 21st Century Cures Act and electronic health records one year later: will patients see the benefits? J Am Med Inform Assoc. 2018;25:1218–1220.
5. Fowles JB, Weiner JP. Electronic health records and the reliability and validity of quality measures: a review of the literature. Med Care Res Rev. 2010;67:503–527.
6. Sadetsky N, Chuo CY, Davidoff AJ. Development and evaluation of a proxy for baseline ECOG PS in advanced non-small cell lung cancer, bladder cancer, and melanoma: An electronic health record study. Pharmacoepidemiol Drug Saf. 2021;30:1233–1241.
7. National Academies of Sciences: Engineering and Medicine, Board on Health Care Services, Committee on Health Care Utilization and Adults with Disabilities. Health-Care Utilization as a Proxy in Disability Determination. National Academies Press; 2018. doi:10.17226/24969
8. Taber JM, Leyva B, Persoskie A. Why do people avoid medical care? A qualitative study using national data. J Gen Intern Med. 2015;30:290–297.
9. Feld E, Harton J, Meropol NJ, et al. Effectiveness of first-line immune checkpoint blockade versus carboplatin-based chemotherapy for metastatic urothelial cancer. Eur Urol. 2019;76:524–532.
10. Oken MM, Creech RH, Tormey DC, et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol. 1982;5:649–655.
11. Takvorian SU, Parikh RB, Vader D, et al. Impact of COVID-19 pandemic on time to treatment initiation for patients with advanced cancer. J Clin Oncol. 2021;39(15_suppl):1528–1528.
12. Cepeda MS, Boston R, Farrar JT, Strom BL. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. Am J Epidemiol. 2003;158:280–287.
13. Austin PC. The use of propensity score methods with survival or time-to-event outcomes: Reporting measures of effect similar to those used in randomized experiments. Stat Med. 2014;33:1242–1258.
14. Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci. 1999;14:29–46.
15. Malla L, Perera-Salazar R, McFadden E, Ogero M, Stepniewska K, English M. Handling missing data in propensity score estimation in comparative effectiveness evaluations: a systematic review. J Comp Eff Res. 2018;7:271–279.
16. Ling A, Montez-Rath M, Mathur M, Kapphahn K, Desai M. How to apply multiple imputation in propensity score matching with partially observed confounders: a simulation study and practical recommendations. J Mod Appl Stat Methods. 2021;19:1–65.
17. Granger E, Sergeant JC, Lunt M. Avoiding pitfalls when combining multiple imputation and propensity scores. Stat Med. 2019;38:5120–5132.
18. Leyrat C, Seaman SR, White IR, et al. Propensity score analysis with partially observed covariates: How should multiple imputation be used? Stat Methods Med Res. 2019;28:3–19.
19. Mitra R, Reiter JP. A comparison of two methods of estimating propensity scores after multiple imputation. Stat Methods Med Res. 2016;25:188–204.
20. Stürmer T, Schneeweiss S, Avorn J, Glynn RJ. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol. 2005;162:279–289.
21. Stürmer T, Schneeweiss S, Rothman KJ, Avorn J, Glynn RJ. Performance of propensity score calibration: a simulation study. Am J Epidemiol. 2007;165:1110–1118.
22. Lin HW, Chen YH. Adjustment for missing confounders in studies based on observational databases: 2-stage calibration combining propensity scores from primary and validation data. Am J Epidemiol. 2014;180:308–317.
23. Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutr. 1997;65:1179S–1186S.
24. Fraser GE, Stram DO. Regression calibration in studies with correlated variables measured with error. Am J Epidemiol. 2001;154:836–844.
25. Spiegelman D, Schneeweiss S, McDermott A. Measurement error correction for logistic regression models with an “Alloyed Gold Standard.” Am J Epidemiol. 1997;145:184–196.
26. Zhang Q, Gossai A, Monroe S, Nussbaum NC, Parrinello CM. Validation analysis of a composite real-world mortality endpoint for patients with cancer in the United States. Health Serv Res. 2021;56:1281–1287.
27. Franklin JM, Schneeweiss S, Polinski JM, Rassen JA. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Comput Stat Data Anal. 2014;72:219–226.
28. Vaughan LK, Divers J, Padilla MA, et al. The use of plasmodes as a supplement to simulations: A simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal. 2009;53:1755–1766.
29. Ma X, Long L, Moon S, Adamson BJS, Baxi SS. Comparison of population characteristics in real-world clinical oncology databases in the US: flatiron health, SEER, and NPCR [PREPRINT]. medRxiv. 2020. doi:10.1101/2020.03.16.20037143.
30. Birnbaum B, Nussbaum N, Seidl-Rathkopf K, et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research [PREPRINT]. arXiv. 2020.
31. Parikh RB, Min EJ, Wileyto EP, et al. Uptake and survival outcomes following immune checkpoint inhibitor therapy among trial-ineligible patients with advanced solid cancers. JAMA Oncol. 2021;7:1843–1850.
32. Breslow NE. Analysis of survival data under the proportional hazards model. Int Stat Rev/ Rev Int Stat. 1975;43:45–57.
33. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24:1713–1723.
34. Schouten RM, Lugtig P, Vink G. Generating missing values for simulation purposes: a multivariate amputation procedure. J Stat Comput Simul. 2018;88:2909–2930.
35. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
36. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560.
37. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1−67.
38. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med. 2009;28:1982–1998.
39. Rubin DB. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc.; 1987. doi:10.1002/9780470316696
40. Rudolph KE, Stuart EA. Using sensitivity analyses for unobserved confounding to address covariate measurement error in propensity score methods. Am J Epidemiol. 2018;187:604–613.
41. Ridgeway G, Mccaffrey D, Morral A, Burgette L, Griffin BA. Toolkit for weighting and analysis of nonequivalent groups: a tutorial for the twang package. Rand. 2017:1–30.
42. Ridgeway G, McCaffrey D, Morral A, Griffen BA, Burgette L. twang: Toolkit for Weighting and Analysis of Nonequivalent Groups. R package version 1.5; 2017.
43. Joffe MM, Ten Have TR, Feldman HI, Kimmel SE. Model selection, confounder control, and marginal structural models. Am Stat. 2004;58:272–279.
44. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015;34:3661–3679.
45. Choi J, Dekkers OM, le Cessie S. A comparison of different methods to handle missing data in the context of propensity score analysis. Eur J Epidemiol. 2019;34:23–36.

Comparative effectiveness research; Data quality; Electronic health records; Inverse probability of treatment weighting; Missing data; Propensity scores; Statistical bias


Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc.