Confounding bias is a formidable challenge in observational research. One solution to reduce this vulnerability is to conduct randomized trials; however, such experiments frequently are infeasible, unethical, or irrelevant to the question posed. Another method to address confounding uses statistical controls for variables that may bias the exposure-outcome relationship. This method generally manifests as variations of two conceptual approaches: covariate adjustments and propensity scores. The more traditional approach is covariate adjustment, where a predictive model for the outcome attempts to parse the effect independently attributable to several “predictors.” Propensity scores were first introduced in 1983 by Rosenbaum and Rubin (^{1}), and at the time reflected a novel approach to problems of confounding in observational research. Since then, particularly in recent years, propensity score methods, and specifically the propensity score matching (PSM) technique, have become increasingly prevalent in clinical research. Critical care research proves no exception (^{2}); this journal published more than 10 articles utilizing propensity score methods in 2017 alone (^{3–12}).

We will briefly review what propensity scores do, and then examine 10 pearls and pitfalls important for interpretation from the perspective of evidence-based care (Table 1). The end-users of this article are research-savvy clinicians rather than seasoned biostatisticians; I endeavor to avoid the detailed statistical mechanics of propensity scores, which are reviewed elsewhere (^{13}). Instead, this report discusses propensity concepts this author finds commonly underappreciated among clinicians. Many of these can be illustrated by the report by Tsai et al (^{14}) in this issue of *Critical Care Medicine*.

## STEROIDS IN CARDIAC ARREST SURVIVORS: WHAT THE AUTHORS DID

In their article, Tsai et al (^{14}) asked whether steroid administration improved survival for patients who experienced nontraumatic cardiac arrest and underwent successful cardiopulmonary resuscitation in the emergency department. To answer the question, they leveraged a national database of nearly 20,000 patients and constructed a one-to-one propensity-matched cohort of 5,445 matched pairs. They performed multivariable regression within the matched cohort to estimate the effect of postarrest steroids on mortality. They also present subgroup analysis to test whether baseline factors modify the effect of steroids and similar analysis stratifying treatment by total steroid dose given in the month after arrest. They found steroid-treated patients survived more often than nontreated controls, and concluded, “Post-arrest steroid use is associated with better survival to hospital discharge and 1-year survival.” We can use their study to review how this method works and highlight considerations for interpretation.

## TRADITIONAL COVARIATE ADJUSTMENT AND OVERFITTING

A traditional approach to confounding uses multivariable regression to adjust for covariates. In covariate adjustment, multiple factors are included in a single model to assess their association with an outcome independent of each other (**Appendix 1** offers an abridged overview of multivariable covariate adjustment). In predictive models such as logistic regression, we define overfitted models as having more predictor variables than necessary. Overfitting is a violation of the rule of parsimony (^{15}), sometimes called “Occam’s Razor,” a heuristic principle stating that, among several hypotheses, the one making the fewest assumptions is most likely to be correct. Each additional variable represents an added set of assumptions, and unduly complex models may describe associative relationships that are in reality attributable to random variation. **Appendix 2** (Fig. 1) provides an illustrative hypothetical example of how this occurs.

Overfitting becomes problematic for clinical research, which attempts to discern relationships for variables independent of confounding influence. Omitting important confounders can bias results, but including too many variables may also be inaccurate if doing so attributes meaning to random noise. Overfitting has long been problematic in medical literature (^{16}) and likely contributes to irreproducibility in science (^{17}). Therefore, observational studies must balance maximizing confounder adjustment and minimizing overfitting risk.

Although hardly the only concern with traditional covariate adjustment, overfitting is important to understand because it is the primary limitation that propensity techniques may be able to overcome.

## WHAT PROPENSITY SCORES DO

Figure 2 illustrates conceptual differences between propensity and covariate adjustments. The propensity score is the probability a subject will be allocated to exposure versus control, given a set of independent predictor variables (^{1}). Let us consider the article by Tsai et al (^{14}). One can imagine a traditional regression model, where potential confounders are included, but instead of testing their association with mortality (the outcome), we instead test their association with steroid administration (the exposure). The model estimates how each covariate influences propensity for receiving steroids. We can then calculate any individual patient’s probability of receiving steroids given their status with respect to the variables used to determine propensity estimates. For example, asthma was associated with roughly 1.5× (50%) increased steroid propensity, so for two otherwise identical patients where only the second had asthma, the steroid “propensity score” might be 20% for the first patient and 30% for the second. Note that steroid propensity is not the same as actually receiving steroids. Quantifying propensity differences facilitates comparisons (steroids vs no steroids) correcting for how likely subjects were to be in one group versus another. Ideally, this approximates the probability environment of randomized trials, where patients are equally likely to be in either group.
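The asthma example can be made concrete with a minimal sketch. It assumes a logistic propensity model with a single covariate; the intercept and coefficient are hypothetical values chosen only so that the two patients land on the 20% and 30% propensities used above, and are not estimates from Tsai et al.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def propensity(intercept, coefficients, covariates):
    """Predicted probability of exposure from a logistic model."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in covariates.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical model: values picked to reproduce the 20% vs 30% illustration.
INTERCEPT = logit(0.20)
COEFFICIENTS = {"asthma": logit(0.30) - logit(0.20)}

p_without = propensity(INTERCEPT, COEFFICIENTS, {"asthma": 0})
p_with = propensity(INTERCEPT, COEFFICIENTS, {"asthma": 1})
asthma_odds_ratio = math.exp(COEFFICIENTS["asthma"])

print(round(p_without, 2), round(p_with, 2), round(asthma_odds_ratio, 2))
```

In this sketch the probability ratio is 1.5 (30% vs 20%), whereas the corresponding odds ratio from the logistic coefficient is about 1.71; the two scales should not be confused when reading a propensity model.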

Several procedures can adjust for propensity (^{18}). Tsai et al (^{14}) use PSM. PSM matches individuals in the exposure group to a control patient with equal propensity score (Fig. 3). Alternative approaches include propensity score stratification, inverse-probability weighting (IPW), and propensity adjustment. Stratification groups patients by propensity scores. IPW is more complex. IPW weights subjects who receive unexpected allocations more heavily: for example, a steroid-receiving patient with 10% steroid propensity is weighted more heavily than another steroid patient with 50% propensity. IPW has the benefit of allowing correction across more than two groups, but comes with the trade-off of less stability (^{19}), meaning that outliers more easily distort estimates. Propensity adjustment refers to including propensity score as an additional covariate in traditional models. Combining propensity and covariate adjustments is called “doubly robust estimation (DRE)” (^{20}). DRE requires larger datasets but may produce superior estimates versus either propensity or covariate adjustments alone (^{20}), since obtaining reliable estimates theoretically requires only one of the two models to be correctly specified. Tsai et al (^{14}) generated a PSM cohort for steroid use and performed multivariable regression in the matched cohort, so we will focus on PSM and DRE.
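The IPW weighting described above can be sketched in a few lines. This assumes the common unstabilized form of the weights (one over the probability of the allocation actually received); the function name is hypothetical, and the propensities are the 10% and 50% values from the example.

```python
def ipw_weight(received_treatment, propensity_score):
    """Unstabilized inverse-probability weight for one subject."""
    if not 0.0 < propensity_score < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    if received_treatment:
        # Treated subjects are weighted by 1 / P(treated).
        return 1.0 / propensity_score
    # Controls are weighted by 1 / P(not treated).
    return 1.0 / (1.0 - propensity_score)


w_unexpected = ipw_weight(True, 0.10)  # steroid patient with 10% propensity
w_typical = ipw_weight(True, 0.50)     # steroid patient with 50% propensity
w_extreme = ipw_weight(True, 0.01)     # a near-outlier allocation

print(w_unexpected, w_typical, w_extreme)
```

The last value hints at why IPW can be unstable: under these assumptions a single treated patient with 1% propensity counts as much as 100 patients, so a handful of unexpected allocations can dominate the estimate.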

## PEARLS AND PITFALLS

With this foundation, we can now reflect on unique considerations for clinical propensity studies.

### Pearl: Propensity Scores Are Equally Reliable As Traditional Covariate Adjustments

Empirical evidence consistently indicates propensity-based methods for adjustment perform comparably to covariate adjustment and produce similar estimates of effect overall (^{19, 21–23}). This has been shown for various applications of propensity scores, including PSM, IPW, propensity score stratification, and adjustment for propensity score as a covariate (^{19}). Authors claiming PSM as a strength rather than limitation of their study should meet the same scrutiny that similar claims about multivariable adjustment would. Tsai et al (^{14}) appropriately discuss this in their study limitations.

### Pitfall: Propensity Scores Do Not Mitigate Unobserved Biases

Relatedly, we must remember that propensity methods only address observed bias. Misconceptions otherwise probably stem, at least partially, from PSM’s emulation of randomized trials’ probability environment. This perception is categorically false.

The most optimal propensity method applied to ideal datasets in perfectly suited scenarios still only corrects for observed bias. Tsai et al (^{14}) matched on whether subjects had chronic kidney disease (CKD). The CKD frequency appears balanced between steroid and nonsteroid groups (10.6% vs 10.4%) in the matched cohort, so it might seem reasonable to think that this dimension is therefore adequately controlled. But what if CKD patients receiving steroids have average estimated glomerular filtration rate (eGFR) equal to 50 mL/min, but controls have eGFR equal to 25 mL/min? Clearly, CKD means something different between groups. In this case, controlling for a binary CKD variable will not sufficiently balance them, and the PSM cohort could in fact have significant selection bias. Unfortunately, these data were not available to investigators, so we cannot know for certain. This illustrates that even when we account for confounders, we can never truly ensure we sufficiently captured all the needed information. Even documented variables may not possess adequate granularity.

### Pearl: Propensity Scores, in Specific Circumstances, Are Extremely Useful

In some scenarios, propensity scores prove extremely useful in ways covariate adjustments are not. The most apparent is when outcomes occur uncommonly. Here, propensity scores have lower overfitting risk for any fixed number of potential confounders. The explanation is surprisingly simple. To estimate association between exposure and outcome, we assess the proportion of exposed patients experiencing outcome. By definition, there are fewer outcomes than exposures. If we want some minimum number of “events”-per-covariate in logistic models to maintain acceptably low risk of overfitting (e.g., a somewhat flawed convention is 10 events/variable) (^{24–27}), we can include more potential confounders when modeling propensity: there are more “events” when the event is the allocation rather than the outcome. Subsequently, propensity scores provide a single factor that we can correct for in determining treatment effect. Therefore, when outcomes are rare, techniques such as PSM can provide better estimates than traditional adjustment.
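The events-per-variable arithmetic can be illustrated with a short sketch. The cohort numbers are hypothetical, and the 10 events/variable threshold is used only because it is the (somewhat flawed) convention mentioned above; the sketch also assumes that the limiting count is the rarer of the two categories of a binary dependent variable.

```python
def max_covariates(n_events, n_non_events, events_per_variable=10):
    """Number of covariates a logistic model can carry under an EPV rule."""
    limiting_count = min(n_events, n_non_events)
    return limiting_count // events_per_variable


# Hypothetical cohort: 1,000 patients, 300 exposed, 40 with the outcome.
n_patients = 1000
n_exposed = 300
n_outcomes = 40

outcome_model_limit = max_covariates(n_outcomes, n_patients - n_outcomes)
propensity_model_limit = max_covariates(n_exposed, n_patients - n_exposed)

print(outcome_model_limit, propensity_model_limit)
```

With these made-up numbers, an outcome model could accommodate about 4 covariates under the convention, whereas a propensity model could accommodate about 30, because exposure is far more common than the outcome.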

Another useful application of propensity scores is when suspected confounders exist in complex temporal patterns with respect to outcomes and exposures. For example, an exposure may lead to a confounding relationship that develops after allocation occurs. These situations are not implausible, such as if a particular treatment or diagnosis makes an ICU patient more likely to be discharged to a particular inpatient service versus controls. We cannot include discharge service when modeling propensity because this occurs after treatment starts (meaning, allocation is the independent variable, and discharge service is dependent). However, discharge to the service could nevertheless confound outcomes (e.g., if a particular service orders disproportionately frequent gentamicin, that might be important if we are interested in post ICU renal function). By using both propensity and traditional covariate adjustments in separate models, that is, DRE, we can attempt to disentangle this complicated scenario.

In the article by Tsai et al (^{14}), which had thousands of patients and roughly 20% survival, none of these scenarios are present. Thus, using propensity scores to perform DRE gave the authors another chance to correctly specify a model, but probably doesn’t add much beyond that.

### Pitfall: Pay Attention to Missing Data

Propensity scores may have heightened susceptibility to missing data (^{28, 29}). Many models, including the logistic regression used to calculate propensity scores, require complete datasets (i.e., without any missing data). Consequently, researchers often drop patients with missing data from analysis, perhaps without even realizing it, because many statistical software programs exclude subjects with missing data by default. With any method, this clearly presents opportunity for bias if data are not missing completely at random.

More insidiously, though, small subsets of patients randomly missing data can bias propensity scores for all patients. In the article by Tsai et al (^{14}), only four patients (of 19,229) had missing data, so excluding them is likely not a major concern. However, consider if instead 1,900 patients (10%) were missing at least one data-point and were then excluded from propensity modeling. Suppose many of these 1,900 were randomly missing seemingly less important variables (e.g., hypertension). If those patients with missing data included 100 more asthmatic patients in the steroid group than the nonsteroid group, asthma’s influence on propensity could be grossly misestimated. This could occur despite none of these subjects having missing asthma data, making it a potentially challenging problem to identify in the first place. As such, readers should always clarify the prevalence of missing data, as well as whether multiple imputation or other methods were used to better address this.
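A toy sketch of complete-case (listwise) deletion shows how this can happen. The six records and field names below are entirely hypothetical; `None` marks a missing value, and the helper functions are illustrative rather than any particular software's default behavior.

```python
def complete_cases(records, required_fields):
    """Split records into those with all required fields and those without."""
    kept, dropped = [], []
    for record in records:
        if any(record.get(field) is None for field in required_fields):
            dropped.append(record)
        else:
            kept.append(record)
    return kept, dropped


def count_by(records, field):
    """Count records per value of one field."""
    counts = {}
    for record in records:
        counts[record[field]] = counts.get(record[field], 0) + 1
    return counts


# Hypothetical data: only hypertension is ever missing.
records = [
    {"steroid": 1, "asthma": 1, "hypertension": None},
    {"steroid": 1, "asthma": 1, "hypertension": None},
    {"steroid": 1, "asthma": 0, "hypertension": 1},
    {"steroid": 0, "asthma": 1, "hypertension": 0},
    {"steroid": 0, "asthma": 0, "hypertension": None},
    {"steroid": 0, "asthma": 0, "hypertension": 1},
]

kept, dropped = complete_cases(records, ["steroid", "asthma", "hypertension"])
dropped_asthmatics = count_by([r for r in dropped if r["asthma"] == 1], "steroid")

print(len(kept), len(dropped), dropped_asthmatics)
```

Here no record is missing asthma status, yet deleting incomplete records removes every asthmatic patient from the steroid group, which would distort asthma's apparent relationship with steroid propensity in the remaining data.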

### Pearl: Matching Always Involves Trade-Offs Between Internal Validity and Generalizability

One of the most important concepts to appreciate when interpreting a study using PSM specifically is the method’s inherent trade-off between internal and external validity. PSM pairs equal propensity exposed subjects with controls. If none of the controls has equal propensity to an exposed subject, that subject remains unmatched and is excluded from the final cohort. That is, outside ideal populations where every exposed subject can be matched to an equal propensity control and vice versa, PSM always involves loss of information. Depending on the number and characteristics of unmatched subjects, this may or may not be important.

For example, 8,307 patients, more than 60%, in the study by Tsai et al (^{14}) who did not receive steroids were unmatched and excluded. This could be because they had low propensity, precluding their matching with patients who received steroids. When this occurs, PSM excludes patients unlikely to receive the intervention, although an analysis could just as easily go the other way, excluding patients with high propensities for treatment/exposure. This is not a value judgment. The analysis may well retain accuracy for its included population, and comparing groups with low propensity could still reveal important information. However, how this impacts generalizability remains critical to ascertain. In these scenarios, findings might not apply to the patients most likely to actually be treated/exposed. Whether that excludes the patients we are most interested in must be determined on a case-by-case basis.

We see an analogous issue in the other reason so many nonsteroid patients are excluded from the article by Tsai et al (^{14}): the 1:1 matching ratio. This becomes relevant when an exposure occurs much more or less frequently than the control. A 1:1 rule, meaning exactly one steroid and nonsteroid patient for each matched pair, can maximize matches at any given caliper-width (see below). However, this excludes many patients with the more common exposure. Conversely, if the ratio changed, say to 1:2, this might reduce the total number of excluded patients. However, the total number of matches for any given caliper-width is likely to decrease, since each exposure now requires an additional equal-propensity control. Again, the ideal approach to the trade-off is situation-specific.

In practical terms, authors can provide helpful information to evaluate this trade-off by including figures showing frequencies as a function of propensity score for exposed and control groups (akin to Fig. 3 and Fig. 4A). This allows readers to get a general sense of how many patients are excluded and why. An even more transparent approach is to include tables with descriptive characterizations of both matched and unmatched cohorts along with these figures.

### Pearl: Consider Caliper-Widths and Comparability of Matched Pairs

Unmatched subjects do not intrinsically damage internal validity in PSM, but the process of determining who goes matched or unmatched warrants additional considerations.

Until now, we allowed the assumption that all matched pairs have precisely equal propensity, but this is not always the case. When modeling continuous variables, similar patients may have minor propensity discrepancies. If we required perfect equivalence, we could not match otherwise identical patients with serum sodium concentrations of 138 and 139 mEq/L, respectively, despite this difference being physiologically irrelevant. Clearly, practicality necessitates PSM to permit some degree of variation. The simplest solution is nearest-neighbor matching, which matches exposed subjects to the control with the smallest propensity difference (other methods are beyond the scope of this article) (^{30}). This seems reasonable for small differences, but larger differences may produce bad matches—matching subjects with 55.2% and 55.3% propensity versus 75.0% and 95.0% propensity. In the latter case, the matched patients likely are not comparable. The most straightforward means of avoiding this problem imposes a threshold for the maximum propensity difference we will tolerate between matches. At a 5% threshold, the latter subjects above would remain unmatched. We call this maximum tolerated difference caliper-width.
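A minimal sketch of nearest-neighbor matching with a caliper follows. It assumes greedy 1:1 matching without replacement and a caliper expressed on the probability scale, as in the example above; the patient identifiers and the function name are hypothetical, and real analyses may use other matching algorithms or scales.

```python
def match_with_caliper(exposed, controls, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    exposed and controls map a subject id to a propensity score.
    Returns matched pairs, unmatched exposed ids, and unused control ids.
    """
    available = dict(controls)
    pairs, unmatched = [], []
    for exposed_id, score in exposed.items():
        if not available:
            unmatched.append(exposed_id)
            continue
        # Find the remaining control with the smallest propensity difference.
        nearest_id = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[nearest_id] - score) <= caliper:
            pairs.append((exposed_id, nearest_id))
            del available[nearest_id]
        else:
            unmatched.append(exposed_id)
    return pairs, unmatched, sorted(available)


exposed = {"E1": 0.552, "E2": 0.950}
controls = {"C1": 0.553, "C2": 0.750, "C3": 0.100}

pairs, unmatched, unused = match_with_caliper(exposed, controls, caliper=0.05)
pairs_wide, unmatched_wide, _ = match_with_caliper(exposed, controls, caliper=0.25)

print(pairs, unmatched, unused)
print(pairs_wide, unmatched_wide)
```

At the 5% caliper, the 55.2%/55.3% pair is matched and the 95.0% subject stays unmatched; widening the caliper to 25% admits the 95.0%/75.0% pair, which is the kind of poorly comparable match the threshold is meant to prevent. Note that greedy matching depends on the order in which exposed subjects are processed.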

Determining acceptable caliper-width beforehand can be difficult (^{31}). Procedure for selecting calipers remains subject to debate, as does the influence of calipers on treatment effect estimates (^{32–34}). Caliper appropriateness and importance are likely situation specific.

Why care about this? Bad matching undermines accuracy. Often, larger studies can employ extremely small calipers to mitigate vulnerability. When this is infeasible, readers must themselves decide whether matches were appropriate. The reasonableness of 10.0%, 1.0%, or 0.1% calipers is influenced by study questions, populations, and the nature of the data (e.g., number of continuous vs categorical variables). More importantly, greater internal validity with narrow calipers requires balancing against lost generalizability from increased numbers of unmatched patients (Fig. 4, B and C). Large studies employing excessively narrow calipers may unnecessarily exclude patients.

Clearly, “closeness” of matches represents an important methodological element. A study could achieve excellent cohort matching with improperly wide calipers, potentially misrepresenting results, whereas setting unduly narrow calipers may miss opportunities for generalizable inference. The practical steps described in Pearl 5 (above) and 7 (below) can help with making this assessment.

### Pearl: Assessing Whether Matching Worked

To assess group comparability in practical terms, we should ask two separate questions: 1) “Did PSM successfully produce groups with balanced characteristics?” and 2) “Did PSM successfully resolve confounding?” That is, did PSM do what it purported to do, and did it do what we actually care about?

The first is readily answered. If we look at frequencies and means of Table 1 variables in the matched groups, we can decide whether any differences have clinically relevant magnitude. Standardized differences and the variances of continuous variables also provide helpful information (^{35}). *p* values are deeply problematic for such comparisons and should not be reported, let alone given weight. These *p* value issues are not unique to PSM and are discussed extensively elsewhere (^{36–38}).
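As an illustration, a standardized difference for a binary variable can be computed directly from the two group proportions. The sketch below assumes the usual pooled-variance formulation for proportions and uses the matched-cohort CKD frequencies quoted earlier in this article (10.6% vs 10.4%); the function name is hypothetical.

```python
import math


def standardized_difference_binary(p_exposed, p_control):
    """Standardized difference between two proportions (pooled variance)."""
    pooled_variance = (
        p_exposed * (1.0 - p_exposed) + p_control * (1.0 - p_control)
    ) / 2.0
    if pooled_variance == 0.0:
        # Both groups are all-0 or all-1; no spread to standardize against.
        return 0.0
    return (p_exposed - p_control) / math.sqrt(pooled_variance)


ckd_smd = standardized_difference_binary(0.106, 0.104)
print(round(ckd_smd, 4))
```

The result is well under 0.1, a commonly used (though conventional rather than definitive) threshold for negligible imbalance. Unlike a *p* value, this quantity does not shrink or grow with sample size alone, which is part of why it is preferred for balance checks.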

Observing balance of matching variables suggests that matching produced a cohort as specified but provides limited information about our second question. Although we cannot rule out unobserved confounding, some tools may help amass “absence of evidence” in lieu of “evidence of absence.”

One option is to check analogous or higher resolution variables than those used in matching. For example, if matching on systolic blood pressure, analogous variables might be diastolic or mean arterial pressures. If matching on acute kidney injury, renal replacement therapy might be a higher resolution variable. This approach can be limited if data-points are unavailable or because, with sufficient sample size, authors may want to just use the higher resolution variable to match in the first place.

Another option is to use falsification endpoints. We can think of these as negative controls (^{39}). The authors prespecify variables that are expected to be unrelated to the exposure and assess if they differ after matching. In Tsai et al (^{14}), we have no logical reason to suspect patients getting steroids versus not should differ by orthopedic surgical history. Such differences, if present after PSM, suggest persistent unobserved differences and potential residual confounding. However, falsification outcomes are likely inadequately sensitive and also require authors to act transparently and in good faith.

### Pitfall: The Sample Size in PSM Is the Number of Matched Subjects

It might now seem obvious that the effective sample size of a PSM cohort is the number of patients in the final matched cohort. For example, Tsai et al (^{14}) propensity-matched 5,445 steroid to 5,445 nonsteroid patients. For inferences about steroids’ association with mortality, the *n* is 10,890, not the 19,299 from which their propensity score is derived. Although likely not an issue in the present study, this has implications for type-1 and type-2 error rates when we see a large difference between the denominators: for example, if only 50 pairs were matched out of the 19,299 in the Tsai population.

### Pitfall: Beware Subgroup Analyses in PSM

Pitfalls of subgroup analyses are myriad and a discussion unto themselves (^{40}). However, they prove particularly prone to mishandling in propensity studies.

Subgrouping after matching distorts comparability. The article by Tsai et al (^{14}) unsurprisingly reports imbalance in the derivation population: 30% more chronic obstructive pulmonary disease (COPD) and 60% more asthma among steroid than nonsteroid patients. This difference appears to resolve after matching. However, the authors then divide their matched population into quartiles based on total steroid dosage. Looking at these subgroups, we can see discrepancy in comorbidities reemerge as dose increases from low-to-high: across quartiles, 3%, 5%, 7%, and 14% for asthma. Meaning, by stratifying on dose, we again observe biases that PSM originally resolved. These four subgroups likely are not comparable to each other and are also each no longer comparable to the nonsteroid control group, which approximates the steroid group overall. These biases compound with the exclusion of patients who did not make it into the original matched cohort. Thus, looking at this subgroup provides even less internal validity and generalizability than looking at unadjusted data.

### Pitfall: Confounding Versus Endogeneity

Finally, propensity scores cannot adjust for endogeneity. Endogeneity, sometimes erroneously referred to as “confounding by indication,” is mathematically distinct from confounding. Briefly, two variables are endogenous if they are mutually deterministic: that is, X causes Y, but Y also causes X. This is different from confounding, a third-variable problem, where X is associated with Z, and Z is also associated with Y, but there is no deterministic property of Y on X. For example, COPD might confound the relationship between postarrest steroids and mortality: COPD patients would more often get steroids and more often die. However, the decision to give steroids is driven by COPD, not by mortality-risk after arrest. Conversely, in the article by Tsai et al (^{14}), the authors offer little information about patients’ physiologic status after resuscitation. Suppose clinicians more often gave steroids to improving patients and withheld them when patients appeared more severely ill. In this case, steroids and mortality would be endogenous, because “prior” information about the likelihood of the outcome influences whether patients receive the exposure. Observational research can (almost) (^{41}) never address endogeneity; propensity scores are no exception. There is an important dissonance here because propensity scores are often deployed in hopes of addressing precisely this problem. Nearly any observational study comparing trauma or resuscitation treatments will encounter endogeneity because decisions to give treatments usually reflect holistic clinical assessments of patients’ acute status.

## SUMMARY/CONCLUSIONS

Propensity scores are potentially useful tools in observational research. They are distinct from traditional covariate adjustment in that they attempt to disentangle confounders from exposure allocation rather than from outcome effects. Although they can provide value by reducing overfitting risk and lending nuance to statistical modeling in select situations, propensity scores are generally not any more reliable than traditional adjustment. Further, propensity scores, and particularly PSM, obligate additional considerations that are less important for traditional covariate adjustments. Clinicians and researchers who read or perform propensity score studies should remember that propensity scores cannot account for unobserved bias or endogeneity, always involve lost information, are susceptible to missing data, and always involve trade-offs between internal and external validity. These issues are particularly relevant to critical care clinical research, and practitioners should appreciate these considerations when interpreting evidence derived from propensity scores.

## ACKNOWLEDGMENTS

The author thanks Drs. Timothy G. Buchman and David M. Maslove for thoughtful comments on earlier drafts of this article, continued opportunity, and support.

## REFERENCES

1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70:41–55

2. Gayat E, Pirracchio R, Resche-Rigon M, et al. Propensity scores in intensive care and anaesthesiology literature: A systematic review. Intensive Care Med 2010; 36:1993–2003

3. Wiewel MA, van Vught LA, Scicluna BP, et al; Molecular Diagnosis and Risk Stratification of Sepsis (MARS) Consortium: Prior use of calcium channel blockers is associated with decreased mortality in critically ill patients with sepsis: A prospective observational study. Crit Care Med 2017; 45:454–463

4. Balakumar V, Murugan R, Sileanu FE, et al. Both positive and negative fluid balance may be associated with reduced long-term survival in the critically ill. Crit Care Med 2017; 45:e749–e757

5. Emrath ET, Fortenberry JD, Travers C, et al. Resuscitation with balanced fluids is associated with improved survival in pediatric severe sepsis. Crit Care Med 2017; 45:1177–1183

6. Beesley SJ, Wilson EL, Lanspa MJ, et al. Relative bradycardia in patients with septic shock requiring vasopressor therapy. Crit Care Med 2017; 45:225–233

7. Krannich A, Leithner C, Engels M, et al. Isoflurane sedation on the ICU in cardiac arrest patients treated with targeted temperature management: An observational propensity-matched study. Crit Care Med 2017; 45:e384–e390

8. Lemiale V, Resche-Rigon M, Mokart D, et al. High-flow nasal cannula oxygenation in immunocompromised patients with acute hypoxemic respiratory failure: A groupe de recherche respiratoire en réanimation onco-hématologique study. Crit Care Med 2017; 45:e274–e280

9. Morris JV, Ramnarayan P, Parslow RC, et al. Outcomes for children receiving noninvasive ventilation as the first-line mode of mechanical ventilation at intensive care admission: A propensity score-matched cohort study. Crit Care Med 2017; 45:1045–1053

10. Moss TJ, Calland JF, Enfield KB, et al. New-onset atrial fibrillation in the critically ill. Crit Care Med 2017; 45:790–797

11. Trauer J, Muhi S, McBryde ES, et al. Quantifying the effects of prior acetyl-salicylic acid on sepsis-related deaths: An individual patient data meta-analysis using propensity matching. Crit Care Med 2017; 45:1871–1879

12. Mather JF, Corradi JP, Waszynski C, et al. Statin and its association with delirium in the medical ICU. Crit Care Med 2017; 45:1515–1522

13. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011; 46:399–424

14. Tsai MS, Chuang PY, Huang CH, et al. Post-arrest steroid use may improve outcomes of cardiac arrest survivors. Crit Care Med 2018. [Epub ahead of print]

15. Hawkins DM. The problem of overfitting. J Chem Inf Comput Sci 2004; 44:1–12

16. Walter S, Tiemeier H. Variable selection: Current practice in epidemiological studies. Eur J Epidemiol 2009; 24:733–736

17. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2:e124

18. Heinze G, Jüni P. An overview of the objectives of and the approaches to propensity score analyses. Eur Heart J 2011; 32:1704–1708

19. Elze MC, Gregson J, Baber U, et al. Comparison of propensity score methods and covariate adjustment: Evaluation in 4 cardiovascular studies. J Am Coll Cardiol 2017; 69:345–357

20. Funk MJ, Westreich D, Wiesen C, et al. Doubly robust estimation of causal effects. Am J Epidemiol 2011; 173:761–767

21. Arbogast PG, Ray WA. Performance of disease risk scores, propensity scores, and traditional multivariable outcome regression in the presence of multiple confounders. Am J Epidemiol 2011; 174:613–620

22. Shah BR, Laupacis A, Hux JE, et al. Propensity score methods gave similar results to traditional regression modeling in observational studies: A systematic review. J Clin Epidemiol 2005; 58:550–559

23. Austin PC, Schuster T, Platt RW. Statistical power in parallel group point exposure studies with time-to-event outcomes: An empirical comparison of the performance of randomized controlled trials and the inverse probability of treatment weighting (IPTW) approach. BMC Med Res Methodol 2015; 15:87

24. Harrell FE Jr, Lee KL, Califf RM, et al. Regression modelling strategies for improved prognostic prediction. Stat Med 1984; 3:143–152

25. Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49:1373–1379

26. van Smeden M, de Groot JA, Moons KG, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol 2016; 16:163

27. van Smeden M, Moons KG, de Groot JA, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res 2018 Jan 1:962280218784726. [Epub ahead of print]

28. Haukoos JS, Newgard CD. Advanced statistics: Missing data in clinical research–part 1: An introduction and conceptual framework. Acad Emerg Med 2007; 14:662–668

29. Newgard CD, Haukoos JS. Advanced statistics: Missing data in clinical research–part 2: Multiple imputation. Acad Emerg Med 2007; 14:669–678

30. Caliendo M, Kopeinig S. Some practical guidance for the implementation of propensity score matching. J Econ Surv 2008; 22:31–72

31. Smith JA, Todd PE. Does matching overcome LaLonde’s critique of nonexperimental estimators? J Econom 2005; 125:305–353

32. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat 2011; 10:150–161

33. Lunt M. Selecting an appropriate caliper can be essential for achieving good balance with propensity score matching. Am J Epidemiol 2014; 179:226–235

34. Wang Y, Cai H, Li C, et al. Optimal caliper width for propensity score matching of three treatment groups: A Monte Carlo study. PLoS One 2013; 8:e81045

35. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009; 28:3083–3107

36. Mark DB, Lee KL, Harrell FE. Understanding the role of p values and hypothesis tests in clinical research. JAMA Cardiol 2016; 1:1048–1054

37. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: Context, process, and purpose. Am Stat 2016; 70:129–133

38. Goodman S. A dirty dozen: Twelve p-value misconceptions. Semin Hematol 2008; 45:135–140

39. Prasad V, Jena AB. Prespecified falsification end points: Can they validate true observational associations? JAMA 2013; 309:241–242

40. Groenwold RH, Donders AR, van der Heijden GJ, et al. Confounding of subgroup analyses in randomized data. Arch Intern Med 2009; 169:1532–1534

41. Iwashyna TJ, Kennedy EH. Instrumental variable analyses. Exploiting natural randomness to understand causal mechanisms. Ann Am Thorac Soc 2013; 10:255–260

## APPENDIX 1: MULTIVARIABLE LINEAR REGRESSION, GROSSLY ABRIDGED

Linear regression produces an equation specifying the value of the outcome variable at given values of the included “predictor” variables, plus an error term quantifying the outcome variation the model cannot explain. That is,

$$Y = \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{n}X_{n} + \mu$$

where Y is the outcome, each X_{i} is the value of a predictor variable, β_{i} is the coefficient of that predictor, and μ is unexplained error/variation. For example, in a simplified conceptual model for the study by Tsai et al (^{14}), Y = mortality risk, X_{1} = steroids, and β_{1} = the slope of the straight line that relates steroid use to mortality risk. (Note that this implicitly assumes the relationship between steroids and survival is linear, a substantial assumption.) X_{2} might be a third variable, such as age, that we believe may confound the relationship of X_{1} (steroids) to Y (mortality risk). If we assume that X_{1} and X_{2} are independent of each other (yet another major and often underappreciated assumption in these models), we can consider a patient’s mortality risk to be the sum of the two predictors’ effects. Within the confines of what the data and our theoretical framework for the question permit, we may continue to add predictor variables to the equation in this fashion to yield a single comprehensive model. By quantifying the β-coefficients, the equation therefore also describes the effect attributable to any single predictor variable separate from the others.
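To make the idea of covariate adjustment concrete, the following is a minimal sketch in Python using simulated data. It is not the analysis performed by Tsai et al. Every number in it is an assumption chosen for illustration: the "true" coefficients, the way age drives steroid use, and the noise level are all hypothetical. The code also adds a constant (intercept) column, a standard detail that the abridged description above does not discuss. Because older patients in this simulation are both sicker and more likely to receive steroids, the crude and age-adjusted steroid coefficients should differ.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical data: age (X2) influences both steroid use (X1) and the outcome (Y),
# which makes age a confounder of the steroid-outcome relationship.
age = rng.normal(65, 10, n)
p_steroid = 1 / (1 + np.exp(-(age - 65) / 10))  # older patients are treated more often
steroids = rng.binomial(1, p_steroid)

# Assumed "true" model (illustrative only): beta_1 = -0.05 for steroids,
# beta_2 = +0.01 per year of age, plus random unexplained variation (mu).
risk = 0.50 - 0.05 * steroids + 0.01 * (age - 65) + rng.normal(0, 0.05, n)


def ols(predictors, y):
    """Least-squares coefficients; a leading column of ones supplies the intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


crude = ols([steroids], risk)          # Y ~ steroids
adjusted = ols([steroids, age], risk)  # Y ~ steroids + age

print(f"crude steroid coefficient:    {crude[1]:+.3f}")
print(f"adjusted steroid coefficient: {adjusted[1]:+.3f}")
print(f"adjusted age coefficient:     {adjusted[2]:+.3f}")
```

In this simulation the crude model folds the effect of age into the steroid coefficient, whereas the adjusted model can separate the two only because age was measured and the linear form happens to match how the data were generated. Neither condition is guaranteed in real observational data.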

## APPENDIX 2: OVERFITTING

Figure 1 illustrates an example of overfitting. Consider a hypothetical study testing whether patients’ baseline oxygen saturation (SaO_{2}) predicts the number of mechanical ventilation days. When ventilation days are plotted as a function of baseline SaO_{2}, we observe the distribution in Figure 1A. If we allow some degree of unexplainable or random variation (rather than making assumptions about explanations for this variation), we might attribute the two outlying patients to statistical noise (*arrows*) and find a positive linear correlation (Fig. 1B). If we then ask this model whether a new patient will require more than a week of mechanical ventilation, the model would predict more or less than a week if baseline SaO_{2} was below or above 82%, respectively (Fig. 1C). Conversely, if we refuse to accept any unexplained variation, we might come up with the relationship described in Fig. 1D. This more complex model perfectly explains all the variation in the dataset used to derive it (by making more assumptions about SaO_{2}’s relationship to ventilation days). However, the model describes an illogical relationship, and if we generalize it to new patients, it would perform poorly (Fig. 1E).

This example is analogous to multivariable models that adjust for too many covariates. Every covariate added to a model increases the number of assumptions and the number of ways the observed data can be explained. For example, in the overfitted model above (Fig. 1, *D* and *E*), one can imagine that every data point represents a patient with a unique admission diagnosis. If one were to model the interaction between SaO_{2} and each individual diagnosis in this dataset, one would arrive at a similar result. As models become increasingly complex, they may become increasingly better at describing the data from which they were derived. However, this makes them more likely to attribute meaningful association to randomness and less likely to generalize well to new patients.
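The same behavior can be reproduced in a few lines of Python. The sketch below uses simulated patients and is not the data behind Figure 1. The linear relationship, the noise level, and the sample sizes are assumptions made purely for illustration. A straight line and a degree-7 polynomial are each fitted to eight hypothetical patients, and each model is then scored on the patients used to derive it and on a fresh sample.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible


def simulate(n):
    """Hypothetical patients: ventilation days depend linearly on SaO2, plus noise.

    Values are not truncated at zero, for simplicity.
    """
    sao2 = rng.uniform(70, 100, n)
    days = 7 - 0.2 * (sao2 - 82) + rng.normal(0, 1.5, n)
    return sao2, days


def mse(model, x, y):
    """Mean squared prediction error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = simulate(8)    # small derivation sample
x_test, y_test = simulate(1000)   # "new patients" the models never saw

train_mse, test_mse = {}, {}
for degree in (1, 7):
    # With 8 points, a degree-7 polynomial has enough parameters to pass
    # through every observation, leaving no unexplained variation.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree}: derivation MSE = {train_mse[degree]:.3g}, "
          f"new-patient MSE = {test_mse[degree]:.3g}")
```

The complex model should fit the derivation sample almost perfectly yet predict new patients worse than the straight line, which is the pattern the text describes for models that adjust for too many covariates.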