Introduction
The analysis of high-quality randomized controlled trials offers the highest level of evidence in clinical outcomes research on therapeutic interventions. Yet there are many scenarios in which randomized controlled trials are inappropriate or impossible, such as for the study of rare conditions; moreover, the generalizability of randomized controlled trials is often limited by strict inclusion and exclusion criteria. Nonrandomized studies (or observational studies) can provide an important complementary source of information, provided that the data are analyzed and interpreted in the context of the confounding bias to which they are prone. This article will explain the special analytic considerations that are required for the proper reporting and interpretation of observational studies.
We begin by reviewing some intuitive, rather than technical, statistical principles that are fundamental to the understanding of what observational data can tell us. The relationship between a study sample and the target population is discussed, as this is the key to statistical inference. The notion of probability distributions is presented as a means of understanding the two primary forms of statistical analysis: estimation and hypothesis testing. The concept of bias, and confounding in particular, is then introduced as a major obstacle to drawing valid conclusions from an observational study. The discussion then focuses on techniques that are most useful in the analysis of the three most common types of observational studies: the case series, the therapeutic study, and the prognostic study. Each type will be introduced with a case example for consideration as basic principles are reviewed, followed by further explanation. The goal of this review is to empower the reader to understand a research question in the context of these categories so as to initiate a practical and valid approach to analysis.
Populations and Distributions
The analysis of any clinical study is based on the principle of taking a random or representative sample of subjects in order to draw some inference about a larger population of similar individuals called the target population (Fig. 1). However, going from a population to a sample leads to some degree of uncertainty or margin of error because of the need to rely on the use of estimation without knowledge of the entire population. To quantify this uncertainty, we rely on mathematically defined probability distributions, such as the normal distribution for continuous data and the binomial distribution for categorical data. Understanding these distributions is fundamental to statistical inference, and the reader is referred to a basic statistical text for more background^{1,2}. These distributions are based on parameters such as mean and standard deviation. If the assumption is made that the observed data are a sample from a population with a distribution that has a known theoretical form, then it is reasonable to use parameters of that distribution (those observed) to calculate probabilities of different values occurring. This parametric approach to statistics is wide-ranging and ubiquitous in medical research. However, if these distributional assumptions are not realistic, then the parametric approach may not generate valid results. When data deviate from a so-called “normal” pattern, nonparametric or distribution-free methods should be used^{3}. Making the decision to use parametric or nonparametric methods is an important early step in the analysis of data and requires the analyst to understand the observed distribution of the data.
Estimation and Hypothesis Testing
Statistical analyses are of two general types: estimation and hypothesis testing. A primary objective of any of the three types of observational studies introduced above is to provide some numerical value that expresses the probability or average of a measured outcome (often expressed as a proportion or mean) or the relative effect associated with a specific treatment or prognostic factor (often expressed as a relative risk or odds ratio). Estimation covers a broad range of statistical procedures that yield the magnitude of risk or effect as well as the precision of that estimate. Hypothesis testing is a method for understanding the likelihood of observing a difference or association from data if no such relationship actually exists in the population. While these two concepts may appear similar and are often simultaneously generated by computerized statistical packages, they actually convey distinct information that ought to be well understood by researchers and those who strive to understand reported results.
Estimation typically involves two components. The first is the calculation of a point estimate of disease or outcome prevalence (typically expressed as a probability, rate, or mean) or effect (typically expressed as an odds ratio, relative risk, or risk difference). What is equally important to statistical inference is the variability or precision of this measure. Estimation gives us this quantity, typically in the form of a confidence interval^{4}, which informs us of how large an error might be made with an estimate of effect. A confidence interval is a range of values that is expected to include the true population value. A 95% confidence interval is constructed by a procedure that, over repeated sampling, yields intervals that contain the true population mean 95% of the time. If, for example, one hundred random samples were drawn from a population in which the true mean is sixty-five, the estimated 95% confidence intervals of approximately ninety-five of those samples should include sixty-five (Fig. 2).
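The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the true mean of sixty-five comes from the example above, while the standard deviation, sample size, number of repetitions, and random seed are assumptions chosen for the demonstration, and the interval uses a normal approximation rather than any particular published method.

```python
import random
import statistics

def ci_covers_truth(true_mean, true_sd, n, rng, z=1.96):
    """Draw one sample and report whether its approximate 95% CI contains true_mean."""
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    return mean - z * standard_error <= true_mean <= mean + z * standard_error

def coverage(true_mean=65.0, true_sd=10.0, n=100, repeats=2000, seed=1):
    """Fraction of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    hits = sum(ci_covers_truth(true_mean, true_sd, n, rng) for _ in range(repeats))
    return hits / repeats

observed = coverage()
print(f"Share of intervals containing the true mean: {observed:.3f}")
```

The printed share should sit close to 0.95, which is the sense in which roughly ninety-five of one hundred intervals are expected to include the true mean.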
Hypothesis testing, while somewhat less intuitive than estimation, is used in the majority of reported results of statistical analysis in which comparisons are made. For hypothesis testing, we state a null hypothesis that the effect of interest is zero. This statistical null hypothesis is often the negation of the research hypothesis that generated the data (that is, there is no difference in the effect of treatment A compared with treatment B). We also have an alternative hypothesis, which is usually simply that the effect of interest is not zero. Having set up our null hypothesis, we can then, with use of a test statistic from a t test, chi-squared test, or similar type of analysis, evaluate the probability that we could have obtained the observed data if the null hypothesis were true. This probability is usually called the p value; the smaller the p value, the less compatible the observed data are with the null hypothesis. When a p value is below some arbitrary cutoff point (e.g., 0.05), the result is often called “significant.” The use of the word significant can lead to much confusion over what is statistically significant and what is clinically significant. Because medical journals report many of their results with use of hypothesis testing, many restrict the use of the word “significant” to those results that meet the statistical definition. The use of cutoff points for p values leads to treating the analysis as a process of decision-making within which it is customary to consider a significant effect as real and a nonsignificant result as indicating no effect. Notice that the p value gives no information about the magnitude of the effect or association that is being investigated. The decision about significance is also problematic because the uncertainty of the result is obscured (whereas it is explicit when a confidence interval is estimated). It is not reasonable to conclude that a nonsignificant result indicates no clinical effect just because the null hypothesis cannot be ruled out. The difference between a p value of 0.048 and 0.052 should not alter the conclusion about an association in the data. For these reasons, the approach based on estimation is often considered superior^{2,5}; moreover, when hypothesis testing is performed, specific p values should be reported in their entirety, allowing readers to make up their own minds about whether the difference is clinically significant.
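To make the distinction between magnitude and p value concrete, the following sketch compares two hypothetical studies with the same risk difference but different sample sizes. All counts are invented for illustration, and the test shown is a simple Wald (normal approximation) test for a difference in proportions, which is an assumption rather than a method named in this article.

```python
import math

def risk_difference_summary(events_a, n_a, events_b, n_b, z_crit=1.96):
    """Return (risk difference, approximate 95% CI, two-sided p value)."""
    risk_a, risk_b = events_a / n_a, events_b / n_b
    diff = risk_a - risk_b
    se = math.sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area of the normal
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# Hypothetical data: identical 10-percentage-point difference, tenfold sample size.
small_study = risk_difference_summary(30, 100, 20, 100)
large_study = risk_difference_summary(300, 1000, 200, 1000)

for label, (diff, (low, high), p) in (("small", small_study), ("large", large_study)):
    print(f"{label}: difference={diff:.3f}, 95% CI=({low:.3f}, {high:.3f}), p={p:.2g}")
```

Both studies estimate the same effect; only the confidence interval shows that the smaller study is compatible with anything from a slight harm to a large benefit, whereas the p value alone would label one result "significant" and the other not.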
The threshold level at which a p value may be considered to be significant also depends on how many times one sample group is compared with another. The more often that a difference between two groups is searched for, the more likely it is that a difference will be found that has occurred purely by chance (that is, the type-I error rate increases)^{6}. This is an important consideration when two groups that are defined by the treatment that is received are compared with respect to multiple outcomes of interest or when multiple subgroup analyses are performed. Multiple testing is ideally performed to generate new hypotheses for future study, rather than to answer multiple questions definitively with the same data. It is important that investigators be explicit about the primary outcome that their study was designed to investigate because this is typically what is used to calculate sample size and power and this is the comparison for which the standard hypothesis test is valid. When multiple comparisons are made and multiple hypothesis tests are undertaken with the same data, the cutoff p value for significance should be lowered accordingly with use of a Bonferroni or similar adjustment method to guard against type-I error^{7}.
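The inflation of type-I error with repeated testing, and the Bonferroni correction for it, can be sketched in a few lines. The calculation assumes independent tests, which is a simplification, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false-positive result across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

for n_tests in (1, 5, 10, 20):
    unadjusted = familywise_error_rate(0.05, n_tests)
    cutoff = bonferroni_threshold(0.05, n_tests)
    adjusted = familywise_error_rate(cutoff, n_tests)
    print(f"{n_tests:2d} tests: unadjusted={unadjusted:.3f}, "
          f"cutoff={cutoff:.4f}, adjusted={adjusted:.3f}")
```

With ten comparisons at a cutoff of 0.05, the chance of at least one spurious "significant" finding is roughly 40%; lowering the cutoff to 0.005 brings it back under 5%.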
Bias and Confounding
Properly conducted observational studies require a clear understanding of the role of bias in the data and how it ought to be handled. While there are multiple described forms of bias that threaten the validity of a clinical research study^{8}, most fall into one of three categories: selection bias, information bias, and confounding^{5}. Selection bias is defined as a distortion of estimates that results from the way in which subjects are selected into the study sample. Selection bias may arise due to flaws in the choice of groups to be compared or loss to follow-up during data collection (censoring). For example, if the likelihood of censoring is influenced by the choice of treatment or implant a patient receives, any relative estimate of effect will be biased by analyzing only those who receive follow-up. Information bias results in distortion of estimates due to measurement error or misclassification of subjects as to treatment, outcome, or other variables. This type of bias may result when the outcome of a particular treatment is assessed by a nonblinded surgeon, and, similar to selection bias, it is a threat to validity no matter what study design is used.
Confounding is a particular problem in observational studies. Confounding represents a mixing of effects between the treatment of interest and associated extraneous factors that also impact outcome, potentially obscuring or distorting the relationship of interest (Fig. 3). Confounding arises when patients selected for one treatment group are fundamentally different from the other group with regard to the pretreatment likelihood of having the outcome of interest. In surgery, treatment decisions are commonly made on the grounds of certain overt and subtle factors related to prognosis or severity of disease. For example, if patients with less severe systemic injuries are able to undergo fixation of long-bone fractures at an earlier time than those with severe multisystem trauma, there may be factors other than timing of treatment that influence the lower rates of morbid complications among those treated early. This type of bias in studies of therapies has been termed confounding by indication or confounding by severity and is a major threat to the validity of conclusions drawn from observational studies.
In randomized controlled trials, the randomization process will, on average, evenly balance both the known and the unknown confounders, and this guarantees the validity of the statistical test used. The randomization process makes it possible to ascribe to the difference in outcome a probability distribution that is not influenced by differences in any prognostic factor other than the intervention under investigation^{9}. The chi-squared test for two-by-two tables and the Student t test for comparison of two means can be justified on the basis of randomization alone without making further assumptions concerning the distribution of baseline variables. In the absence of randomization, additional design and analysis methods are needed to account for sources of bias that arise from a lack of comparability between groups. Unfortunately, most analytical methods, several of which will be discussed in greater detail below, only account for known sources of bias and come with their own set of assumptions that can only be tested partially.
Analytical Techniques
Analytical methods for observational studies vary widely and are chosen according to the type of study that is being performed. Most case series require very basic descriptive statistics, such as probabilities or simple averages. Therapeutic and prognostic studies strive to give unconfounded estimates of association and therefore incorporate more elaborate techniques, each with relative strengths and weaknesses (Table I). Because therapeutic studies can be thought of as a special case of the prognostic study in which we are only interested in the effect associated with one risk factor (such as a specific treatment), the analytical methods used to control confounding are similar and will be presented together. It is important to remember that this distinction among the three observational study types is made to highlight key features and that elements of all three may be present in the same study.
TABLE I
Relative Strengths and Weaknesses of Methods of Analysis of Therapeutic and Prognostic Nonrandomized Studies

| Methods | Strengths | Weaknesses |
| --- | --- | --- |
| Matching | Simple | Limits sample size |
| | Efficient sampling method, especially in case-control studies | Unable to fully explore associations with matched factors |
| | | Potential for overmatching |
| Stratification | Simple | Difficult to interpret with multiple subgroups |
| | Easy to see effect modification | |
| Multivariable adjustment | Efficient simultaneous adjustment for multiple confounders | Quality of estimates subject to fit and assumptions of model |
| | Ability to easily assess effects of individual factors | |
| Propensity scores | Ability to directly see confounding through distribution of the propensity score | Potential remains for bias from unknown confounding |
| | Intuitive and simplified means of matching on single number | Possible to miss effect modification |
| | Rare outcomes | |
| | Confounding adjustment more robust to modeling assumptions | |
| Instrumental variables | Ability to get unconfounded estimates despite not having observed all possible confounders | Cannot test all instrumental variable assumptions |
| | | Inference restricted only to subjects whose treatment is impacted by the instrumental variable |
Analysis of a Case Series
Case series, while occupying the lower rung in the hierarchy of evidence (Level IV), can provide extremely useful information to care providers and patients if they satisfy certain criteria. First, the target population must be definable and the study sample must be representative. Second, the intervention must be reproducible so that a surgeon, with adequate training, can expect similar results if the procedure is faithfully replicated. Third, the outcomes that are measured should be clinically important. And finally, follow-up should be as complete as possible to limit loss of precision and to avoid selection bias. If these criteria are met, a simple descriptive statistic, such as risk (number of new cases per number at risk), rate (number of new events per unit of time), or mean (numerical average), along with confidence intervals (generated from a statistical model of a probability distribution such as the binomial for risk data or Poisson for rates)^{10}, can set an important benchmark for providers and be very helpful in providing information with regard to patient expectations.
A classic example of such a case series is the one reported by Letournel and Judet^{11}. That well-defined series, which consisted of 940 operatively treated displaced acetabular fractures and a follow-up period of more than thirty-three years, has come to represent the so-called gold standard in the treatment of fractures of the acetabulum^{11}. Letournel and Judet reported that, of the 567 hips that were operated on within twenty-one days, 73.7% were assessed as perfect reductions. Between three weeks and four months after injury, the probability of a perfect reduction among 150 hips decreased to 64.7%. Had 95% confidence intervals been reported, they would have looked something like 70.0% to 77.3% prior to three weeks and 56.5% to 72.3% after (both intervals calculated with use of the exact binomial method), and even more information about the potential uncertainty in making inferences about the larger theoretical population of all displaced acetabular fractures could have been conveyed. Although as many as 18% of patients were either lost to follow-up or had incomplete data, the results have been reproduced in case series reported by others^{12-14}, adding to a relatively coherent body of data regarding a relatively uncommon condition. As multivariable analysis becomes more familiar to orthopaedic investigators and readers, it is likely that case series (at least large ones) will increasingly both tell us about natural history and identify risk factors (such as age, fracture type, and length of follow-up) associated with outcome.
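An exact binomial (Clopper-Pearson) interval of the kind quoted above can be sketched with the standard library alone. The event counts used here (418 of 567 and 97 of 150) are assumptions back-calculated from the reported percentages, not figures taken from the original series, and the bisection solver is a hypothetical helper written for this illustration.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def solve_increasing(func, low, high, iterations=60):
    """Bisection for the root of a function that increases from negative to positive."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if func(mid) < 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2

def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for a proportion of k events among n subjects."""
    p_hat = k / n
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, p_hat)
    upper = 1.0 if k == n else solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p), p_hat, 1.0)
    return lower, upper

# Assumed counts, back-calculated from 73.7% of 567 and 64.7% of 150.
early = exact_binomial_ci(418, 567)
late = exact_binomial_ci(97, 150)
print(f"within 21 days: {early[0]:.1%} to {early[1]:.1%}")
print(f"3 weeks to 4 months: {late[0]:.1%} to {late[1]:.1%}")
```

Under these assumed counts the output should fall close to the intervals quoted in the text; small discrepancies would be expected if the true counts differ from the back-calculated ones.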
Analysis of Therapeutic and Prognostic Studies
Confounding threatens to bias estimates of associations of risk factors or treatment with outcome. The following set of methods includes matching, stratification, and multivariable regression and can be used to control for confounding for either type of study. Two more advanced multivariable methods that have been used increasingly in the analysis of therapeutic studies—propensity score analysis and instrumental variable analysis—are also discussed. Most of these methods rely on making the level of one or more factors constant in order to study the variability in outcome that is specifically associated with a change in the treatment or risk factor of interest. The interpretation of such analyses is therefore appropriately described as conditional, as an accurate interpretation is dependent on whether other known variables are held constant (with the exception of instrumental variable analysis). This is typically what is meant by “statistical adjustment” or “controlling” for confounding.
Matching
Matching is a conceptually straightforward strategy whereby confounders are identified and subjects in the treatment groups are matched on the basis of these factors so that, in the end, the treatment groups are “the same” with regard to these factors. Matching can either be done on a one-to-one basis (matched pairs) or on the basis of frequencies (that is, the confounder is present in an equal percentage of subjects in each group), and subjects can be matched with respect to a single confounder or multiple confounders. Matching can be used in both prospective and retrospective observational study designs (including case-control studies). For example, Ciminiello et al.^{15} examined the impact of small incisions (<5 cm) on a variety of outcomes, including blood loss, operative time, and postoperative complications, in patients undergoing primary total hip arthroplasty. To ensure that the group of patients who received a small incision was as homogeneous as possible with the comparator group of patients who received a standard-size incision, the authors used a matched-pair cohort design. They matched sixty patients in each group on a variety of potentially confounding factors, including age, sex, body mass index, American Society of Anesthesiologists score, diagnosis (osteoarthritis), prosthesis, type of fixation, anesthesia, surgical approach, and positioning, and were unable to identify any significant differences in outcome between the two techniques.
While matching is an effective way of balancing multiple confounders, it is also associated with several important limitations. One is that it may be difficult or impossible to find exact matches between the two groups of patients, and this difficulty increases rapidly as the number of factors to be matched increases. Matching may eliminate substantial numbers of subjects due to an inability to match all subjects, which results in a decreased sample size and power. One solution is to match patients within a reasonable range (for example, age ± five years), meaning that the range is such that differences of prognostic importance are not believed to exist. Another problem is that matching generally precludes the evaluation of the underlying relations between matching variable and exposure in a prospective cohort study and matching factor and outcomes in a case-control study. This is because of the sampling schemes (based on exposure for the prospective cohort study and outcome for the case-control study) and the way that balance is forced with respect to the matching factor in each of these designs. Finally, if matching is undertaken on variables that are not true confounders, a loss of statistical power can result; moreover, in a case-control study, such “overmatching” can create a new bias^{16}. Therefore, matching should be used cautiously and only on factors that are strongly associated with the outcome of interest and believed to be differentially distributed between treated and untreated subjects (that is, only on true confounders).
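Matching within a range can be sketched as a greedy one-to-one search. The subjects, field names, and the choice of exact matching on sex with a five-year age caliper are all hypothetical, and a greedy search is only one of several ways pairs could be formed; it is not presented as the procedure used in any cited study.

```python
def match_pairs(treated, controls, age_caliper=5):
    """Greedy 1:1 matching: same sex, age within +/- age_caliper years."""
    available = list(controls)
    pairs, unmatched = [], []
    for subject in treated:
        candidates = [c for c in available
                      if c["sex"] == subject["sex"]
                      and abs(c["age"] - subject["age"]) <= age_caliper]
        if not candidates:
            unmatched.append(subject["id"])  # dropped from the analysis
            continue
        best = min(candidates, key=lambda c: abs(c["age"] - subject["age"]))
        available.remove(best)  # each control is used at most once
        pairs.append((subject["id"], best["id"]))
    return pairs, unmatched

# Hypothetical subjects.
treated = [{"id": "T1", "age": 62, "sex": "F"},
           {"id": "T2", "age": 70, "sex": "M"},
           {"id": "T3", "age": 45, "sex": "F"}]
controls = [{"id": "C1", "age": 64, "sex": "F"},
            {"id": "C2", "age": 59, "sex": "F"},
            {"id": "C3", "age": 73, "sex": "M"},
            {"id": "C4", "age": 55, "sex": "M"}]

pairs, unmatched = match_pairs(treated, controls)
print("matched pairs:", pairs)
print("unmatched treated subjects:", unmatched)
```

In this toy data one treated subject finds no eligible control and is lost, which illustrates how matching reduces sample size as the matching criteria become more demanding.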
Stratification
Stratification is related to matching and provides another means by which to control confounding. Potentially confounding variables are identified, and the cohort is grouped by levels of this factor. The analysis is then performed on each subgroup within which the factor remains constant, thereby removing the confounding potential of that factor. Bosse et al.^{17} undertook a study to assess the impact of reamed intramedullary nailing compared with plate fixation of femoral shaft fractures on several adverse outcomes, including adult respiratory distress syndrome (ARDS) and multiple organ failure. Because the presence of chest injury could cause differences in surgical approach and impact the outcome of interest, it was considered a potential confounder and a stratified analysis was undertaken. Table II shows the crude or unadjusted analysis of the total cohort as well as the subgroup analysis stratified by the presence or absence of chest injury. The stratified analysis shows subgroup risk differences that, in this case, are not statistically different from one another (that is, no interaction); therefore, the adjusted summary risk difference^{18} can be reported as unconfounded by chest injury.
TABLE II
Stratified Analysis of Nail Fixation Compared with Plate Fixation and the Effect on Development of Adult Respiratory Distress Syndrome or Multiple Organ Failure^{17}

| | Chest Injury: Nail | Chest Injury: Plate | No Chest Injury: Nail | No Chest Injury: Plate | Total Cohort: Nail | Total Cohort: Plate |
| --- | --- | --- | --- | --- | --- | --- |
| Developed adult respiratory distress syndrome or multiple organ failure | 5 | 2 | 4 | 1 | 9 | 3 |
| Number at risk | 117 | 104 | 118 | 114 | 235 | 218 |
| Risk | 0.043 | 0.019 | 0.033 | 0.009 | 0.038 | 0.014 |

| | Chest Injury | No Chest Injury | Total Cohort |
| --- | --- | --- | --- |
| Risk difference (95% confidence interval) | 0.024 (−0.022, 0.069)^{*} | 0.025 (−0.012, 0.062)^{*} | 0.024 (−0.004, 0.054) |
| Summary risk difference (95% confidence interval) | 0.024 (−0.005, 0.053)^{†}; P value = 0.11^{‡} | | |

^{*} Risk differences between strata are not significantly different; that is, no interaction (test for heterogeneity; p value = 0.96).
^{†} Given the absence of interaction, a pooled summary validly estimates the risk difference, adjusting for chest injury.
^{‡} P value testing the null hypothesis of no association between treatment method and outcome, adjusting for chest injury and assuming no interaction.
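The stratum-specific calculations in Table II can be retraced with a short sketch. The counts are taken from the table; the Wald interval, the inverse-variance pooling, and the Cochran Q test for heterogeneity are assumptions made for illustration, and the published summary estimate may have been obtained with a different method^{18}, so small differences from the tabulated summary values should be expected.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def risk_difference(events_1, n_1, events_0, n_0):
    """Risk difference (group 1 minus group 0) and its Wald variance."""
    r1, r0 = events_1 / n_1, events_0 / n_0
    variance = r1 * (1 - r1) / n_1 + r0 * (1 - r0) / n_0
    return r1 - r0, variance

def wald_ci(estimate, variance):
    half_width = Z_95 * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Counts from Table II: (nail events, nail at risk, plate events, plate at risk)
strata = {"chest injury": (5, 117, 2, 104),
          "no chest injury": (4, 118, 1, 114)}
results = {name: risk_difference(*counts) for name, counts in strata.items()}

# Assumed pooling method: weight each stratum by the inverse of its variance.
weights = {name: 1 / variance for name, (_, variance) in results.items()}
total_weight = sum(weights.values())
pooled = sum(weights[name] * results[name][0] for name in results) / total_weight
pooled_ci = wald_ci(pooled, 1 / total_weight)

# Cochran's Q for heterogeneity; with two strata it has 1 degree of freedom.
q = sum(weights[name] * (results[name][0] - pooled) ** 2 for name in results)
p_heterogeneity = math.erfc(math.sqrt(q / 2))  # chi-squared tail, valid for 1 df only

for name, (estimate, variance) in results.items():
    low, high = wald_ci(estimate, variance)
    print(f"{name}: {estimate:.3f} ({low:.3f}, {high:.3f})")
print(f"pooled: {pooled:.3f} ({pooled_ci[0]:.3f}, {pooled_ci[1]:.3f})")
print(f"heterogeneity p value: {p_heterogeneity:.2f}")
```

Because the two stratum estimates are nearly identical, the heterogeneity test gives a large p value and the pooled estimate sits between them, which is the situation in which a single adjusted summary is reasonable to report.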
Just as stratification allows for control over a confounding factor, it also facilitates investigation into whether the effect of interest is constant across levels of the factor by which stratification is undertaken. If the estimates among stratified groups are homogeneous, they can be averaged into a summary estimate that is unconfounded by the stratification variable, as mentioned above. Conversely, significant differences in effect (interaction) preclude averaging of treatment effects. In the study by Bosse et al.^{17}, if there were a difference in the effect of surgical treatment on complications depending on whether or not patients had a chest injury, the stratum risk differences would be reported separately for those with and without chest injury. Stratification is a useful strategy when there are only one or two risk factors or confounders, but it quickly becomes unmanageable and difficult to interpret when there are multiple confounders with multiple levels each. Although testing for interaction and estimating summary measures of effect based on stratification are often avoided in favor of reporting multivariable statistics, these details still provide important information that is often hidden in the reporting of the more sophisticated analyses that will be discussed below.
Multivariable Regression
The use of regression for the adjustment of multiple confounding factors is one of the most commonly used analytical techniques in therapeutic and prognostic studies. Regression analysis is based on modeling the mathematical relationships between two or more variables that give an approximate description of the observed data. Regression models should not be thought of as explanations of underlying mechanisms (that is, statistical models are not reality), but rather as simplifications that are compatible with the data and that provide us with some inference as to associations found in the data. These models are usually additive in that an observed dependent variable (such as the outcome of interest) can be explained by a model in which the effects of different influences or independent variables (including treatment of interest and other predictors of outcome or confounding factors) are added. Most analyses are based on the general linear model:
$$E[Y] = A + B_{1}X_{1} + B_{2}X_{2} + \dots + B_{p}X_{p}$$
where the expectation (or mean value) of Y, or a transformation of it such as the log odds in logistic regression, is an additive combination of an intercept (A) and (p) explanatory independent variables multiplied by their respective coefficients (B_{1} through B_{p}). Each coefficient represents an estimate of effect or risk depending on the type of model (e.g., mean difference for linear regression and log odds ratio for logistic regression). Multivariable analysis allows the association between dependent and independent variables to be estimated, while controlling for the influence of other independent variables.
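The effect of adding a confounder to the model can be illustrated with ordinary least squares on a tiny invented data set. Everything here is hypothetical: the eight subjects, the outcome generated exactly as Y = 1 + 2(treatment) + 3(severity), and the helper functions, which solve the normal equations directly instead of calling a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Coefficients [A, B_1, ...] from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical cohort of (treated, severe) pairs: severe patients are treated more often.
subjects = [(0, 0)] * 3 + [(1, 0)] + [(0, 1)] + [(1, 1)] * 3
outcome = [1 + 2 * treated + 3 * severe for treated, severe in subjects]

crude = least_squares([[1, t] for t, _ in subjects], outcome)
adjusted = least_squares([[1, t, s] for t, s in subjects], outcome)
print(f"crude treatment coefficient:    {crude[1]:.2f}")
print(f"adjusted treatment coefficient: {adjusted[1]:.2f}")
```

The crude coefficient overstates the treatment effect because it absorbs part of the effect of severity; once severity enters the model as X_{2}, the coefficient B_{1} recovers the value used to generate the data.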
The appropriate model depends on the type of data available, especially the type of outcome that is being assessed. Table III lists some commonly used models for typical outcome types. For example, Saleh et al.^{19} performed a case-control study to evaluate the predictors of surgical site infections (a binary outcome) complicating total knee and total hip replacement. These authors used multivariable logistic regression to control for several demographic, perioperative, and postoperative factors and found postoperative hematoma formation (p = 0.001) and persistent wound drainage (p = 0.01) to be the only significant associated risk factors. Ring et al.^{20} used multivariable linear regression to study the influence of various predictors on functional and quality-of-life outcomes, such as the Disabilities of the Arm, Shoulder and Hand (DASH) score (a continuous outcome) after capsulectomy for posttraumatic elbow stiffness. After adjusting for other factors, such as range of motion, they found significant associations of pain score (p < 0.001) and persistent ulnar nerve dysfunction (p < 0.01) with the DASH. Linear regression and logistic regression are two of the most commonly encountered strategies for multivariable adjustment.
TABLE III
Appropriate Multivariable Adjustment Models for Common Types of Outcomes

| Type of Outcome | Example | Model | Estimate of Effect |
| --- | --- | --- | --- |
| Binary | Prevalence of postoperative infection | Logistic regression | Odds ratio |
| Continuous | Range of motion or functional outcome score (e.g., SF-36) | Linear regression | Mean difference |
| Time-to-event | Time to reoperation following total hip arthroplasty | Cox proportional hazards | Hazard ratio |
| Rate | National rates of total joint replacement | Poisson regression | Rate ratio |
While fracture-healing or failure of implant fixation necessitating reoperation can be considered as binary outcomes (that is, yes or no) or continuous outcomes (that is, time until event happens), choosing to analyze those outcomes with one of the two aforementioned techniques can sacrifice information (element of time with logistic regression) or threaten basic assumptions of the model (time-to-event outcomes are notoriously skewed and thereby violate multivariable normality requirements of linear regression). Therefore, time-to-event outcomes ought to be analyzed with survival analysis techniques, and the most commonly used multivariable expansion of these methods is the Cox proportional hazards model^{21}. Bhandari et al.^{22} used this approach in a prognostic study of multiple risk factors for reoperation following initial operative management of fractures of the tibial shaft. After adjusting for over twenty possible variables, the researchers found that open fracture (p = 0.001), cortical continuity less than 50% (p < 0.001), and transverse fracture pattern (p < 0.001) predicted a relative increase in reoperation. The Cox model gives an estimate of effect that is analogous to the odds ratio from logistic regression (the hazard ratio) and can similarly be interpreted as a relative measure of risk of event associated with a unit change in a given predictor, holding other factors constant.
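How survival methods retain the element of time, and use subjects whose follow-up ends before an event, can be seen in a minimal Kaplan-Meier sketch. The Kaplan-Meier estimator is a standard univariable survival technique chosen here for illustration; it is not the Cox model itself, and the follow-up times and event indicators below are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated event-free probability)] at each distinct event time.

    times: follow-up duration for each subject
    events: 1 if the event (e.g., reoperation) occurred at that time, 0 if censored
    """
    survival, curve = 1.0, []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        failures = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - failures / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical months of follow-up; 0 marks subjects censored without reoperation.
months = [3, 6, 6, 9, 12]
reoperated = [1, 1, 0, 1, 0]

for month, probability in kaplan_meier(months, reoperated):
    print(f"month {month}: event-free probability {probability:.2f}")
```

Censored subjects contribute to the number at risk up to the time they leave observation, so their partial follow-up is used instead of being discarded or misclassified as event-free.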
Although the results of such multivariable analyses are commonly presented, the details of how the models were selected are not. Readers may be led to assume that the results are accurate when they may have been derived with use of inappropriate models. Regression models assume, for example, that there is no effect modification or difference of effect between different levels of a confounder, as discussed earlier in the stratification example. Unless an interaction term (usually a product of two predictor variables) is inserted to represent effect modification, the model will not account for such a relationship, thus causing the researchers to arrive at the false conclusion that the effect of treatment on outcome is constant across all levels of another predictor, such as age or sex. Assumptions are made (such as the assumption of multivariable normality for linear regression or the assumption that the relative contribution of each factor is constant over time for the Cox model) when a model is fit, and it is important that these assumptions are verified and that the overall fit of the model to the data is assessed. Model fit is determined in terms of the amount of variability in the data as explained by the model and according to how well the model predicts individual outcomes for a given observation. There are many “diagnostic” procedures for assessing the most valid and best-fitting regression model. These procedures should be conducted by a statistician or experienced data analyst and are beyond the scope of this review but are discussed elsewhere^{23,24}. A description of model fitting methods helps validate reported results.
Propensity Score Analysis
Propensity score analysis^{25} is an approach to controlling for confounding through the generation of a score that “summarizes” the confounding by multiple variables. This form of analysis is a two-stage approach in which, first, rather than modeling the outcome as a function of multiple risk factors, the probability of being treated is modeled, taking into account any possible confounding variables. This probability, usually generated by a logistic regression model, is the propensity score and ranges from 0 to 1. Once the propensity score is generated for each subject, it can be used to match subjects (usually within some narrow range) or to perform a stratified analysis on levels (such as deciles) of the propensity score, or it can be inserted into a multivariable regression along with the treatment variable for use in estimating the outcome.
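The first stage, modeling the probability of treatment, can be sketched with a single confounder. The cohort, the severity scale, and the plain gradient-ascent fitting routine are all hypothetical choices for illustration; in practice the model would include many covariates and be fitted with established statistical software.

```python
import math
import statistics

def fit_logistic(x, treated, learning_rate=0.5, iterations=10000):
    """Fit P(treated) = 1 / (1 + exp(-(a + b * x))) by gradient ascent
    on the mean log-likelihood (a simple stand-in for a statistics package)."""
    a = b = 0.0
    n = len(x)
    for _ in range(iterations):
        preds = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        grad_a = sum(t - p for t, p in zip(treated, preds)) / n
        grad_b = sum((t - p) * xi for t, p, xi in zip(treated, preds, x)) / n
        a += learning_rate * grad_a
        b += learning_rate * grad_b
    return a, b

# Hypothetical cohort: more severe injuries are more likely to receive the treatment.
severity = list(range(10))
treated = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Standardizing the covariate keeps the gradient steps well scaled.
mean, sd = statistics.fmean(severity), statistics.pstdev(severity)
standardized = [(s - mean) / sd for s in severity]
a, b = fit_logistic(standardized, treated)
scores = [1 / (1 + math.exp(-(a + b * z))) for z in standardized]

treated_scores = [p for p, t in zip(scores, treated) if t]
control_scores = [p for p, t in zip(scores, treated) if not t]
# Region of propensity scores that is represented in both groups.
overlap = (max(min(treated_scores), min(control_scores)),
           min(max(treated_scores), max(control_scores)))
print("propensity scores:", [round(p, 2) for p in scores])
print(f"region of common support: {overlap[0]:.2f} to {overlap[1]:.2f}")
```

Subjects whose scores fall outside the region shared by both groups have no comparable counterpart, which is why matching or stratifying on the score is usually restricted to that region.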
While orthopaedic investigators have been slower to apply propensity score analysis than researchers in the medical specialties (cardiology and cardiothoracic surgery, in particular)^{26}, there are some examples. McHenry et al.^{27} used propensity score analysis to control for confounding by indication for timing of treatment of surgically treated thoracic and lumbar spinal fractures in order to assess risk factors for respiratory failure following operative stabilization of these injuries (that is, this was a prognostic study in which propensity score methods were used to adjust for the strong likelihood of differential treatment-time assignment due to injury severity). Subjects were matched on the basis of the propensity score for treatment within forty-eight hours after injury. Logistic regression was then performed on the matched set, identifying age, Injury Severity Score, Glasgow Coma Scale score, the presence of blunt injury to the chest, and time until surgery of longer than two days as independent risk factors for respiratory failure. By matching on the propensity score, these investigators were able to limit bias due to an important surgeon-controlled risk factor in assessing the relative importance of multiple prognostic factors.
By estimating the treatment mechanism, propensity score analysis offers several insights into the data and also offers theoretical advantages over conventional techniques of multivariable adjustment. First, propensity scores indicate the degree to which the likelihood of treatment differs between two groups and allow the investigator or reader to assess whether the treatment groups are actually comparable (that is, the two groups should have fairly similar distributions of propensity scores to make the comparison tenable). Second, by matching or stratifying subjects on the basis of their likelihood of treatment, an understanding of how selection bias is countered becomes intuitive because comparisons are made only among those equally likely to have received treatment, as in a randomized controlled trial. Unfortunately, propensity score analysis is no more immune to threats caused by unknown and unmeasured confounders than the other methods already discussed^{25,28}. Also, two recent systematic reviews have not shown significant differences in estimates from studies in which side-by-side conventional multivariable and propensity score analyses were performed^{29,30}. While the use of propensity score analysis is growing quickly in many fields of medical research, guidelines for proper use of these methods have lagged^{31}, and orthopaedic investigators should remain cognizant of emerging methodological work as they adopt this analytic approach.
Instrumental Variables
The instrumental variable approach to bias and confounding has been used frequently by economists for decades but has only recently been implemented in health research^{32-34}. Health economists, who typically work with administrative data in which many confounders are potentially missing, commonly use the instrumental variable approach to examine questions about the quality and distribution of care. The theoretical advantage of using instrumental variable methodology in the analysis of observational therapeutic studies is that it offers the possibility of controlling for both known and unknown confounders and is therefore appealing when the threat of unobserved or unobservable confounders looms large. The idea is as follows: if a variable (the instrumental variable) can be identified that has the ability to cause variation in the treatment of interest but that has no impact whatsoever on outcome (other than through its direct influence on treatment), then it will be possible to estimate the magnitude of that induced variation and its effect on outcome. Figure 4 provides a schematic diagram of this prerequisite relationship for the identification of a useful and valid instrument. Instrumental variables can be thought of as achieving pseudorandomization, and a randomized controlled trial is a special case in which the random assignment (e.g., a fair coin toss) is the instrumental variable inducing variation in the treatment variable.
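The idea can be made concrete with the simplest instrumental variable estimator for a binary instrument, often called the Wald estimator: the difference in mean outcome between the two instrument groups divided by the difference in treatment rates between those groups. The sketch below uses simulated data in which a confounder is deliberately left unmeasured. The variable names, the coefficients, the continuous outcome, and the true effect of 1.0 are assumptions made for illustration and are not taken from any cited study.

```python
import random


def mean(values):
    return sum(values) / len(values)


def wald_iv_estimate(z, t, y):
    """Effect of the instrument on outcome divided by its effect on treatment."""
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    t1 = mean([ti for zi, ti in zip(z, t) if zi == 1])
    t0 = mean([ti for zi, ti in zip(z, t) if zi == 0])
    return (y1 - y0) / (t1 - t0)


random.seed(1)
TRUE_EFFECT = 1.0
z, t, y = [], [], []
for _ in range(20000):
    u = random.gauss(0, 1)                    # unmeasured confounder
    zi = 1 if random.random() < 0.5 else 0    # instrument, independent of u
    # Treatment depends on both the instrument and the unmeasured confounder.
    ti = 1 if u + 1.5 * zi - 0.75 + random.gauss(0, 1) > 0 else 0
    # Outcome depends on treatment and the confounder, never directly on z.
    yi = TRUE_EFFECT * ti + 2.0 * u + random.gauss(0, 1)
    z.append(zi)
    t.append(ti)
    y.append(yi)

naive = (mean([yi for ti, yi in zip(t, y) if ti == 1])
         - mean([yi for ti, yi in zip(t, y) if ti == 0]))
iv_estimate = wald_iv_estimate(z, t, y)
print("naive:", round(naive, 2), "instrumental variable:", round(iv_estimate, 2))
```

Because the simulated instrument is unrelated to the unmeasured confounder and affects the outcome only through treatment, the ratio should recover a value near the true effect, whereas the naive comparison of treated and untreated subjects absorbs the confounding. With real data neither property of the instrument can be verified in this way.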
Examples of the use of instrumental variable analysis are similarly rare in the orthopaedic literature. McGuire et al.^{35} used a large Medicare data set to examine the controversial topic of the impact of timing of fixation of hip fractures on mortality. Acknowledging that there are likely to be factors beyond those that are measured in such a large administrative data set, these investigators chose day-of-the-week grouping (Saturday through Monday compared with Tuesday through Friday) as an instrumental variable by which to pseudorandomize the cohort. The authors cited evidence showing that day of the week is a strong predictor of delay to operative treatment of hip fracture and assumed that the day of the week on which the hip is broken should have neither an independent influence on mortality nor an association with other confounders such as the presence of comorbidities. The instrumental variable analysis showed an increased risk of mortality (risk difference, 15%; p = 0.047) among patients undergoing surgery more than two days after admission. It is likely that this methodology will grow in popularity among researchers who are trying to draw unconfounded estimates of effects of similar healthcare decisions from large data sets for clinical practice and policymaking reasons.
While the thought that one can avoid the issue of unobserved (and therefore uncontrolled-for) confounding in observational studies is very appealing, there are certain important limitations in the use of these methods to establish causality. First, identifying an instrumental variable that meets the assumption of no association with outcome, independent of treatment, is difficult. Because this assumption is not directly testable, there must be general consensus that the instrumental variable is tenable. In comparing instrumental variables with standard multivariable adjustment or propensity score techniques, one is trading the assumption that was just mentioned for the assumption that there is no unmeasured confounding, which is also not directly testable. Another important consideration is that the effect that is measured applies only to those whose treatment was affected by the instrumental variable. In the study by McGuire et al.^{35}, a 15% increase in the risk of mortality associated with delay of surgery applies only to the patient whose treatment timing was influenced by the day of the week on which he or she was admitted. This so-called marginal patient^{36} in a cohort study is important to distinguish from patients in the entire study sample, to whom any average treatment effect can be inferred in a randomized controlled trial.
Interpretation and Reporting of Results
The reporting or interpretation of results from observational studies must be tempered with the limitations implicit both in the data and in the methods applied to the analysis of those data. Matching and stratification provide a means to limit confounding by another factor by holding its level constant in the analysis. Conventional multivariable adjustment offers the power to adjust for multiple confounders at the same time, advancing the pursuit of potential causal relationships. Still, multiple other criteria are required to establish causation^{37}. Multivariable adjustment cannot establish causation unless conditions such as appropriate temporal ordering of predictors and outcome are ensured and no confounders are left unaccounted for in the analysis. While propensity score analysis offers a more plausible accounting for the multivariable nature of confounding and for the balancing of confounding by indication, causal interpretation is still limited by the same requirements. And, as just discussed, causal interpretation from an instrumental variable analysis is contingent on general acceptance of the instrumental variable that is chosen. These limitations ought to be acknowledged in the reporting of results.
Other important limitations to the validity of observational (and randomized) studies include missing data and loss to follow-up or censoring. Missing data and censoring are a form of selection bias in that those with complete data or follow-up may differ systematically in their association with outcome from those without complete data or follow-up. In the most benign case, data missing at random should only lessen the precision or power of a study. However, when this is not the case, substantial biasing of estimates may result, and there is no completely valid solution to this dilemma. Numerous methods have been described to account for missing data, the most robust of which is multiple imputation^{38}. In dealing with the problem of patients lost to follow-up, sensitivity analysis (assigning all of those with incomplete follow-up to one or the other outcome) can at least put boundaries around the range of effect that may have been witnessed had complete follow-up been achieved.
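The sensitivity analysis for loss to follow-up described above amounts to simple arithmetic, sketched here for a risk difference between two groups. The function name and the counts in the example are hypothetical and are used only to show the calculation.

```python
def risk_difference_bounds(events_a, followed_a, lost_a,
                           events_b, followed_b, lost_b):
    """Bound the risk difference (group A minus group B) under the two
    extreme assignments of outcome to subjects lost to follow-up."""
    n_a = followed_a + lost_a
    n_b = followed_b + lost_b
    # Estimate using only subjects with complete follow-up.
    observed = events_a / followed_a - events_b / followed_b
    # Extreme 1: no lost subject in A had the event, every lost subject in B did.
    lower = events_a / n_a - (events_b + lost_b) / n_b
    # Extreme 2: every lost subject in A had the event, none in B did.
    upper = (events_a + lost_a) / n_a - events_b / n_b
    return lower, observed, upper


# Hypothetical counts: group A has 18 events among 90 followed, 10 lost;
# group B has 30 events among 85 followed, 15 lost.
lower, observed, upper = risk_difference_bounds(18, 90, 10, 30, 85, 15)
print(round(lower, 3), round(observed, 3), round(upper, 3))
# prints: -0.27 -0.153 -0.02
```

In this hypothetical example the complete-case estimate favors group A by about 15 percentage points, and the bounds show that the advantage could be as large as 27 points or as small as 2 points depending on what happened to the subjects who were lost.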
Finally, given the number of techniques described in this review, each with details that could not be thoroughly discussed here, it is vital for investigators to report completely on how the analyses were undertaken. From the choice of confounders to control for to the details of the statistical procedure used, enough information must be given that an independent analyst can reliably reproduce the reported results. While considerations such as model fit and model-assumption checking may not make it into a final paper, many journals will use this information initially in judging the quality of the manuscript and may then make the information available online or in an appendix.
Summary
Observational studies will continue to provide an important method for clinical investigation in orthopaedic surgery in settings in which randomized controlled trials are not feasible and when increased generalizability of findings is desired. Case series will continue to provide important information on natural history and the prevalence of certain diseases or outcomes and will serve to generate hypotheses for future research. Furthermore, their analysis with descriptive statistics is relatively simple. Prognostic studies will help investigators further understand the risk factors that are associated with certain outcomes, and therapeutic studies, especially when prospectively conducted, will provide the next-best level of evidence to randomized controlled trials in informing treatment decisions. Confounding bias represents a major obstacle to drawing valid conclusions from such studies, and current analytic approaches have been reviewed here for orthopaedic investigators and readers of the literature. An understanding of the respective strengths and weaknesses of the various analytic approaches is necessary for proper application and interpretation. While each approach has its limitations and assumptions, any one of these approaches can be used as a powerful tool in understanding observational data in clinical research.
References
1. Moore DS, McCabe GP. Introduction to the practice of statistics. 4th ed. New York: W.H. Freeman; 2003.
2. Altman DG. Practical statistics for medical research. London: Chapman and Hall/CRC; 1991.
3. Lehmann EL. Nonparametrics: statistical methods based on ranks. New York: McGraw-Hill; 1975.
4. Altman DG, Gore SM, Gardner MJ, Pocock SJ. Statistical guidelines for contributors to medical journals. In: Gardner MJ, Altman DG, editors. Statistics with confidence. London: British Medical Journal; 1989. p 83-100.
5. Rothman KJ, Greenland S, editors. Modern epidemiology. 2nd ed. Philadelphia: Lippincott Williams and Wilkins; 1998.
6. Hurwitz SR, Tornetta P 3rd, Wright JG. An AOA critical issue; how to read the literature to change your practice: an evidence-based medicine approach. J Bone Joint Surg Am. 2006;88:1873-9.
7. Bender R, Lange S. Adjusting for multiple testing—when and how? J Clin Epidemiol. 2001;54:343-9.
8. Sackett DL. Bias in analytic research. J Chronic Dis. 1979;32:51-63.
9. Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware JH. Randomized clinical trials. Perspectives on some recent ideas. N Engl J Med. 1976;295:74-80.
10. Selvin S. Practical biostatistical methods. Belmont: Duxbury Press; 1995.
11. Letournel E, Judet R. Fractures of the acetabulum. 2nd ed. New York: Springer; 1993.
12. Moed BR, Carr SE, Watson JT. Open reduction and internal fixation of posterior wall fractures of the acetabulum. Clin Orthop Relat Res. 2000;377:57-67.
13. Matta JM. Fractures of the acetabulum: accuracy of reduction and clinical results in patients managed operatively within three weeks after the injury. J Bone Joint Surg Am. 1996;78:1632-45.
14. Liebergall M, Mosheiff R, Low J, Goldvirt M, Matan Y, Segal D. Acetabular fractures. Clinical outcome of surgical treatment. Clin Orthop Relat Res. 1999;366:205-16.
15. Ciminiello M, Parvizi J, Sharkey PF, Eslampour A, Rothman RH. Total hip arthroplasty: is small incision better? J Arthroplasty. 2006;21:484-8.
16. Day NE, Byar DP, Green SB. Overadjustment in case-control studies. Am J Epidemiol. 1980;112:696-706.
17. Bosse MJ, MacKenzie EJ, Riemer BL, Brumback RJ, McCarthy ML, Burgess AR, Gens DR, Yasui Y. Adult respiratory distress syndrome, pneumonia, and mortality following thoracic injury and a femoral fracture treated either with intramedullary nailing with reaming or with a plate. A comparative study. J Bone Joint Surg Am. 1997;79:799-809.
18. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719-48.
19. Saleh K, Olson M, Resig S, Bershadsky B, Kuskowski M, Gioe T, Robinson H, Schmidt R, McElfresh E. Predictors of wound infection in hip and knee joint replacement: results from a 20 year surveillance program. J Orthop Res. 2002;20:506-15.
20. Ring D, Adey L, Zurakowski D, Jupiter JB. Elbow capsulectomy for posttraumatic elbow stiffness. J Hand Surg [Am]. 2006;31:1264-71.
21. Kleinbaum DG. Survival analysis: a self-learning text. Berlin: Springer; 1996.
22. Bhandari M, Tornetta P 3rd, Sprague S, Najibi S, Petrisor B, Griffith L, Guyatt GH. Predictors of reoperation following operative management of fractures of the tibial shaft. J Orthop Trauma. 2003;17:353-61.
23. Hosmer DW, Lemeshow S. Applied logistic regression. New York: John Wiley and Sons; 1989.
24. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and multivariable methods. 3rd ed. Belmont: Duxbury Press; 1998.
25. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41-55.
26. Stürmer T, Schneeweiss S, Rothman KJ, Avorn J, Glynn RJ. Performance of propensity score calibration—a simulation study. Am J Epidemiol. 2007;165:1110-8.
27. McHenry TP, Mirza SK, Wang J, Wade CE, O'Keefe GE, Dailey AT, Schreiber MA, Chapman JR. Risk factors for respiratory failure following operative stabilization of thoracic and lumbar spine fractures. J Bone Joint Surg Am. 2006;88:997-1005.
28. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79:516-24.
29. Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J Clin Epidemiol. 2006;59:437-47.
30. Shah BR, Laupacis A, Hux JE, Austin PC. Propensity score methods gave similar results to traditional regression modeling in observational studies: a systematic review. J Clin Epidemiol. 2005;58:550-9.
31. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles for modeling propensity scores in medical research: a systematic literature review. Pharmacoepidemiol Drug Saf. 2004;13:841-53.
32. Newhouse JP, McClellan M. Econometrics in outcomes research: the use of instrumental variables. Annu Rev Public Health. 1998;19:17-34.
33. McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA. 1994;272:859-66.
34. Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000;29:722-9.
35. McGuire KJ, Bernstein J, Polsky D, Silber JH. Delays until surgery after hip fracture increases mortality. Clin Orthop Relat Res. 2004;428:294-301.
36. Harris KM, Remler DK. Who is the marginal patient? Understanding instrumental variables estimates of treatment effects. Health Serv Res. 1998;33(5 Pt 1):1337-60.
37. Hill AB. The environment and disease: association or causation? Proc R Soc Med. 1965;58:295-300.
38. Rubin DB, Schenker N. Multiple imputation in health-care databases: an overview and some applications. Stat Med. 1991;10:585-98.