Background: Clinical studies assessing orthopaedic interventions often include data from two limbs or multiple joints within single individuals. Without appropriate design or statistical approaches to address within-individual correlations, this practice may contribute to false precision and possible bias in estimates of treatment effect. We conducted a systematic review of the orthopaedic literature to determine the frequency of inappropriate inclusion of nonindependent limb or joint observations in clinical studies.
Methods: We identified seven orthopaedic journals with high Science Citation Index impact factors and retrieved all clinical studies for 2003 for any intervention on any limb or joint.
Results: We identified 288 clinical studies, 143 of which involved two limbs or multiple joint observations from single individuals. These studies included nineteen randomized clinical trials (13%), fifty-eight two-group cohort studies (41%), and sixty-six one-group cohort studies (46%). Seventy-six (53%) of the 143 studies involved statistical comparisons between patient groups with use of tests of association, and an additional sixty studies (42%) presented estimates of proportions without statistical comparisons. Only sixteen of the seventy-six studies involving statistical comparisons involved the use of any technique or methodological approach to account for multiple, nonindependent observations. A median of approximately 13% of the patients in these studies contributed more than one observation. The median proportion of nonindependent observations to total observations (the unit of analysis) was approximately 23%.
Conclusions: Our findings suggest that a high proportion (42%) of clinical studies in high-impact-factor orthopaedic journals involve the inappropriate use of multiple observations from single individuals, potentially biasing results. Orthopaedic researchers should attend to this issue when reporting results.
1 Faculty of Health Sciences, The University of Western Ontario, Elborn College Room 1438, London, ON N6G 1H1, Canada. E-mail address: firstname.lastname@example.org
2 Michael G. DeGroote Centre for Learning and Discovery, Room 3308, Faculty of Health Sciences, McMaster University, 1200 Main Street West, Hamilton, ON L8N 3Z5, Canada
3 Clinical Epidemiology and Biostatistics, Faculty of Health Sciences, McMaster University, Hamilton Civic Hospital Research Centre, Henderson General Hospital, 711 Concession Street, Hamilton, ON L8V 1C3, Canada. E-mail address for R. Roberts: email@example.com
4 Clinical Epidemiology and Biostatistics, Hamilton Health Sciences Center, Room 2C12, McMaster University, 1200 Main Street West, Hamilton, ON L8N 3Z5, Canada. E-mail address for G. Guyatt: firstname.lastname@example.org
One of the assumptions of common statistical tests that are used to compare treatment groups, such as the t test, analysis of variance, and the chi-square test, is that each observation is independent of all other observations1. When observations are not independent, the dependency between them has the potential to increase or decrease variance within that group2,3. Thus, when two limbs or multiple joints from a single patient are counted as independent observations without proper methodological considerations within the study design or adjustment in the analysis to account for within-observation correlation, the precision of the estimate may be falsely improved and the potential for a biased estimate increases2-4.
To illustrate what we mean by dependency between observations, consider a patient who undergoes bilateral shoulder arthroplasty, with the shoulder replacements occurring within months of each other. No matter how the first shoulder responds to the intervention, the outcome for the second shoulder is likely to be affected, either positively or negatively, by the outcome for the first shoulder, creating an association between the outcomes for each shoulder. For instance, if the first operation results in a poor outcome, the patient may rely excessively on the second shoulder, and the extra wear and tear may result in a poor outcome in that shoulder as well. Conversely, if the result in the first shoulder is excellent, it may mean that the result in the second shoulder is more likely to be satisfactory. In the extreme, when all unfavorable outcomes in the first shoulder destine the second shoulder to an unfavorable outcome, and when all good outcomes in the first shoulder destine the second shoulder to a good outcome, there is only one observation per patient (and not two observations, as a naïve approach would credit). The result is not only to spuriously increase the precision of the results, which may heighten the probability of reaching significance when in truth no difference in effect exists, but also to increase the likelihood of a biased estimate. However, the degree of association between limbs or joints is difficult to predict, and the eventual magnitude of the increased precision or magnitude and direction of bias caused by the association between nonindependent observations is therefore uncertain.
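The loss of information described above can be sketched with the standard design-effect approximation for clustered data, in which the effective number of observations is the total number of observations divided by 1 + (m - 1) × ICC. This is an illustration rather than a method used in the review: the function names are hypothetical, every patient is assumed to contribute the same number of limbs (m), and the intraclass correlation (ICC) is assumed to be known and non-negative.

```python
def design_effect(limbs_per_patient: float, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (limbs_per_patient - 1.0) * icc


def effective_sample_size(n_patients: int, limbs_per_patient: int, icc: float) -> float:
    """Number of independent observations that the clustered data are worth."""
    total_observations = n_patients * limbs_per_patient
    return total_observations / design_effect(limbs_per_patient, icc)


if __name__ == "__main__":
    # 100 patients, each contributing two shoulders (hypothetical numbers)
    for icc in (0.0, 0.5, 1.0):
        print(icc, round(effective_sample_size(100, 2, icc), 1))
```

With an ICC of 1, the 200 shoulders carry the information of only 100 independent observations, which is the extreme case described in the text; with an ICC of 0, all 200 count in full.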
To assess the prevalence of the inclusion of nonindependent observations in studies evaluating orthopaedic interventions, we conducted a systematic review of all studies published in seven high-impact-factor orthopaedic journals in the year 2003. We hypothesized that a substantial proportion of studies that make use of nonindependent observations from single patients do not make any adjustment for within-observation correlation in the analysis, potentially biasing the reported results.
Materials and Methods
We consulted the ISI Journal Citation Reports for orthopaedic journals with high Science Citation Index impact factors that did not focus primarily on basic science or nonhuman studies, non-limb or non-joint interventions, or the axial skeleton. We manually searched all issues of The Journal of Bone and Joint Surgery (American Volume), The Journal of Bone and Joint Surgery (British Volume), Clinical Orthopaedics and Related Research, The American Journal of Sports Medicine, Journal of Shoulder and Elbow Surgery, The Journal of Arthroplasty, and Arthroscopy: The Journal of Arthroscopic and Related Surgery for the year 2003. All reports in these journals were written in the English language.
We retrieved all clinical studies involving groups of patients who underwent an intervention on any limb and/or joint (including the foot, ankle, leg, knee, thigh, hip, shoulder, arm, elbow, forearm, or wrist) that reported any outcome measure.
Data Abstraction and Validity Assessment
One author (T.C.H.) conducted the initial manual search and retrieved all applicable studies involving limb or joint interventions. Independently and successively, two authors (T.C.H. and D.B.) applied the eligibility criteria and extracted data from each retrieved study. In addition to recording the author name, study title, journal name, issue, and page numbers, the reviewers extracted the total number of patients, the type of intervention for each patient group, the number of patients in each group, the number of limbs or joints in each group, the total number of limbs or joints, the type of statistical analysis used, the presence of appropriate statistical or methodological control, and the finding of a significant result.
A full assessment of individual study quality was not necessary for the purposes of this review. Instead, assessment of the presence or absence of a suitable approach to compensate for the use of nonindependent observations from single patients was the sole measure of study validity that was assessed. Examples of appropriate statistical or methodological approaches included the adjustment of standard errors to account for clustering of data at the patient level, the analysis of patients undergoing a bilateral intervention as a separate group, or the analysis of data from only one included limb or joint.
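One of the approaches listed above, analyzing data from only one limb or joint per patient, can be sketched as follows. The record layout (a list of dictionaries with a `patient_id` key) and the function name are hypothetical, and the random choice of limb is one possible selection rule, not a procedure prescribed by this review.

```python
import random
from collections import defaultdict


def one_limb_per_patient(records, seed=0):
    """Keep one randomly chosen limb or joint record for each patient.

    `records` is assumed to be a list of dicts that each carry a
    'patient_id' key; all other keys are passed through untouched.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)
    # One record per patient, so the unit of analysis becomes the patient
    return [rng.choice(limbs) for limbs in by_patient.values()]


if __name__ == "__main__":
    limbs = [
        {"patient_id": 1, "side": "left", "failed": False},
        {"patient_id": 1, "side": "right", "failed": False},
        {"patient_id": 2, "side": "left", "failed": True},
    ]
    print(one_limb_per_patient(limbs))
```

The price of this simplicity is that data from the unselected limb are discarded; adjusting the standard errors for clustering at the patient level keeps all observations.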
We assigned each study to one of three predefined categories. The first category was termed “randomized clinical trials,” which encompassed studies in which patients were randomly allocated to treatment groups and were followed longitudinally for the purpose of assessing the effect of treatment on outcome. This group included both a between-subject design (in which patients are randomly assigned to receive one treatment or the other and comparisons are made between patients) and a within-subject design (in which one limb of a patient with bilateral disease is randomly assigned to one intervention and the contralateral limb is assigned to the other intervention and comparisons are made between the limbs of individual patients). The second category was termed “two-group comparisons” and encompassed both prospective and retrospective cohort studies involving multiple, nonrandomly allocated groups of patients for the purpose of assessing the effect of treatment on outcome. The third category, termed “single-group studies,” included prognostic studies and case series in which only one group of patients was included but in which comparisons were made between subgroups of patients determined after the outcome had been assessed. This review did not classify studies on the basis of intervention or outcome measures.
We used proportions to express the ratio of clinical studies that included two limbs or multiple joints from single patients to the total number of clinical studies overall and by the type of study design. We used logistic regression to determine whether there was a significant association between journal impact factor (independent variable) and the presence or absence of nonindependent observations in all reviewed articles (dependent variable). Furthermore, we calculated the proportion of each study's sample size that was made up of nonindependent observations by determining the number of patients who contributed multiple observations and dividing by the total number of patients in each study. To begin to illustrate the potential magnitude of the bias that is being introduced when calculating the estimate of treatment effect, we calculated the proportion of nonindependent observations (number of limbs or joints) to the total number of observations (limbs or joints), which is the unit of analysis being used when investigators are making statistical comparisons between groups.
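The two proportions described above can be sketched from a study's reported patient and limb counts. The function name is hypothetical, and the sketch assumes that no patient contributes more than two limbs or joints, so that the number of bilateral patients equals the number of limbs minus the number of patients; the source does not state this assumption, and it would not hold for studies of multiple joints per patient.

```python
def nonindependence_proportions(n_patients: int, n_limbs: int):
    """Return (proportion of patients contributing more than one limb,
    proportion of limbs that are nonindependent observations).

    Assumes every patient contributes either one or two limbs.
    """
    if not n_patients <= n_limbs <= 2 * n_patients:
        raise ValueError("expected between one and two limbs per patient")
    bilateral_patients = n_limbs - n_patients
    # Both limbs of a bilateral patient are counted as nonindependent
    nonindependent_limbs = 2 * bilateral_patients
    return bilateral_patients / n_patients, nonindependent_limbs / n_limbs


if __name__ == "__main__":
    by_patient, by_limb = nonindependence_proportions(100, 113)
    print(f"{by_patient:.0%} of patients, {by_limb:.0%} of observations")
```

For a hypothetical study of 100 patients and 113 limbs, thirteen patients are bilateral (13%) and twenty-six of the 113 limbs (about 23%) are nonindependent. The second proportion is always the larger of the two because each bilateral patient contributes two observations to the numerator.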
Results
The manual search identified 288 clinical studies evaluating an intervention on a limb or joint in any orthopaedic patient population, of which sixty-two (22%) were randomized clinical trials (five of which were within-subject designs), 123 (43%) were two-group comparisons, and 103 (36%) were single-group studies. Of these 288 studies, 145 (50%) did not include two limbs or multiple joint observations from single patients and were excluded from this review. Thus, 143 (50%) of the studies involved two limbs or multiple joint observations from single patients, including nineteen (31%) of the randomized clinical trials, fifty-eight (47%) of the two-group comparisons, and sixty-six (64%) of the single-group studies. In two of these 143 studies, the authors reported only the number of limbs and did not report the number of patients, preventing us from including these studies in our calculations. Included studies were equally distributed across the seven journals and, with the numbers available, the proportion of studies identified from each journal was not related to journal impact factor (p = 0.54) (Table I).
Of the 143 studies that involved two limbs or multiple joints from single patients, seventy-six (53%) used a statistical test to make comparisons between groups of patients and sixty (42%) evaluated results with use of proportions but without use of statistical comparisons. Of the seventy-six studies that involved the use of statistical tests to make comparisons between groups of patients, only sixteen (21%) involved some kind of analytic approach, statistical control, or methodological design to adjust for within-patient relationships between limbs. Of the sixty studies that did not involve the use of an appropriate statistical adjustment, fifty (83%) demonstrated at least one significant result.
The mean proportion of the sample size of studies consisting of patients who contributed more than one observation was 18% ± 17% (median, 13%). The mean proportion of nonindependent observations was 27% ± 21% (median, 23%).
Discussion
An association between two limbs or multiple joints within an individual will depend on biologic characteristics within the patient, lifestyle choices, and external influences, including preferential use of one limb or joint while the other recovers or rehabilitates. While biologic rationale suggests that data from multiple observations from two limbs or multiple joints from the same patient are likely to be moderately to highly correlated, resulting in overestimates or underestimates of the within-group variance2,3, empirical evidence would strengthen the case for concern. A recent study by Lie et al5 provided such information. Those authors performed a survival analysis of 47,355 patients (55,782 primary hip replacements) and found that, in patients undergoing bilateral primary hip replacement with more than two years between the procedures, there was an increased risk of revision of the first prosthesis compared with that in patients who underwent unilateral replacement only (relative risk = 1.25, p = 0.07) and that the second prosthesis shared a similar increased risk of revision (relative risk = 1.25, p = 0.032). Furthermore, there was evidence of a protective effect on the second prosthesis if both replacements were performed less than two years apart (relative risk = 0.80, p = 0.0026). Finally, they found a much greater risk for revision in the index hip if the contralateral hip had been revised (relative risk = 3.49, p = 0.0001).
There is a high probability that within-patient correlations contributed spurious precision and an unknown amount of bias in the reported statistical results of 120 (42%) of 288 studies identified in the present review. Without the raw data from each study, it is not possible to accurately quantify this bias, but, considering the finding that a median of 23% of observations across studies were not independent, the magnitude of bias has the potential to be substantial.
The extent of bias is also dependent on the distribution of patients who have bilateral involvement, where each limb was counted as an independent observation in the analysis. If such patients are equally and randomly distributed between treatment groups, the problems of bias arising from within-individual correlations are reduced, although the study remains at risk of spurious precision. If, however, one group contains a higher proportion of patients who contribute nonindependent observations (e.g., bilateral involvement), the effect of treatment could easily be magnified or minimized, depending on whether the treatment is effective and on the magnitude of the correlation between limbs within the sample.
For example, consider a study conducted on a population in which there is a highly positive correlation between outcomes for two limbs for a single individual, meaning that if one intervention fails, the other will also fail. Consider also that there is an unequal balance of patients with bilateral involvement between groups so that there are more patients with bilateral involvement in the control group. If the truth is that there is no difference in the effect of the intervention between the treatment and control groups, the risk of failure in the control group will appear to be greater if by chance the failures occur in patients with bilateral involvement. The reason is that, since the unit of analysis is the limb and not the patient, the patient who has surgery on both limbs will be effectively counted twice and the variance within the control group will be reduced. By reducing the variance, the chance of making a Type-I error, or proclaiming a significant difference in favor of the treatment group when one does not truly exist, is increased. This appearance of greater risk of failure in the control group is then not necessarily related to the intervention but can be attributed to the positive correlation between limbs.
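The inflation of Type-I error described in this example can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not taken from the review: the outcome is a binary failure with the same true risk in both groups (no treatment effect), the second limb of a bilateral patient copies the first limb's outcome with probability `rho` (which gives a within-patient correlation of `rho`), and the naive analysis is a two-proportion z-test that treats every limb as independent. All names and default values are hypothetical.

```python
import math
import random


def simulate_group(rng, n_patients, bilateral_fraction, p_fail, rho):
    """Return (failures, limbs) for one group, counting every limb."""
    failures = limbs = 0
    for _ in range(n_patients):
        first = rng.random() < p_fail
        failures += first
        limbs += 1
        if rng.random() < bilateral_fraction:
            # Second limb copies the first with probability rho,
            # otherwise it is an independent draw with the same risk
            second = first if rng.random() < rho else (rng.random() < p_fail)
            failures += second
            limbs += 1
    return failures, limbs


def naive_test_rejects(f1, n1, f2, n2, z_crit=1.96):
    """Two-proportion z-test that wrongly treats limbs as independent."""
    pooled = (f1 + f2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    return abs(f1 / n1 - f2 / n2) / se > z_crit


def false_positive_rate(rho, bilateral_control=1.0, bilateral_treated=1.0,
                        n_patients=50, p_fail=0.3, n_trials=2000, seed=1):
    """Share of simulated null trials in which the naive test is 'significant'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        f1, n1 = simulate_group(rng, n_patients, bilateral_control, p_fail, rho)
        f2, n2 = simulate_group(rng, n_patients, bilateral_treated, p_fail, rho)
        rejections += naive_test_rejects(f1, n1, f2, n2)
    return rejections / n_trials


if __name__ == "__main__":
    print("independent limbs:", false_positive_rate(rho=0.0))
    print("perfectly correlated limbs:", false_positive_rate(rho=1.0))
```

With independent limbs the rejection rate should stay near the nominal 5%, whereas with perfectly correlated limbs it should be clearly higher, because the naive standard error is too small. Setting `bilateral_control` and `bilateral_treated` to different values mimics the unequal balance of bilateral patients described above.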
In studies that do not statistically compare data from patient groups but rather present proportions of patients experiencing an event, as is the case in prognostic or risk-factor study designs, the external validity or generalizability of the results is limited to populations with similar proportions of patients with bilateral involvement. For example, in a study involving many patients with bilateral involvement, the frequency of events is systematically overestimated or underestimated compared with that in populations with different proportions of patients with bilateral involvement, and the magnitude of this discrepancy depends on the magnitude and direction of the correlation between limbs within an individual. Whereas patients receiving a unilateral intervention can only contribute one “success” or “failure,” correlated multiple observations from patients with bilateral involvement have the potential to result in duplicate outcomes, inflating or reducing event rates that would be expected in populations in which the proportion of patients with bilateral involvement is dissimilar to that in the study sample. Furthermore, if clinicians make prognostic estimates from the relative proportion of patients experiencing an event or outcome in a group of patients, the use of patients with bilateral involvement jeopardizes the ability to provide an accurate estimation of the likelihood of an event in individuals with and without bilateral disease and instead could spuriously increase the precision of the now-inaccurate estimate by providing more observations with less variability between observations.
To our knowledge, this is the first systematic investigation of the orthopaedic literature to determine the prevalence of inappropriate analysis involving multiple limbs or joints in individual patients. The rigorous literature search, the duplicate examination of retrieved studies, and the systematic design all reduce the likelihood of bias in our results and associated inferences6. The major limitation of the present study is that, without access to the original data, we could not provide accurate estimates of the magnitude of bias or increased precision of the included studies or about the frequency of spuriously positive studies with p values below the conventional threshold. Another limitation is the exclusion of journals with lower impact factors from our review, although there was no significant relationship between impact factor and adherence to methodologic standards. Finally, although we comprehensively examined articles published in seven journals during a single year, it is possible that if we had chosen to conduct our review using articles published during other years or across several years, we may have had different findings.
In conclusion, as a general recommendation, we suggest that before initiating any orthopaedic investigation involving an intervention on multiple limbs or joints in individual patients, investigators should consult with an experienced methodologist and biostatistician. The methodologist can help to plan ways to incorporate patients who have bilateral involvement into trials, and the biostatistician can help to formulate the plan of analysis and to determine the degree of within-patient correlation so that the estimate of variability within each group is closer to the true underlying variance and not an overestimation or underestimation that has been influenced by the direction and magnitude of the association between limbs or joints in individual patients. Other options include excluding the second limb or joint from the study, randomly choosing which limb or joint to include in the analysis, or analyzing bilateral patients as a distinct subgroup.
The authors did not receive grants or outside funding in support of their research for or preparation of this manuscript. They did not receive payments or other benefits or a commitment or agreement to provide such benefits from a commercial entity. No commercial entity paid or directed, or agreed to pay or direct, any benefits to any research fund, foundation, educational institution, or other charitable or nonprofit organization with which the authors are affiliated or associated.
Investigation performed at McMaster University, Hamilton, Ontario, Canada