Many introductory epidemiology textbooks describe effect measures obtained from observational research as susceptible to errors arising from chance, confounding and bias, including measurement error. ^{1} More advanced texts separate these sources of error into those deriving from random error, which is assessed by the effect measure’s precision, and those deriving from systematic error, which is assessed by the effect measure’s validity. ^{2–5} If the error about an effect estimate equals its difference from the truth, then the random error is that which approaches zero as the study size increases, and the systematic error is that which does not. A quantitative assessment of the systematic error for an effect estimate can be made by sensitivity analysis.

To improve the precision of an effect estimate, epidemiologists design their studies to gather as much information as possible, ^{6} apportion the information efficiently among the strata of variables that affect the outcome, ^{6} and undertake precision-optimizing analyses such as pooling ^{7} and regression. ^{8} Even with an efficient design and analysis, epidemiologists customarily present a quantitative assessment of the remaining random error about an effect estimate. Although there has been considerable ^{9–12} and continuing ^{13,14} debate about methods of describing random error, a consensus has emerged in favor of the frequentist confidence interval. ^{15}

To improve the validity of an effect estimate, epidemiologists design their studies to assure comparability of the exposed and unexposed groups, ^{16} reduce differential selection forces, ^{17,18} and control measurement error by obtaining accurate information or forcing the direction of its expected error to be predictable. ^{19,20} When the validity might be compromised by confounding after implementation of the design, epidemiologists employ analytic techniques such as stratification ^{7} or regression ^{8} to maintain the validity of the effect estimate. Analytic corrections for selection forces or measurement error are seldom seen. Quantitative assessments of the remaining systematic error surrounding an effect estimate are rare indeed.

Thus, the quantitative assessment of the error surrounding an effect estimate usually reflects only the residual random error. Much has been written, ^{10,12,15,21} and many examples proffered, ^{22,23} about the abuses made of these quantitative assessments of random error. The near complete absence of quantitative assessments of residual systematic error in published epidemiologic research has received much less attention. Several reasons likely explain this inattention.

First, existing custom does not expect a quantitative assessment of the systematic error for an effect estimate. For example, the Uniform Requirements for Manuscripts Submitted to Biomedical Journals instructs authors to “quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals),”^{24} which measure only residual random error. With no demand to drive development and no habit to breed familiarity, few methods are available to quantify the systematic error for an effect estimate, and few epidemiologists are comfortable with the implementation of existing methods.

Second, the established methods require presentations of systematic error that are lengthy, ^{25} and so are too unwieldy to incorporate into data summarization and inference. The quantitative assessments of random error require little additional space for presentation of an apparently rigorous measurement of residual random error.

Last, the automated analytic tools often used by epidemiologists provide quantitative assessments of residual random error surrounding effect estimates, ^{26} but contain no such automated method of assessing residual systematic error.

Sensitivity analysis has been recommended for all analyses of observational data sets ^{27} as a rational alternative to ignoring residual systematic error in epidemiologic research. Basic methods for sensitivity analysis have been explained, ^{25} and advanced techniques proposed, ^{28} but these have not yet overcome the aforementioned barriers. The method proposed here has the advantage of using existing and familiar software and presentation methods, thereby addressing the latter two barriers.

#### METHODS

##### Sample Data Set

To illustrate our method, we used a study of the effect of less-than-definitive primary therapy *vs* definitive therapy on breast cancer mortality. The investigation ^{29} and an initial sensitivity analysis ^{30} have been previously published. In this analysis, we show how to use the semi-automated method of sensitivity analysis to assess systematic errors attributable to selection bias, misclassification and confounding. We did not seek to assess all of the possible systematic errors at work in this data set, although a more comprehensive assessment with our method would be possible.

The study population included 494 female breast cancer patients diagnosed at 8 Rhode Island hospitals between July 1984 and February 1986. ^{31} We ascertained the vital status of subjects by matching their identifying variables to the National Death Index and the Social Security Administration database of active social security transactions. We assigned the breast cancer mortality outcome to subjects with a death certificate containing the International Classification of Disease, 9th revision code 174, for breast cancer as the underlying cause of death or as one of the contributing causes of death. We assigned the date of the last follow-up as the date of death recorded on the death certificate for known decedents (N = 105; 69 attributed to breast cancer) or as the date 5 years after diagnosis for subjects with no National Death Index match.

All treatments received during the first year following diagnosis were documented for each patient. We defined definitive primary therapy for women with local disease as receiving mastectomy or breast-conserving surgery plus radiation therapy. ^{32} We required the same primary therapy for women with regional disease, and also required chemotherapy or hormonal therapy. We classified women who did not receive minimum primary therapy as having had less-than-definitive primary therapy.

We adjusted for two confounders: (a) age at diagnosis, in categories of 45-64 years, 65-74 years and 75-90 years; and (b) stage of disease, abstracted from medical records and categorized as local or regional. We used Cox’s proportional hazards regression to estimate the effects of less-than-definitive care on the outcome, adjusted for the two confounders. Hazards were proportional only during the first 5 years, and so we conducted analyses over the first 5 years of follow-up.

##### Sources of Systematic Error

##### Selection Bias

The original investigators removed identifying variables from the data set after the enrollment project ^{31} was completed. We re-identified subjects for the follow-up study by matching unique patient characteristics to the cancer registry of the Hospital Association of Rhode Island. For each match, the Hospital Association reported to the present investigators the identifying variables necessary to complete the follow-up. The Hospital Association of Rhode Island re-identified 390 of the original 449 patients (87%) with local or regional disease. The probability of re-identification did not strongly depend on patients’ age, cancer stage, co-morbid disease status (defined as the number of co-morbid diseases) or receipt of definitive care (data not shown). The probability of re-identification did depend on the hospital of diagnosis, because two affiliated hospitals participated in the Hospital Association of Rhode Island Cancer Registry for only the latter part of the enrollment period. It may be that the effect of less-than-definitive therapy, compared with definitive therapy, on breast cancer mortality is different among the women who were not re-identified than among the women who were re-identified.

##### Misclassification of a Covariate

The tumors of women who did not receive an axillary dissection may have been incorrectly staged. These women might have had regional disease, but because their lymph nodes were staged clinically rather than pathologically, they may have been misclassified as having local disease. Clinical assessment might also misclassify local disease as regional disease. Stage determines the criteria for definitive therapy, and so an incorrect assessment of stage may lead to misclassification of both the stage of disease (a confounder) and the treatment (the exposure).

##### Unmeasured Confounding

When treatment is not assigned by randomization, the potential exists for unknown or unmeasured confounders to bias the effect estimate. In observational research comparing the efficacy of treatments on their targeted disease, confounding by indication is of primary concern. Women who received less-than-definitive therapy may have had indications for receipt of that therapy, and these indications may also be related to a higher risk of breast cancer mortality.

##### Sensitivity Analysis

Ideally, the data set we create would be identical to that which would have been observed had the three sources of systematic error been absent. No such ideal data set can be reconstructed with confidence, and so we created multiple reconstructions, each yielding an estimate of the relative hazard of less-than-definitive therapy, compared with definitive therapy, adjusted for one or all three sources of systematic error. The theoretical basis for multiple imputations to assess error has been previously described. ^{33} The method of reconstruction depended on the systematic error or combination of errors under consideration. Figure 1 displays a flowchart of the dataset reconstruction process. We used SAS ^{26} for all analyses and graphing. The SAS code and sample data set are available at http://www.bumc.bu.edu/epi/lash&finksensitivityanalysis.

##### Selection Bias

For the selection bias, ideally we would learn the breast cancer mortality status of the women who were not re-identified. Instead, we guessed the status of these women—informed by the outcomes of the women who were re-identified—and repeated these guesses multiple times. We modeled the log-odds of breast cancer mortality using the re-identified participants, for whom vital status at 5 years was known, as a function of age, stage, therapy, co-morbid disease and hospital group. We set hospital group equal to zero for the hospitals that participated in the cancer registry throughout the enrollment period and equal to one for the two hospitals that did not. The model yielded a vector of coefficients and its covariance matrix. In a single reconstruction, we set the parameter vector, except the therapy parameter, equal to its maximum likelihood estimate plus the product of a randomly selected vector of standard normal deviates and the square root of the covariance matrix. We used this parameter vector to calculate an estimate of risk of breast cancer mortality for each non-re-identified women (N = 59) over the 5 years of follow-up, and we conducted a Bernoulli trial using the risk to model whether she died of breast cancer. For example, a Bernoulli trial for a woman with a risk of 50% would be like a coin flip, with heads representing a breast cancer death and tails representing survival over the 5-year follow-up. We randomly assigned a survival time from the distribution of survival times observed in the re-identified women classified as breast cancer decedents. We assumed that women classified as non-breast cancer decedents survived the 5 years of follow-up. With non-re-identified women now assigned survival outcomes, we included all in the Cox’s proportional hazards regression (N = 449). The sensitivity analysis of this systematic error incorporated two sources of variability. First, in each reconstruction of the data set, we selected different parameter estimates for the women not re-identified, yielding different risks of death. Second, the results of the Bernoulli trials differ for non-re-identified women in each reconstruction, yielding a different subset of the population assigned the outcome.

##### Misclassification of a Covariate

For the bias due to misclassification of stage, ideally we would learn the pathologic lymph-node status of the women whose lymph nodes were staged clinically. Instead, we guessed the pathologic node status of these women, as informed by their clinical stage and literature reports of the sensitivity and specificity of clinical node staging, with pathologic staging as the gold standard. We created triangular probability density functions ^{26} to represent the sensitivity and specificity of clinical examination to assess axillary node status, with pathologic examination as the gold standard. We abstracted the published literature values of the sensitivities and specificities. ^{30} We parameterized the triangular probability density functions with the minimum and maximum values reported in the literature, and with the mode equal to the inverse variance-weighted average of all the reported values. Sensitivity ranged from 13% to 83%, with a mode of 46%. Specificity ranged from 41% to 98%, with a mode of 85%. We modeled these error rates as independent of a woman’s status on other variables that could affect the error rate (*eg*, death due to breast cancer), and as independent of errors in classification of other variables, although such dependencies could be easily incorporated.

To perform a single reconstruction, we selected a sensitivity and a specificity from the respective triangular probability density functions. From the sensitivity, we calculated the positive predictive value for women who were staged clinically, and from the specificity we calculated the negative predictive value. The predictive values are also a function of the true prevalence of node-positive disease in the population under study. ^{34} The prevalence of node-positive disease in this study among women who were staged pathologically equaled 41% (139 node-positive cases out of 342 cases who received an axillary dissection). To approximate that prevalence, we selected the prevalence of node-positive disease from a triangular probability distribution with minimum prevalence of 30%, maximum prevalence of 50% and mode of 41%. For each of the clinically staged women classified as having local disease, we conducted a Bernoulli trial using the negative predictive value to model whether she was correctly or incorrectly classified. Similarly, for each of the clinically staged women classified as having regional disease, we conducted a Bernoulli trial using the positive predictive value to model whether she was correctly or incorrectly classified. We reclassified the stage and primary therapy for the women selected as having been misclassified, and then subjected the reconstructed data set to Cox’s proportional hazards regression to estimate the effect of less-than-definitive care. This sensitivity analysis incorporated four sources of variability. For each reconstruction, different values of the prevalence of node-positive disease, and of the sensitivity and specificity of clinical staging were selected from the triangular probability distributions. In addition, the results of the Bernoulli trials differ in each reconstruction, yielding a different subset of the population with reclassified stage.

##### Unmeasured Confounding

For the bias due to confounding by an unknown or unmeasured confounder, ideally we would learn the value of the confounder for each woman. Instead, we created a dichotomous variable to represent confounding by indication in this data set and guessed its value for each woman, as informed by her exposure and outcome status. We chose a population prevalence from a uniform distribution with a lower limit of 30% and an upper limit of 40% for those who received definitive therapy and did not die of breast cancer, from limits of 45% and 55% for those who received less-than-definitive therapy and did not die of breast cancer and for those who received definitive therapy and died of breast cancer, and from limits of 60% and 70% for those who received less-than-definitive therapy and died of breast cancer. For each subject, the combination of disease and exposure determined the probability of having the unknown confounder. To perform a single reconstruction, we conducted a Bernoulli trial for all subjects, based on their probabilities, to assign whether they had the dichotomous confounder. We then subjected the reconstructed data set to Cox’s proportional hazards modeling, with the model now containing the unknown confounder. With each reconstruction, different population prevalences are selected, and the Bernoulli trials select a different subset of the population to have the unmeasured confounder.

##### Presentation of Sensitivity Analysis Results

Each data-set reconstruction yielded an estimate of the effect of less-than-definitive therapy compared with definitive therapy on breast cancer survival over 5 years of follow-up. As reconstructions accumulated, we plotted three cumulative probability functions against the log of the relative hazard to display the results. First, we plotted the conventional result, reflecting only random error, for purposes of comparison. The relative hazard where a horizontal line plotted at the 50th percentile intersected the cumulative probability function equaled the conventional result. The relative hazards where horizontal lines plotted at the 2.5 percentile and 97.5 percentile intersected the cumulative probability function equaled the limits of a conventional 95% confidence interval. Second, we plotted the cumulative probability function from the relative hazards accumulated from the iterations of the sensitivity analysis, reflecting only systematic error, and reported the relative hazards at the same intersections. Last, we plotted the latter cumulative probability function with a bootstrapped estimate of total error, and reported the relative hazards at the same intersections. We generated the bootstrap estimates by re-sampling the study population, with replacement from the base population after reconstruction, but before estimating the relative hazard.

We wished to assess whether sufficient reconstructions had accumulated to characterize the sensitivity of the analysis to the sources of error under scrutiny. Our primary objective was to characterize accurately the width of the plotted distributions, so we chose a convergence criterion that focused on the tails of the cumulative probability distributions. When the widths of the 90% confidence intervals about the 5th and 95^{th} percentiles of the log-hazard-ratio distribution were both less than 1/20 of the width of the range between the 5th and 95^{th} percentiles, we determined that the Monte Carlo simulation had converged.

#### RESULTS

The table displays the results of the analyses. The original analysis of the data set, with no assessment of systematic error, yielded a hazard ratio of 2.0 (95% confidence interval, 1.2-3.4) associating less-than-definitive primary therapy with breast cancer mortality over 5 years of follow-up.

The sensitivity analyses suggest that the original effect estimate was biased away from the null by the systematic errors. For the analysis of sensitivity to selection bias, the median relative hazard equaled 1.8 and the 95% simulation interval equaled 1.5 to 2.2, reflecting systematic error. The analysis of sensitivity to misclassification of stage and to unknown confounding yielded similar results.

The Table also presents the results of the sensitivity analysis incorporating all three sources of systematic error. The median relative hazard equaled 1.5 (95% simulation interval for systematic error, 1.2-1.9; 95% bootstrapped simulation error for systematic and random error, 0.8-2.8). The analysis required approximately 4,000 reconstructions to sufficiently characterize the sensitivity to all three sources of systematic error.

Figure 2 displays all of the evidence provided by the conventional analysis and the sensitivity analysis. The dotted/dashed line shows the cumulative probability function for the conventional result, centered on the log of 2.0 and intersecting horizontal lines at 2.5% and 97.5% at the limits of the 95% confidence interval. The dashed line shows the cumulative probability function for the sensitivity analysis, reflecting only systematic error. The solid line shows the cumulative probability function for the systematic error, incorporating random error as well using the bootstrap technique. Comparison of the systematic error (dashed line) with the conventional result (dotted/dashed line) shows that few of the reconstructions (1.4%) yielded estimates of effect greater than the original result. This comparison illustrates that the conventional confidence interval does not reflect systematic error. Comparison of the bootstrap result (solid line) with the conventional result (dotted/dashed line) illustrates the bias due to systematic error.

#### DISCUSSION

In the absence of prior information, a common interpretation of the original result might be as follows: “The point estimate of the relative hazard of less-than-definitive therapy, compared with definitive therapy, on breast cancer mortality is 2.0 with a 95% confidence interval of 1.2 to 3.4. The observed data are most likely when the true effect equals the point estimate, and the data would be unlikely under the null. The data are equally consistent with effects greater than 2.0 as with effects less than 2.0.” In contrast, the interpretation of the sensitivity analysis might begin with the same first sentence, but then add: “Under our model for systematic error, the observed data would be likely if the true hazard ratio was in the range 0.8 to 2.8, and most likely if that ratio were 1.5. These data are more likely if the true effect is less than 2 than if the effect is greater than 2.”

Our sensitivity analysis interpretation differs from the original in several important ways. First, our sensitivity-analysis interpretation recognizes the range of estimates of effect consistent with the data, whereas a conventional interpretation tends to focus on a single point estimate. Second, the conventional interpretation makes an explicit statement about the probability of observing the data under the null hypothesis, which is a customary, though not inherent, component of assessments of random error. Our sensitivity analysis interpretation makes no such statement, instead treating the null as one of the infinite number of hypotheses within the range considered. Third, the conventional interpretation suggests that the data are equally consistent with effects greater than or less than the point estimate. Our sensitivity analysis suggests that the data are more consistent with effects less than the point estimate than with effects greater than the point estimate.

An objection to our method might be its reliance on underlying assumptions about the nature of the systematic errors. We respond that investigators who make no quantitative assessment of systematic error assume that there is no systematic error. Quantitative assumptions about known sources of systematic error should be more informative, on average, than an assumption that they have no influence. Furthermore, common definitions of frequentist assessments of random error, such as the confidence interval, assume a valid statistical model and no bias in the effect estimate. ^{35} Observational epidemiologic research seldom satisfies the first assumption because exposure status is not assigned by randomization. ^{36} The second assumption cannot hold in the presence of systematic error. The assumptions underlying our method seem more tenable than those that underpin the confidence interval’s description of random error, which is nonetheless routinely presented.

Several advantages of our method weigh against the objection. It yields a graphical display of the study’s results, which can also be presented efficiently in a manner similar to the usual presentation of study results. In place of the point estimate of effect, one would report the median relative hazard of the bootstrapped sensitivity analysis. In place of a confidence interval to describe random error around the point estimate of effect, one would report two additional intervals: (a) the sensitivity analysis simulation interval, and (b) the bootstrapped sensitivity analysis interval.

Our method offers familiar contexts to epidemiologic methodologists, which is its second advantage. To begin, it requires counterfactual thinking, the foundation for much of causal inference in epidemiology. ^{16,37} In this case, one strives to create a data set that would have been observed had there been no systematic error. In addition, our method can incorporate data interchanges and influence analysis, two techniques recommended to address the aforementioned failure of observational research to satisfy the assumptions of frequentist statistics. ^{36} Last, one can envision our method as a means to reverse the distortion of information passing through the “episcope” recently described by Maclure and Schneewiess. ^{38}

Finally, our method allows investigators to quantitatively assess sources of systematic error, rather than focusing only on random error, as they seek to understand the uncertainty about an estimate of effect. This is its third advantage. Although any valid method of assessing important sources of systematic error would be better than ignoring it, our method provides both efficient presentation and sufficient information, a combination not achieved by existing alternatives. For example, many investigators discuss the limitations of their results in a Discussion section, but few provide any quantitative assessment of the impact of the limitations. In the absence of quantification, the confidence interval assessment of random error may substitute for the entire error assessment. In some cases, investigators quantitatively estimate bias attributable to one or several sources of systematic error. This assessment is akin to presenting alternative point estimates. In the absence of an interval, however, the focus remains on the point estimate and the confidence interval, which assesses only random error.

An extension of the preceding method of sensitivity analysis is to create plausible scenarios for systematic error and to calculate revised point estimates under each scenario. ^{25} Separate sources of systematic error usually receive separate analyses. Although this technique presents a range of values for consideration, the reader has no sense of the likelihood of each scenario and, therefore, no sense of which revised estimates of effect are most compatible with the data. Furthermore, this technique requires a substantial amount of text or table space to be fully presented.

A Monte Carlo technique recently proposed ^{28} addresses the shortcomings of the last alternative. It provides a comprehensive assessment of all the sources of systematic error, provides an understanding of the estimates of effect most compatible with the data, and its results can be economically presented. A disadvantage of the technique is its distance from the original data. It re-estimates the effect that would have been observed in the absence of the systematic error by probabilistically adjusting the original effect estimate for the bias of the systematic error. When the original data are available, as it is to original investigators, our method is more easily implemented and more likely preserves the interrelationships between sources of error. ^{38} The alternative treats sources of error as independent, or assigns correlations between errors that would be difficult to accurately quantify. The results of the method, with estimates of the bias distributions empirically derived from our results, did not replicate our sensitivity analysis results (data not shown).

Although our method validly describes systematic error, we recognize that additional development should be undertaken. Assessment of bias arising from differential misclassification should be readily implemented, as it requires only different probability distributions within strata of variables. Extensions to case-control designs may require specialized enhancements, particularly in matched studies. Finally, the method should be extended to incorporate Bayesian assessments, in which context the technique will likely fit more comfortably. Figure 3 illustrates with an example in which a prior distribution centered on the null is modified by the results of the sensitivity analysis to yield a posterior distribution that suggests an adverse effect of less-than-definitive therapy.

More importantly, we encourage epidemiologists to practice and publish some quantitative sensitivity analysis to assess systematic error in their research. Any method will be susceptible to valid criticism. In the absence of such assessments, however, systematic error will continue to receive short shrift in the selection of results worthy of publication and development of public health policy. Assessments of random error, and particularly statistical significance testing, will remain the focus of these objectives until epidemiologists regularly supply coherent and concise descriptions of the systematic error surrounding their estimates of effect.

#### ACKNOWLEDGMENTS

We thank Sander Greenland and Ken Rothman for their thoughtful reviews of drafts of the manuscript.