Phillips, Carl V.
The results of health science studies have profound effects on consumer and policy decisions, determining what drugs we can take, what exposures are regulated, and how diseases are treated. Given the multitude of exposures, outcomes, and target populations, it is inevitable that attention to almost every specific question is limited, leaving little chance to eliminate uncertainty about study results. Nevertheless, decisions must be made, based on whatever information is available.
Optimal decision making requires knowing not just point estimates and random error quantification for effect measures, but the total uncertainty. Reporting P-values or confidence intervals as summary measures of uncertainty in epidemiology implies that the errors they measure are the only source of uncertainty worth measuring or reporting. However, unknown levels of systematic error from measurement, uncorrected confounding, selection bias, and other biases increase the uncertainty, and may even dwarf the random sampling error. (The term “random sampling” includes random allocation in experimental designs. I avoid the term “random error” because, as noted below, some systematic errors introduce additional random error that is not captured in the usual summary measures of uncertainty.)
Section I of this paper argues the benefits of quantifying such error, Section II explores various approaches and their philosophical bases, and Sections III and IV sketch a method for quantifying the probabilities associated with uncertainty from systematic errors.
I. WHY QUANTIFY SYSTEMATIC UNCERTAINTY?
Even the highest quality studies have errors, and failure to quantify the resulting uncertainty simultaneously overstates the importance of findings and diminishes the contribution of health research to the public welfare. Quantifying systematic uncertainty allows more accurate (and honest) reporting of scientific findings and offers several practical benefits for improving the contributions of epidemiology.
Research results are typically presented as if policy can be made based only on statements of certainty. But decisions are best made based on quantified uncertainty and the resulting expected value calculations. 1–3 Health policy makers have long had the tools to take into account quantified uncertainty, but have seldom been given the necessary data.
Directing Further Research
Many epidemiologic studies conclude that further research is needed. But what further research and how much new information can we expect to gain from it? Quantified uncertainty can show what particular further research might be of value. 3,4
Improving Nonexperts’ Understanding of Research Results and Resulting Decisions
Lay readers tend to take study results as facts (unsurprisingly, since results are typically reported in the language of fact) and then interpret the reported uncertainty (from random sampling only) as a complete accounting of the uncertainty. People alter their behavior based on these “facts,” and become confused and cynical when old “facts” are replaced by contradictory new “facts.” Reporting uncertainty would show readers that they cannot make much sense of individual studies and that they need to either gain deeper understanding of the body of work or focus on expert syntheses. Policy makers—not immune to popular misunderstandings—would be less likely to overreact to new “facts” and (possibly more important) might avoid waiting for that magical rejection of the null before acting, even when the cumulative evidence suggests that action is warranted. 3,5
II. EPISTEMOLOGY OF ERROR QUANTIFICATION
When systematic errors are addressed in health research reports, it is almost always as subjective unquantified discussion. It is important to realize that “subjective” is not pejorative. Scientific inquiry, from instrument design to data analysis, is a series of subjective judgments, ideally based on the best available information. Even quantifying random sampling error, usually considered objective, depends on believing that the necessary mathematical assumptions are approximately true.
The problem is not subjectivity, but the lack of quantification. Decision makers focus on available numbers, often not looking at the accompanying prose. The most-read parts of health research reports—press releases, abstracts, tables—typically ignore systematic error, perhaps for lack of a parsimonious method for reporting it. Unquantified claims that a suspected error is small or in a particular direction force readers (who are largely unqualified to judge) either to blindly accept the assertion or to reject the findings entirely. A substantial literature in cognitive psychology and economics shows that experts are typically far too confident in their beliefs, 6,7 calling into question authors’ unquantified assertions.
Target-Adjustment Sensitivity Analysis
A simple method of error quantification is to calculate the magnitude of a single bias necessary for the corrected effect measure to be a certain value (typically the null). For example, for the odds ratio represented in Table 1 (discussed below) the researcher might report that the observed association could be a result of exposure measurement error (3% false-positives among cases and no other systematic errors) and then discuss whether this is plausible. Such sensitivity analyses sometimes appear in discussions of results or subsequent reanalyses.
I label this method target-adjustment sensitivity analysis (TASA) because it is based on reaching a particular adjusted level of effect measure. The question “Is it plausible that this error could be that large?” is epistemologically similar to frequentist hypothesis testing; that is, it is based on a similar philosophy of how we come to have knowledge. The implicit question in both is “Can we accept this potential explanation for the observed association, thus providing a noncausal explanation?”
TASA is easy to understand and calculate. Authors’ assertions that a particular bias does not matter can be backed up by a transparent presentation of what level of bias would matter (eg, instead of “It is not plausible that exposure measurement error explains the association,” the above calculation would make it necessary to say, among other things, “It is implausible that the false-positive rate for cases was 3 percentage points higher than for noncases.”).
But this information is necessarily limited. An error-rejection epistemology addresses only the plausibility of an alternative hypothesis, which helps little if decision makers are interested in plausible ranges for the effect measure (let alone associated probabilities), not just whether the null is plausible. For example, even when the null is plausible it is often still a good idea to take regulatory action. 3,5 In addition, it is difficult to assess the interaction of multiple errors. While we can calculate combinations of errors that together drive the corrected parameter to the null, we cannot test the plausibility of the infinite possible combinations. An additional rhetorical downside is that TASA creates the impression that assessing systematic error is about trying to make an observed effect go away rather than trying to improve measurement.
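The TASA calculation described above can be sketched in a few lines. This is a minimal illustration, not the paper's own computation: the 2×2 counts are illustrative values chosen only so that the crude OR is about 2.1 (they are not taken from Table 1), the function names are hypothetical, and the only error considered is false-positive exposure classification among cases.

```python
# Illustrative 2x2 counts (an assumption; NOT taken from Table 1).
EXPOSED_CASES, UNEXPOSED_CASES = 21, 362
EXPOSED_CONTROLS, UNEXPOSED_CONTROLS = 20, 730


def odds_ratio(a, b, c, d):
    """Odds ratio for exposed/unexposed cases (a, b) and controls (c, d)."""
    return (a * d) / (b * c)


def true_exposed(observed_exposed, group_size, sensitivity, specificity):
    """Invert observed = Se*A + (1 - Sp)*(N - A) to recover the true exposed count A."""
    return (observed_exposed - (1.0 - specificity) * group_size) / (
        sensitivity + specificity - 1.0
    )


def false_positive_rate_for_null(a, b, c, d):
    """Case false-positive rate at which the corrected OR equals 1.

    Assumes perfect sensitivity everywhere and perfect specificity in controls,
    so the null is reached when the corrected case exposure proportion equals
    the control exposure proportion.
    """
    p_cases = a / (a + b)
    p_controls = c / (c + d)
    return (p_cases - p_controls) / (1.0 - p_controls)


n_cases = EXPOSED_CASES + UNEXPOSED_CASES
crude_or = odds_ratio(
    EXPOSED_CASES, UNEXPOSED_CASES, EXPOSED_CONTROLS, UNEXPOSED_CONTROLS
)
fp_needed = false_positive_rate_for_null(
    EXPOSED_CASES, UNEXPOSED_CASES, EXPOSED_CONTROLS, UNEXPOSED_CONTROLS
)
corrected_cases = true_exposed(EXPOSED_CASES, n_cases, 1.0, 1.0 - fp_needed)
or_at_target = odds_ratio(
    corrected_cases, n_cases - corrected_cases, EXPOSED_CONTROLS, UNEXPOSED_CONTROLS
)

print(f"crude OR = {crude_or:.2f}")
print(f"case false-positive rate needed to reach the null = {fp_needed:.3f}")
print(f"corrected OR at that rate = {or_at_target:.2f}")
```

The output is the single bias level whose plausibility the reader is then asked to judge; the sketch says nothing about how likely that level is.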
Bias-Level Sensitivity Analysis
An alternative approach to error quantification, which I label bias-level sensitivity analysis (BLSA), specifies values of bias parameters and calculates the resulting adjusted effect measure. Some authors suggest reporting a table of such corrected effect-measure estimates for one or more sources of error. 8,9 The table allows the author or reader to judge which bias parameter levels seem plausible. Multiple sources of error can be combined in a straightforward manner (though each increases the dimensionality of the table, so it becomes unwieldy to have many; eg, five sources of error, with 10 values considered for each, result in a 5-dimensional, 100,000-cell table). Returning to the example based on Table 1, we can look at the corrected odds ratios (ORs) for various possible values of realized specificity, as shown in Table 2. More extensive examples appear elsewhere. 10
Although this method seems like a simple extension of TASA, it actually requires a fundamental epistemologic departure. Instead of being based on the judgment “Is this explanation for the association plausible?”, the BLSA calls for formation of beliefs about the probabilities of various states of the world. Unless the reader restricts herself to searching the table of corrected values for the null and doing TASA, she is forming an opinion about the probabilities of various bias levels.
Superficially the BLSA calculation seems objective—the adjusted measures are derived mathematically from each hypothetical set of bias levels—but the numbers are meaningful only when filtered through someone’s opinion of the probability of particular bias levels. Merely looking at the results without these subjective judgments is not informative, since systematic error of the right type and magnitude can result in any mathematically possible effect measure.
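A BLSA table of the kind discussed in this section might be generated as follows. This is a sketch under stated assumptions: the counts are illustrative (not Table 1), the specificity grids are arbitrary, and `corrected_or` is a hypothetical helper that applies the standard sensitivity/specificity correction to exposure counts in each group.

```python
from itertools import product

TABLE = (21, 362, 20, 730)  # illustrative counts (a, b, c, d); not from Table 1


def corrected_or(table, se_case=1.0, sp_case=1.0, se_ctrl=1.0, sp_ctrl=1.0):
    """OR after correcting exposure counts for the given bias levels.

    Returns None when the bias levels are incompatible with the data,
    ie, when a corrected cell count would fall outside its possible range.
    """
    a, b, c, d = table
    n_case, n_ctrl = a + b, c + d
    true_a = (a - (1 - sp_case) * n_case) / (se_case + sp_case - 1)
    true_c = (c - (1 - sp_ctrl) * n_ctrl) / (se_ctrl + sp_ctrl - 1)
    if not (0 < true_a < n_case and 0 < true_c < n_ctrl):
        return None
    return (true_a * (n_ctrl - true_c)) / ((n_case - true_a) * true_c)


# One-dimensional table: vary case specificity only.
CASE_SPECIFICITIES = (1.00, 0.99, 0.98, 0.97)
blsa_rows = [(sp, corrected_or(TABLE, sp_case=sp)) for sp in CASE_SPECIFICITIES]
for sp, value in blsa_rows:
    print(f"case specificity {sp:.2f} -> corrected OR {value:.2f}")

# Two-dimensional table: each added error source multiplies the number of cells.
CONTROL_SPECIFICITIES = (1.00, 0.99, 0.98)
blsa_grid = {
    (sp_case, sp_ctrl): corrected_or(TABLE, sp_case=sp_case, sp_ctrl=sp_ctrl)
    for sp_case, sp_ctrl in product(CASE_SPECIFICITIES, CONTROL_SPECIFICITIES)
}
```

The table by itself assigns no weight to any row; judging which rows are believable is the subjective step discussed above.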
Quantifying Probability Distributions of Bias Levels
Epidemiologic analysis is usually concerned with measurement, rather than just dichotomous consideration of a null hypothesis. Thus, it should not be problematic to express probabilities of certain values, such as particular levels of bias, rather than restricting ourselves to Popperian or Pearsonian statements about an observed phenomenon not being due to error. It is a fairly small epistemologic leap from the “probabilities of” thinking implicit in BLSA, to quantifying probability distributions across levels of error and reporting the distribution of resulting corrections. This moves beyond sensitivity analysis (which asks “What if we made a certain error?”), and makes quantified uncertainty part of the estimated result itself (recognizing that error is inevitable). The math is similar, but there is a critical distinction: The probability distribution of possible true values is not a supplement to a study result, of secondary importance; rather, it is the study result.
The balance of this paper sketches a method for quantifying systematic error and presenting the resulting uncertainty distribution of corrected parameter estimates. A complete assessment of uncertainty surrounding a study result would combine this with the uncertainty resulting from random sampling error. This latter step is beyond the scope of the current paper. In addition to adding another computational step, it requires consideration of what random error means in nonrandomized studies and of Bayesian versus frequentist philosophies of statistics. (In earlier versions of this analysis, we used Bayesian updating to incorporate random sampling error, 11,12 and in related work I used artificially constructed prior beliefs to demonstrate practical implications of these methods. 3 Greenland 13 has more formally addressed the relationship of Bayesian analysis and Monte Carlo-based sensitivity analysis. Recent papers have incorporated Bayesian analysis into related methods, 14 including one published in this issue of Epidemiology. 15 Other approaches have used some variation on frequentist confidence intervals. 16,17 For results where the random sampling error is small compared with other sources of uncertainty, it can simply be omitted from the calculations. 18) Setting aside consideration of stochastic error, this paper presents the approach in a way that is easy to carry out, compatible with different notions of stochastic processes, and grounded entirely in bias-correction calculations.
III. FROM CAUSAL EFFECT TO DATA AND BACK AGAIN
In cumulating the uncertainty resulting from multiple sources of error, it is useful to first consider how a causal relation transforms into data. This is illustrated in Figure 1 (which we have presented previously 11,12 and has proven to be an effective teaching tool in itself; Maclure 19 offers an interesting alternative conceptualization). Moving from left to right illustrates the accrual of some of the systematic and random errors in an observational study. The first box represents the true causal relation we want to estimate. This can be thought of as the counterfactual (unobservable) experiences of the target population under two different exposure scenarios. 20,21 For clarity, the discussion is restricted to a binary exposure and binary outcome, but the implications are generalizable.
The second box represents the theoretically-observable real-world population of interest, which involves confounding of the true causal relationship. For the 2×2 example, confounding simply means that the true values for the entire real-world population of interest have cell counts that do not reflect the true causal relation. As with other sources of error, attempts to correct for confounding do not eliminate the need to quantify the uncertainty from unknown levels of uncorrected or overcorrected confounding.
The third box represents the sampling from this actual population. This introduces random sampling error and systematic selection biases. The latter include biases in recruitment, participation, and loss to follow-up. Finally, the true values for the studied subpopulation will be imperfectly measured (exposure and disease misclassification) in generating the final data in the fourth box. Additional steps (eg, model specification, extrapolation to other target populations) would generate further uncertainty.
Observing how the errors accumulate shows how to nest multiple corrections to recreate the true value from the data, applying well-known correction equations to go from right to left in Figure 1. If the ordering in the diagram correctly conceptualizes the order in which errors enter a particular study, then it also describes the order in which corrections should be made. (The order in which errors are dealt with, which will sometimes make a substantial difference in results, is typically based on computational convenience, without consideration of the proper order. For example, when the implications of possible measurement error or selection bias are discussed in a study, the discussion is usually based on results that are already adjusted for confounders. For further discussion, see Greenland. 8, pp. 356–357)
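As one instance of the correction equations referred to above, the standard correction for exposure misclassification within a single group can be written as follows. The symbols are assumptions introduced here for illustration and are not notation from the paper; the second display only restates the right-to-left nesting of Figure 1 as a composition of correction steps.

```latex
% Exposure misclassification within one group (cases or noncases).
% a^*: observed exposed count, N: group size, A: true exposed count,
% Se, Sp: sensitivity and specificity of exposure classification.
\begin{align}
  a^{*} &= \mathrm{Se}\,A + (1-\mathrm{Sp})\,(N - A), \\
  A &= \frac{a^{*} - (1-\mathrm{Sp})\,N}{\mathrm{Se} + \mathrm{Sp} - 1},
  \qquad \mathrm{Se} + \mathrm{Sp} > 1 .
\end{align}
% Nested corrections, applied right to left in Figure 1
% (g_M: measurement, g_S: selection, g_C: confounding; labels are assumptions).
\begin{equation}
  \widehat{\mathrm{OR}}_{\text{corrected}}
  = \mathrm{OR}\bigl( g_{C}( g_{S}( g_{M}(\text{data}) ) ) \bigr).
\end{equation}
```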
Correcting for a specific value for each bias parameter provides a single BLSA result, as presented above. Introducing distributions for each of the bias parameters yields associated probabilities for the corrected values. For example, instead of merely calculating the corrected parameter estimate if the disease specificity is exactly 0.95, we can calculate the probability distribution of corrected values if we believe the specificity is somewhere in the range [0.9, 1.0], with probabilities distributed as a triangular distribution with the mode at 0.95.
This calculation can provide information available from sensitivity analyses (using a single bias can facilitate TASA; single values for each error recreate BLSA). But it also provides something potentially much more important: the probability that the bias-corrected parameter is in some range of interest. This quantification is available for any decision-relevant range, such as levels that warrant regulatory intervention or comparisons to alternative therapies. 3 (These calculations would ultimately require inclusion of random sampling error, as previously discussed.)
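The step from a single bias value to a probability for a range of interest can be sketched for one bias parameter. The counts are illustrative (not Table 1), the threshold of 1.5 is an arbitrary stand-in for a decision-relevant level, and the triangular distribution for case exposure specificity on [0.98, 1.0] with mode 0.99 is a hypothetical input. The [0.9, 1.0] range from the example in the text is not used because, with these particular counts, an exposure specificity below about 0.945 would imply a negative corrected cell count.

```python
import random

TABLE = (21, 362, 20, 730)  # illustrative counts (a, b, c, d); not from Table 1
N_DRAWS = 20_000


def or_given_case_specificity(table, specificity):
    """Corrected OR when only case exposure specificity is imperfect (Se = 1)."""
    a, b, c, d = table
    n_cases = a + b
    true_a = (a - (1.0 - specificity) * n_cases) / specificity
    return (true_a * d) / ((n_cases - true_a) * c)


def probability_in_range(values, low, high):
    """Share of simulated corrected values falling in [low, high)."""
    return sum(low <= v < high for v in values) / len(values)


rng = random.Random(2003)  # fixed seed so the sketch is reproducible
draws = [
    # random.triangular takes (low, high, mode)
    or_given_case_specificity(TABLE, rng.triangular(0.98, 1.00, 0.99))
    for _ in range(N_DRAWS)
]

p_above = probability_in_range(draws, 1.5, float("inf"))
print(f"P(corrected OR >= 1.5) is approximately {p_above:.2f}")
```

Any other range can be queried from the same list of draws, which is what makes the distribution more useful than a single corrected value.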
Practicality of Input Distributions
Generating the necessary input distributions for levels of systematic errors requires labor-intensive expert judgment, review of the literature, validation studies, and other efforts. This challenge provokes objections to the proposed method: How can we know the distribution of measurement specificity and a dozen other bias measures? The model results can only be as good as the inputs. Because we do not typically know how big our errors might be (the argument might run) we cannot do this kind of analysis.
There are several responses to these concerns. First, we have no choice but to form subjective judgments about levels of bias. If we genuinely have no opinion about how large biases might be, then the true value of an effect measure could be anything, whatever the study results, and research has no value. Researchers demonstrate opinions about the distribution of bias levels every time they make claims about the direction and potential magnitude of biases, and with the mere act of claiming that their results reflect reality. The proposed method requires a more detailed assessment than might typically be made, but I believe that forcing researchers to carefully consider the magnitude of biases is a benefit rather than a cost. Reporting results without expressing any uncertainty about systematic errors makes the implicit claim that errors are zero; any studied estimate of bias level distribution is likely to be an improvement.
Second, it is often easier to form an opinion about the required inputs, such as levels of measurement error or exposure patterns of nonrespondents, than about the resulting bias and corrected parameter estimate. Third, when the effects of bias are discussed, transparent presentation of distributions is better than black-box assertions. Even if the input distribution was never used in a calculation, quantifying it would improve the understanding of authors and readers.
Finally, it is important to avoid the excuse that “We cannot do this perfectly, so we should not do it at all.” The distributions will necessarily be rough estimates, and we should not consider the results of these calculations precise (an ironic mistake, given the central point that more uncertainty exists than typically recognized). But using our best possible distributions will produce a result that is more useful for many purposes than anything we currently generate. Furthermore, only the use of such distributions will lead us to improve our ability to generate them.
IV. MONTE CARLO SIMULATION
Moving from right to left in Figure 1 to correct for a specific level of each error is easy. But nesting distributions for each error would be extremely difficult to do in closed form (analytically). Numerically simulating the resulting distribution of corrected effect measures offers a practical alternative. A single value is drawn from each distribution and the resulting corrected estimate is calculated. Iterating this process a large number of times and collecting the values into intervals produces a histogram that approximates a density function for the true causal relationship. Monte Carlo (random-number based) simulations are a common way of calculating aggregate uncertainty and have been used extensively for decades in engineering (particularly risk assessment), business and financial analysis, and similar fields. 1,22–24
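The draw, correct, iterate, and bin loop just described might look like the following in outline. Everything specific here is an assumption for illustration: the counts are not from Table 1, the four triangular input distributions are placeholders, and the correction is the deterministic sensitivity/specificity adjustment rather than whatever the simulation software computes internally.

```python
import math
import random
from collections import Counter

TABLE = (21, 362, 20, 730)  # illustrative counts (a, b, c, d); not from Table 1

# Hypothetical input distributions as (low, high, mode) for triangular draws.
BIAS_DISTRIBUTIONS = {
    "se_case": (0.90, 1.00, 0.95),
    "sp_case": (0.99, 1.00, 0.995),
    "se_ctrl": (0.90, 1.00, 0.95),
    "sp_ctrl": (0.99, 1.00, 0.995),
}


def draw_bias_levels(rng):
    """Draw one value from each input distribution."""
    return {
        name: rng.triangular(low, high, mode)
        for name, (low, high, mode) in BIAS_DISTRIBUTIONS.items()
    }


def corrected_or(table, se_case, sp_case, se_ctrl, sp_ctrl):
    """OR after correcting exposed counts in cases and controls."""
    a, b, c, d = table
    n_case, n_ctrl = a + b, c + d
    true_a = (a - (1 - sp_case) * n_case) / (se_case + sp_case - 1)
    true_c = (c - (1 - sp_ctrl) * n_ctrl) / (se_ctrl + sp_ctrl - 1)
    return (true_a * (n_ctrl - true_c)) / ((n_case - true_a) * true_c)


def histogram(values, bin_width):
    """Map each interval's left edge to the share of values falling in it."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {
        round(index * bin_width, 10): count / len(values)
        for index, count in sorted(counts.items())
    }


rng = random.Random(1)  # fixed seed so the sketch is reproducible
simulated = [corrected_or(TABLE, **draw_bias_levels(rng)) for _ in range(10_000)]
density = histogram(simulated, bin_width=0.1)

for left_edge, share in density.items():
    print(f"{left_edge:4.1f} {'#' * round(share * 200)}")
```

Further error sources would be nested by feeding the corrected counts from one step into the next correction before computing the OR.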
For an analysis (such as the present one) that avoids using prior beliefs about the effect measure, the input distributions should be formed without reference to the point estimate (eg, an estimate lower than expected should not change the distributions), though they could be informed by intermediate information such as the nonresponse rate. This means that the reported results must be interpreted as the distributions of bias-corrected values of the causal effect, based on a prior distribution of errors (formulated before viewing the data). This is not a complete accounting of uncertainty and does not make complete use of available information, but it is a useful partial accounting.
I illustrate this approach using results from one of the most influential recent epidemiologic papers, the report of the Hemorrhagic Stroke Project case-control study which linked the decongestant and diet aid, phenylpropanolamine (PPA), to hemorrhagic stroke, and led directly to that popular drug’s removal from the U.S. market. 25 (Note that this example is intended to illustrate the proposed method, not to be the best possible reanalysis of the study’s data.) The results included the doubling of stroke odds for 18–49-year-old women who used PPA for roughly 3.5 days before the stroke. Table 1, used in the previous examples, is based on that result. The unadjusted OR is 2.1; the original analysis included a correction for confounding (which cannot be replicated from published information), yielding an OR of 2.0 and a 95% confidence interval (CI) of 1.0–3.9.
I used a Monte Carlo simulation to quantify uncertainty from exposure measurement and selection bias. Error corrections were made by correcting population values (cell counts) for each box in Figure 1. The corrections for exposure measurement error move subjects between the exposed and unexposed cells in the 2×2 table to reflect the true value. Corrections for selection bias consist of estimating characteristics for the entire (and thus selection-bias-free) target population and putting them in the appropriate boxes. This approach to error correction has several advantages compared with multiplying by an adjustment factor. It is constructive and so generates the intermediate corrected simulated data needed for nesting further error corrections. Additionally, it maximizes flexibility in accepting different forms of inputs for error distributions. The simulation used the off-the-shelf software, Crystal Ball (Decisioneering, Denver, CO), with 250,000 iterations for each reported result.
In the original study, 25 exposure classification was based on recall by cases and matched noncases (contacted through random digit dialing) of their consumption of PPA-containing pharmaceuticals. Figure 2 shows a distribution of measurement-error-corrected ORs based on the following distributions: The probability distribution for each measurement error is triangular with the mode at the mean of the range. Case sensitivities are distributed over [0.9,1], noncase sensitivity over [0.7,1] (to reflect lower incentive to correctly recall), and both specificities over [0.99,1]. The distributions are uncorrelated for cases and noncases because of the very different circumstances of the recall.
A binomial process determines the number of misclassified individuals (eg, for a sensitivity of 0.92, the number of false negatives from 100 positives is not always 8, but is rather a binomial distribution with n = 100 and P = 0.08 because sensitivity is the probability that a given individual will be correctly classified). The misclassified are moved to the correct cell to simulate true values for the sample. Figure 2 shows that correcting the OR for these (fairly optimistic) distributions of errors results in a wide range of plausible values, with about 90% of the probability mass (shaded darker) falling between 1.6 and 3.0. (The scale is chosen for comparison with Fig. 3. The distribution is not unimodal because of the “lumpiness” created by the small number of exposed subjects.)
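The binomial step can be sketched as follows, using the sensitivity and specificity ranges stated above. The text does not spell out how the binomial draws are turned into corrected counts, so this sketch makes an assumption: it first estimates the true exposed count deterministically, then draws the numbers of false negatives and false positives binomially from that estimate and moves them between cells. The counts are illustrative (not Table 1), the helper names are hypothetical, and the output is not an attempt to reproduce Figure 2.

```python
import random
import statistics

TABLE = (21, 362, 20, 730)  # illustrative counts (a, b, c, d); not from Table 1


def binomial(rng, n, p):
    """Number of successes in n Bernoulli(p) trials (standard library only)."""
    return sum(rng.random() < p for _ in range(n))


def midpoint_triangular(rng, low, high):
    """Triangular draw with the mode at the midpoint of the range."""
    return rng.triangular(low, high, (low + high) / 2)


def simulate_true_exposed(rng, observed_exposed, group_size, se, sp):
    """Move binomially drawn misclassified subjects back to their correct cell."""
    expected_true = round((observed_exposed - (1 - sp) * group_size) / (se + sp - 1))
    expected_true = min(max(expected_true, 0), group_size)
    false_negatives = binomial(rng, expected_true, 1 - se)
    false_positives = binomial(rng, group_size - expected_true, 1 - sp)
    return observed_exposed + false_negatives - false_positives


def one_iteration(rng, table):
    """One simulated measurement-error-corrected OR, or None if a cell is empty."""
    a, b, c, d = table
    n_case, n_ctrl = a + b, c + d
    true_a = simulate_true_exposed(
        rng, a, n_case,
        midpoint_triangular(rng, 0.90, 1.0), midpoint_triangular(rng, 0.99, 1.0),
    )
    true_c = simulate_true_exposed(
        rng, c, n_ctrl,
        midpoint_triangular(rng, 0.70, 1.0), midpoint_triangular(rng, 0.99, 1.0),
    )
    if not (0 < true_a < n_case and 0 < true_c < n_ctrl):
        return None
    return (true_a * (n_ctrl - true_c)) / ((n_case - true_a) * true_c)


rng = random.Random(25)  # fixed seed so the sketch is reproducible
results = [r for r in (one_iteration(rng, TABLE) for _ in range(2_000)) if r is not None]
cuts = statistics.quantiles(results, n=20)  # 19 cut points: 5%, 10%, ..., 95%
low_5, high_95 = cuts[0], cuts[-1]
print(f"about 90% of simulated corrected ORs fall between {low_5:.2f} and {high_95:.2f}")
```

Because the corrected cell counts are whole numbers, the simulated ORs take only a limited set of values, which is one way to see the lumpiness noted above.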
Figure 3 represents the distribution of corrected ORs after combining corrections for 3 likely sources of selection bias with the preceding correction for exposure measurement. Selection bias might result from the omission of approximately 310 victims who died or were otherwise unable to communicate. Because the outcome of interest is all stroke, not mild stroke, these individuals are part of the target population. It is possible that their exposure pattern differed systematically from the rate for the included cases, so their rate of exposure was drawn from a normal distribution with a mean of the (measurement-error-corrected) rate for included cases, with a standard deviation of 0.2 of that value.
Individuals in the target population who experienced very mild strokes may not have been identified, possibly introducing bias because publicity about the PPA-stroke link might increase the chance that exposed individuals with mild symptoms would be diagnosed. This is reflected by hypothesizing that there were 100 such cases, and that their exposure rates were drawn from a triangular distribution, with a minimum of zero, a mode of the corrected rate for the controls (the population-average exposure rate) and a maximum of the corrected rate for included cases. Both groups of missed target cases are added to the observed cases to get simulated true values for the population of interest.
Based on the number of reported strokes and the average rate of stroke for young women, 26 the target (catchment) population is about 2 million, so the control selection proportion is about 0.03%. Selection bias is introduced by limitations of random digit dialing and the possibility of less-frequent (or more-frequent) phone answering by women currently using diet pills or cold medicine (because of lifestyle or illness). This is modeled by positing that exposed controls were differentially selected, with a rate that is normally distributed, with a mean of 0.8 of the average rate for controls and a standard deviation of 0.2 of that. This results in a simulated target population of noncases and produces an adjusted OR. As expected, these sources of uncertainty widen the distribution compared with Fig. 2, with about 90% of the probability mass in Fig. 3 falling between 0.8 and 2.6.
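The three selection-bias corrections described above can be sketched as a single simulation step. Several things here are assumptions: the starting counts are illustrative and stand in for measurement-error-corrected values from the previous step; expected (fractional) counts are used for the missed cases rather than a further binomial draw; the standard deviation for control selection is read as 0.2 × 0.8 of the average rate; and normal draws are clamped or redrawn to stay in a valid range. The result is not an attempt to reproduce Figure 3.

```python
import random
import statistics

# Stand-ins for measurement-corrected counts (illustrative; not from Table 1).
CASES_EXPOSED, CASES_UNEXPOSED = 21, 362
CONTROLS_EXPOSED, CONTROLS_UNEXPOSED = 20, 730

MISSED_SEVERE_CASES = 310              # died or unable to communicate (from the text)
MISSED_MILD_CASES = 100                # hypothesized undiagnosed mild strokes (from the text)
CONTROL_SELECTION_PROPORTION = 0.0003  # about 0.03% (from the text)


def clamp(value, low, high):
    return max(low, min(high, value))


def selection_corrected_or(rng):
    """One simulated OR for the full target population."""
    case_rate = CASES_EXPOSED / (CASES_EXPOSED + CASES_UNEXPOSED)
    control_rate = CONTROLS_EXPOSED / (CONTROLS_EXPOSED + CONTROLS_UNEXPOSED)

    # Missed severe cases: normal around the included-case rate, SD 0.2 of it.
    severe_rate = clamp(rng.gauss(case_rate, 0.2 * case_rate), 0.0, 1.0)
    # Missed mild cases: triangular(low=0, high=case rate, mode=control rate).
    mild_rate = rng.triangular(0.0, case_rate, control_rate)

    exposed_cases = (
        CASES_EXPOSED
        + severe_rate * MISSED_SEVERE_CASES
        + mild_rate * MISSED_MILD_CASES
    )
    total_cases = (
        CASES_EXPOSED + CASES_UNEXPOSED + MISSED_SEVERE_CASES + MISSED_MILD_CASES
    )
    unexposed_cases = total_cases - exposed_cases

    # Exposed controls selected at a fraction of the average selection rate.
    relative_selection = rng.gauss(0.8, 0.2 * 0.8)
    while relative_selection <= 0:
        relative_selection = rng.gauss(0.8, 0.2 * 0.8)

    # Scale sampled controls up to the target population of noncases.
    target_exposed = CONTROLS_EXPOSED / (
        CONTROL_SELECTION_PROPORTION * relative_selection
    )
    target_unexposed = CONTROLS_UNEXPOSED / CONTROL_SELECTION_PROPORTION
    return (exposed_cases * target_unexposed) / (unexposed_cases * target_exposed)


rng = random.Random(26)  # fixed seed so the sketch is reproducible
corrected = [selection_corrected_or(rng) for _ in range(10_000)]
crude_or = (CASES_EXPOSED * CONTROLS_UNEXPOSED) / (CASES_UNEXPOSED * CONTROLS_EXPOSED)
median_or = statistics.median(corrected)
print(f"crude OR {crude_or:.2f}, median selection-corrected OR {median_or:.2f}")
```

In a nested analysis the four starting counts would themselves be redrawn on every iteration from the measurement-error step, so that both sources of uncertainty accumulate in the final distribution.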
This process could be continued by adding the uncertainty generated by other uncertain levels of error, but this example is sufficient to demonstrate the quantification of the distribution of error-corrected parameter estimates. I leave analysis of the policy implications relating to PPA for discussion elsewhere, other than to observe that had a distribution of possible systematic uncertainty been calculated and reported, it could have improved the quality of the resulting public policy debate and decisions.
The process described here is not trivial. Improved and more-complete methods, user-friendly computation tools, and the skills to improve the generation of input distributions are needed, and will be developed only by using this type of analysis. But this method is feasible, and compared with the overall cost of a typical research project (let alone the impacts the research might have on policy and consumer decisions), the cost is small.
For many decisions, this type of quantification is what is needed, though it should be kept in mind that the usefulness of an answer depends on what question is being asked. For some purposes, TASA and attempts to reject an alternative hypothesis are useful. Sometimes we want corrected effect measures for particular bias levels. But only quantification of the probability distribution of true values can completely represent a study result. Optimal policy, clinical, and behavioral decisions that trade off the potentially high costs of exposures against the high costs of avoiding exposures often require such probabilistic uncertainty. Quantifying uncertainty distributions in the manner suggested here makes the level of uncertainty easy to understand and report, and provides probability values that can be used in many applications.
Quantifying uncertainty does not create uncertainty. It merely measures and reports the uncertainty that is always there. This is not a matter of making a tradeoff, of accurately reporting uncertainty at the expense of reducing the value of our findings. Quite the contrary, quantified uncertainty better describes what we know, and thus can facilitate better decisions, suggest improvements in our methods, and help direct new research to where it will provide the most benefit.
I acknowledge George Maldonado as a coauthor of previous versions of this paper and thank him for teaching me his insights about the importance of epidemiologic error and methods for quantifying it. I also thank Karen Goodman, Sander Greenland, Malcolm Maclure, Charlie Poole, Jay Kaufman, Irva Hertz-Picciotto, Corinne Aragaki, and students from several classes at Texas and Minnesota.
© 2003 Lippincott Williams & Wilkins, Inc.