The goal of comparative effectiveness research is to support optimum patient-centered care. At the same time, with health care costs a national priority, comparative effectiveness research is expected to help inform the decisions of patients, clinicians, and payers in a way that will reduce the rate of cost growth and ensure that the highest value is obtained for every health care dollar. Among the most important challenges facing the research community is how to make comparative effectiveness reports “fit for purpose”—how to frame and format the results to provide actionable guidance for decisions that can improve the quality and value of health care.
Exactly how comparative effectiveness reports are to provide guidance—while complying with the legislative fiat not to “mandate coverage, reimbursement, or other policies for any public or private payer”—has been the focus of considerable debate.1 In steering clear of explicit coverage recommendations, there is a risk that comparative effectiveness reports will summarize evidence and point out evidence “gaps” without offering enough tangible guidance for patients, clinicians, or payers.
This paper describes an approach that seeks to navigate a middle path by using a rating system that integrates evidence-based judgments of the comparative clinical effectiveness and the comparative value of a defined use of a medical intervention. We present the methodological framework for this “integrated evidence rating” system and early experience of its application to medical policies by public and private payers. We then discuss the implications of this experience for the translation of comparative effectiveness research findings in the future.
THE INSTITUTE FOR CLINICAL AND ECONOMIC REVIEW INTEGRATED EVIDENCE RATING
The rating system presented in this article was developed at the Institute for Clinical and Economic Review (ICER), a hospital-based academic research group providing independent evaluations of the comparative clinical effectiveness and comparative value of medical interventions. ICER reviews have 2 primary components: (1) a systematic review of the evidence on comparative clinical effectiveness; and (2) a de novo decision analytic model, with base case parameters largely derived from the findings of the systematic review. These findings are presented alongside those of the systematic review to provide further insights into clinical effectiveness and information on the cost effectiveness and economic impact of different approaches to adopting the intervention(s) being considered.
ICER reports use the evidence from the systematic review and decision analytic model to address the key questions that frame the comparative effectiveness appraisal. To translate the findings of the review into a format that can help all stakeholders apply them in practice and policy, ICER reports present 2 summary ratings: one rating of comparative clinical effectiveness and one of comparative value for the specific intervention(s) being reviewed. These 2 ratings are derived, discussed, and presented separately in the body of the report, but the method ICER has developed also combines them into an integrated evidence rating that seeks to convey a summary judgment that captures both important perspectives—effectiveness and value—in a transparent format that can be more easily understood and applied to decisions by patients, clinicians, and other stakeholders.
The ICER Integrated Evidence Rating is depicted in Figure 1. It is constructed as a matrix, with a vertical axis denoting the possible categories for a rating of comparative clinical effectiveness, and the horizontal axis divided into 3 possible rating categories for comparative value. It is important to note that these ratings are specified in ICER reports as comparing specific uses of medical interventions; that is, there may be different ratings for different uses of a test, treatment, or other intervention depending on the specified indication and patient population(s). In the sections that follow, we describe the approach ICER uses to translate its broader comparative effectiveness reviews into each of these 2 component ratings.
Level of Certainty in a Comparative Net Health Benefit
The underlying approach to ICER's rating of comparative clinical effectiveness mirrors that developed by the United States Preventive Services Task Force (USPSTF) in its most recent methods documents, and is dependent upon a joint judgment of the level of certainty provided by the body of evidence and a categorical judgment of the magnitude of the comparative net health benefit.2 To render this 2-part judgment both explicit and transparent, ICER uses a “Comparative Clinical Effectiveness Matrix” very similar to that used by the USPSTF. This matrix, depicted in Figure 2, was developed independently (although with some overlap in participants with the USPSTF effort) and pilot-tested specifically for comparative clinical effectiveness assessments by a multistakeholder evidence-based medicine roadmap group.3,4
The vertical axis of the comparative clinical effectiveness matrix rates the level of certainty that the evidence provides in the precision of the net health benefit. There are 3 categories: high, moderate, and low, the same categories used by the USPSTF. Table 1 presents the definitions ICER uses to distinguish among these levels of certainty, along with a description of the types of evidence that would generally qualify as providing high, moderate, and low certainty.
While the vertical axis represents a judgment of certainty, the horizontal axis of the Comparative Clinical Effectiveness Matrix displays gradients of the estimated net health benefit provided by a health intervention compared with the net health benefit of the selected comparator intervention. The categories for comparative net health benefit begin at the far left with “negative”; as the estimate of net health benefit increases, the rating moves to “comparable,” then to “small net benefit,” and culminates with a rating of “substantial” comparative net health benefit.
The term comparative “net” health benefit is used because of the importance attached to an explicit judgment of the overall balance of benefits and risks between an intervention and its selected comparator(s). The rating of net health benefit on the horizontal axis of the Comparative Clinical Effectiveness Matrix represents the best conceptual “point estimate” ICER can make given its interpretation of the existing evidence. As with the approach taken by the USPSTF, ICER has at this time no set definition of the boundaries between “comparable,” “small,” and “substantial” comparative net health benefit. For example, if the results of the appraisal include an estimate of a small lifetime QALY advantage for one intervention compared with another, balanced against known greater short-term risks, whether or not these findings should be judged as conferring a comparative net health benefit will depend on many features of the relative certainty of the benefits and harms, as well as value judgments of the importance to patients of small QALY gains over a lifetime. Despite the variability that will attend these judgments, presenting a categorical judgment of net health benefit serves an important goal: it enhances understanding of the underlying evidence by forcing the review team to justify its rating. The review team must describe, more concretely than it otherwise might, its view of how the disparate findings of a systematic review and decision model sum up. The review team's justification can be debated and disagreed with, but in all cases it will give decision makers clearer insight into the key issues they should consider when summing up the evidence and applying it to particular clinical actions or policies.
Summary Rating of Comparative Clinical Effectiveness
As shown in Figure 2, the Comparative Clinical Effectiveness Matrix maps the 3 categories of certainty upon the categories of comparative net health benefit to define a summary rating of comparative clinical effectiveness. Here, the relationship between level of certainty and magnitude of net health benefit comes into sharper relief. With a high level of certainty, the point estimate of net health benefit in one category is relatively assured, and therefore each cell in the matrix on the row of high certainty has a distinct label. A technology whose evidence base provides high certainty of a moderate-to-high net health benefit is rated to have “superior” comparative clinical effectiveness. As the net health benefit diminishes, the rating of comparative clinical effectiveness shifts to “incremental,” then “comparable,” and finally “inferior.”
When the level of certainty in the comparative net health benefit is only moderate, however, uncertainty about either benefits or harms is such that the precision of the net health benefit is significantly reduced. This lack of precision is akin to a broader “conceptual confidence interval,” and is illustrated in the matrix by the broader summary categories of Unproven with Potential (U/P) and Insufficient (I).
The U/P category is a particularly important element of the Comparative Clinical Effectiveness Matrix. This category is intended to indicate a judgment that the available evidence can only yield moderate certainty in the comparative net health benefit at the population level, but that the best estimate is that there is either a small or substantial net benefit. Moderate certainty implies that the point estimate of net health benefit is unlikely to shift more than one category in either direction; thus, a U/P rating implies a judgment that there is relatively high certainty that the comparative net health benefit is comparable or better, and a correspondingly relatively small possibility that future evidence would demonstrate that the true net comparative benefit of the intervention being assessed is inferior to its comparator.
The final summary category of comparative clinical effectiveness is the “I” category that sweeps from the moderate certainty of a point estimate of comparable or inferior net health benefit into the entire bottom row in which certainty in net health benefit is so low that there remains a reasonable probability that the true net health benefit is inferior; in other words, that the intervention being evaluated produces a net harm for many or most patients.
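The mapping described in this section, from the 2 axes of the matrix to a summary rating, can be written out as a short lookup. The following Python sketch is only an illustration of the prose above: Figure 2 is not reproduced here, so the cell assignments are read off the text, and the names `Certainty`, `NetBenefit`, and `summary_rating` are hypothetical rather than part of any ICER tool. In particular, mapping the “substantial” category to “superior” is our reading of the text.

```python
from enum import Enum


class Certainty(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"


class NetBenefit(Enum):
    NEGATIVE = "negative"
    COMPARABLE = "comparable"
    SMALL = "small"
    SUBSTANTIAL = "substantial"


# On the high-certainty row, each net benefit category has its own label.
HIGH_CERTAINTY_LABELS = {
    NetBenefit.SUBSTANTIAL: "superior",
    NetBenefit.SMALL: "incremental",
    NetBenefit.COMPARABLE: "comparable",
    NetBenefit.NEGATIVE: "inferior",
}


def summary_rating(certainty: Certainty, net_benefit: NetBenefit) -> str:
    """Return the summary rating of comparative clinical effectiveness."""
    if certainty is Certainty.HIGH:
        return HIGH_CERTAINTY_LABELS[net_benefit]
    # Moderate certainty with a best estimate of small or substantial benefit.
    if certainty is Certainty.MODERATE and net_benefit in (
        NetBenefit.SMALL,
        NetBenefit.SUBSTANTIAL,
    ):
        return "unproven with potential (U/P)"
    # Moderate certainty of comparable or negative benefit, and the whole
    # low-certainty row, fall into the Insufficient category.
    return "insufficient (I)"


print(summary_rating(Certainty.MODERATE, NetBenefit.SMALL))
```

The lookup makes one feature of the matrix easy to see: only the high-certainty row distinguishes all 4 categories of net health benefit, whereas lower certainty collapses several cells into the broader U/P and I categories.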
Rating Comparative Value
The rating of comparative clinical effectiveness can stand alone, to be discussed and applied by decision makers, but it also forms the first of the 2 parts of the ICER Integrated Evidence Rating. The second component is a rating of “comparative value.” ICER rates the use of interventions for particular patient populations as having “high,” “reasonable or comparable,” or “low” comparative value.
ICER does not employ a single measure of cost effectiveness, such as the incremental cost-effectiveness ratio, for assignment of a rating of comparative value, and therefore does not rely on a formal cost-effectiveness threshold. Instead, the rating of comparative value is informed by multiple measures of potential economic impact. To distill and communicate the key clinical and economic variables that constitute these measures related to comparative value, ICER has developed a format that we call the Comparative Value Evidence Tables (CVET).
The CVET has 3 component tables, each presented on a single page to facilitate its use in discussions among stakeholders. Space constraints do not allow presentation here of an example of each of these tables, but a fully rendered example can be seen in the ICER report on active surveillance and radical prostatectomy for localized prostate cancer.5 The first table of the CVET compiles the findings on key risks and benefits, and presents simple absolute differences in these outcomes between the 2 (or more) comparators. The second table presents comparative cost information. Costs are presented and absolute differences compared for the acquisition or performance cost of the intervention and comparator(s), of all major associated cost components of the care pathways (eg, doctor visits and hospitalizations), and of each of the key side effects or adverse outcomes associated with treatment. These figures are followed in the second table by cumulative costs for a short-term perspective (eg, 2 years) and for lifetime costs of each of the major categories of costs.
The third and final table in the CVET contains summary data on service impact, cost per life-year saved, and cost per quality-adjusted life year (QALY) according to a lifetime perspective, budget impact per 1000 patients over 2 years, and what are termed “fixed budget tradeoffs.” Fixed budget tradeoffs are examples of other services, such as nurse and physician full-time equivalent salaries, that could be paid for with the same amount of additional money that might be spent on the more expensive of the interventions being compared. An example of this final table of the CVET, with data from the ICER report on treatments for localized prostate cancer, is shown in Table 2.
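The arithmetic behind a budget impact figure and a fixed budget tradeoff is simple, and a small Python sketch may make it concrete. The sketch rests on assumptions: the function names are hypothetical, and the dollar amounts are invented for illustration and are not taken from Table 2 or from any ICER report.

```python
def budget_impact(incremental_cost_per_patient: float, n_patients: int = 1000) -> float:
    """Additional spending if n_patients receive the more expensive option."""
    return incremental_cost_per_patient * n_patients


def fixed_budget_tradeoff(budget: float, annual_salary: float) -> float:
    """Number of one-year full-time salaries the same budget could pay for."""
    if annual_salary <= 0:
        raise ValueError("annual_salary must be positive")
    return budget / annual_salary


# Hypothetical inputs: a 2-year incremental cost of $2,000 per patient
# and a nurse salary of $80,000 per year.
impact = budget_impact(2_000)
nurse_years = fixed_budget_tradeoff(impact, 80_000)
print(f"Budget impact per 1000 patients: ${impact:,.0f}")
print(f"Equivalent nurse salary-years: {nurse_years:.1f}")
```

Expressing the same sum as salaries or other services is a presentational choice: it restates a budget figure in terms that stakeholders may find easier to weigh against competing uses of the money.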
The overarching goal of the CVET is to help translate and augment the findings of a cost-effectiveness analysis into a modular format that can be understood more easily by stakeholders and serve as a tool to engage them in active discussion and application of the findings. Providing a format in which the individual components of cost effectiveness are highlighted is important for several reasons. First, sole reliance on summary measures of cost effectiveness such as cost per QALY may mask the magnitude of the differential QALY gain between 2 different interventions. For example, the cost per QALY may appear to be “reasonable” based on widely used incremental cost-effectiveness thresholds (eg, $50,000 per QALY gained), but this ratio may represent a very small marginal cost for a very small estimated QALY gain, or it might represent a larger cost for a very substantial clinical improvement. ICER's reports reflect the belief that these distinctions should be made transparent to all decision makers; and ICER's rating of comparative effectiveness highlights these findings rather than glossing over them in what might otherwise be an overemphasis on the summary measure of cost per QALY.
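To illustrate how a single summary ratio can mask very different underlying findings, the following Python sketch computes the incremental cost per QALY for 2 hypothetical comparisons. All numbers are invented for illustration, and the function name is an assumption, not part of any ICER model.

```python
def incremental_cost_per_qaly(delta_cost: float, delta_qaly: float) -> float:
    """Incremental cost divided by incremental QALYs gained."""
    if delta_qaly <= 0:
        raise ValueError("this sketch only handles a positive QALY gain")
    return delta_cost / delta_qaly


# Hypothetical scenarios: (incremental cost in dollars, incremental QALYs).
scenarios = {
    "small cost, small gain": (500.0, 0.01),
    "large cost, large gain": (50_000.0, 1.0),
}

for label, (delta_cost, delta_qaly) in scenarios.items():
    ratio = incremental_cost_per_qaly(delta_cost, delta_qaly)
    print(f"{label}: +${delta_cost:,.0f}, +{delta_qaly} QALY, ${ratio:,.0f} per QALY")
```

Both scenarios yield essentially the same ratio even though the clinical gain differs by a factor of 100, which is why the CVET presents the numerator and denominator alongside the ratio itself.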
A second reason the components are highlighted in the CVET is that they provide a broader scope of findings that may relate more directly to the perspective on value of different stakeholders. For example, whereas payers may be most interested in expressions of the clinical value achieved for the additional investment provided (eg, cost per QALY, cost per adverse event averted), integrated health systems may ascribe equal or more importance to measures of budgetary or system impact, and patients may be most interested in differential rates of downstream testing or other service use. All of the information in the CVET is described in greater detail in the body of the ICER report, along with additional information on the source for utility measures, but our experience has been that the CVET helps crystallize the key findings in a more accessible format.
To determine a final rating of “high,” “reasonable/comparable,” or “low” value, ICER considers all of the economic findings, including the relative uncertainty of model findings as explored through multiple deterministic sensitivity analyses and a probabilistic sensitivity analysis. To aid transparency, ICER provides general guidance that incremental cost per QALY ratios of less than approximately $50,000 will often be considered as indicative of a “high” value intervention; ratios from about $50,000 to $150,000 would often fit within a designation as “reasonable” value; and ratios above $150,000 would be more likely to suggest a “low” value intervention. This general guidance is based upon benchmarks from previous academic work, modified by ICER's interpretation of evidence on the role medical inflation and societal willingness to pay should have in creating cost-effectiveness thresholds.6,7 While there is a limited normative or empiric basis for the loose boundaries ICER presents, these boundaries also reflect input from stakeholders in today's health care system on how best to present incremental cost-effectiveness ratios within broad categories that can be widely understood, gain relative consensus, and be actionable.
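The general guidance on cost per QALY ranges can be sketched as a simple function. This is a deliberate simplification: as noted above, ICER does not rely on a formal threshold, the boundaries are approximate, and the final rating weighs all of the economic findings. The function name and the handling of values exactly at a boundary are assumptions made for illustration.

```python
def value_guidance(cost_per_qaly: float) -> str:
    """Map an incremental cost per QALY to the loose guidance categories.

    This is a starting point for discussion, not a rating rule.
    """
    if cost_per_qaly < 0:
        # Negative ratios are ambiguous (cost saving or clinically worse),
        # so this sketch declines to classify them.
        raise ValueError("negative ratios are not handled in this sketch")
    if cost_per_qaly < 50_000:
        return "high"
    if cost_per_qaly <= 150_000:
        return "reasonable"
    return "low"


for ratio in (20_000, 90_000, 400_000):
    print(f"${ratio:,} per QALY -> often considered '{value_guidance(ratio)}' value")
```

In practice, a ratio near a boundary, or one that shifts across categories in sensitivity analyses, would call for the kind of judgment and justification described above rather than a mechanical assignment.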
In the ICER experience with performing comparative effectiveness reviews using this rating format, several important challenges have arisen. First, the ICER ratings require a single, clearly identified comparator for whatever intervention or interventions are being considered. Thus, if multiple alternative interventions are being compared, one of them must be selected as the common comparator. For example, in the ICER appraisals of interventions for localized prostate cancer, the first report focused on radiation therapy alternatives, using intensity modulated radiation therapy (IMRT) as the common comparator on which ratings were based for brachytherapy (Ca) and proton beam therapy (Ic). But later, when ICER reviewed active surveillance and different versions of radical prostatectomy, it determined that open radical prostatectomy should be the common comparator for ratings of active surveillance (Ca) and robotic prostatectomy (Ub). To consider the ratings of the 2 reports together is challenging unless a single common comparator is selected for all interventions.
Another challenge inherent in any rating format is that an intervention may merit more than one rating for clinical effectiveness and value across identifiable patient subpopulations and indications. During the preliminary scoping phase for each assessment, ICER works with clinical experts and others to carefully define the patient populations and intervention comparators that should be looked at separately. It remains a priority of our reviews to highlight different findings for different patient populations and indications, preventing the rating format from being interpreted as a single, overly broad determination.
A third challenge that arises in any grading or rating approach is that it always involves a subjective element; judgments about strength of evidence, net health benefit, and value cannot be condensed neatly into an empiric algorithm, set of criteria, or formula. Selecting a single rating can also be difficult when there are acknowledged gray areas between the different categories in the rating scheme. The best answer to this challenge is to make the inherent judgments as transparent as possible, and to describe how various elements of evidence were weighed in arriving at summary ratings. ICER's experience with its rating format has been that the process of making and justifying these judgments, while challenging, has helped engage stakeholders deeply in the overall appraisal process; it has helped them understand the nuances of the evidence while ultimately providing them with a format that is easier to share with others as a possible guide to practice and policy.
APPLICATIONS TO DECISION MAKING
The ICER Integrated Evidence Rating is intended to facilitate the application of comparative effectiveness evidence to clinical practice and policy through coverage decisions by payers and through other mechanisms, including patient decision-support aids, clinical guidelines, payment policies, and value-based insurance benefit designs. The ratings are built to allow a pluralistic health care system, with its various decision makers, to apply them to practice and policy in a more explicit, transparent fashion.
For example, the ICER Integrated Evidence Ratings on localized prostate cancer treatment options are supporting the development of aligned patient decision aids and payment changes being designed by a coalition of purchasers, payers, and providers in Massachusetts.8 In addition, ICER reports have been used to guide coverage decisions by private health plans and by a public decision-making body, the Washington State Health Care Authority (HCA), which is legislatively instructed to consider safety, clinical effectiveness, and cost effectiveness in making coverage decisions for the state's Medicaid and other public departments.9 For the Washington HCA, ICER has performed 2 reviews: (1) CT colonography for colorectal cancer screening; and (2) coronary CT angiography (CCTA) for the detection of coronary artery disease.10,11 In the case of CT colonography, ICER supplied draft ratings of “comparable” comparative clinical effectiveness and “low” comparative value. The Washington HCA concurred with these judgments and decided not to cover CT colonography unless significant new evidence appeared or a different price could be negotiated in the future.
In the appraisal of CCTA for the detection of coronary artery disease, the ICER draft integrated evidence rating was “Ca” for the use of the test to triage patients in the Emergency Department, and a second rating of “Ub” was given for the use of CCTA in the evaluation of patients with stable chest pain in the outpatient setting. The Washington State HCA decided to cover CCTA in the Emergency Department but not in the outpatient setting.
Rating or grading formats have become a ubiquitous part of evidence review processes. But most often the job of the evidence review team is—in the vein of “globalize the evidence, localize the decision”—to provide grades on the strength or quality of evidence but to avoid presenting other judgments, including most notably that of net health benefit or comparative value.12 Does the ICER rating approach go too far in assuming a role that should be left solely to decision makers? Clearly, a balance must be struck between trying too hard to appear “impartial” by providing only an undigested compendium of study findings, and going to the other extreme of making judgments that negate the role of clinicians and patients in sorting out how the evidence should be applied in the context of individual clinical decisions. The ICER rating format seeks to avoid these twin pitfalls. Rather than just compiling and synthesizing evidence, the ICER rating format provides an accompanying interpretation of the results. This interpretation must be clearly articulated and justified, but it is still just an interpretation by ICER, and the ultimate goal is to provide decision makers with a format that gives them a “first draft” interpretation and allows them to debate specific points more efficiently and consider how they wish to construct their own final judgment.
ICER's experience to date suggests that using a rating format to interpret both comparative clinical effectiveness and comparative value is acceptable and useful for medical policy decision making by a limited number of public and private payers. It remains to be seen whether ICER's ratings will be well suited for clinical guideline development or for communications with patients and their families. It may be difficult to arrive at a single format that works equally well for patients, clinicians, and payers, but the legitimacy and the ultimate impact of comparative effectiveness research will be greatly enhanced if a single basic approach can be devised and tailored slightly to meet individual stakeholders’ needs. Continued work will be required to make the assignment of ratings as explicit and transparent as possible so that all stakeholders can understand the complex interplay of evidence and judgment that is required to produce any comparative effectiveness review.
1.Wilensky GR. Cost-effectiveness information: yes, it's important, but keep it separate, please! Ann Intern Med
2.Sawaya GF, Guirguis-Blake J, LeFevre M, et al. Update on the methods of the U.S. Preventive Services Task Force: estimating certainty and magnitude of net benefit. Ann Intern Med
3.Berger ML, Mamdani M, Atkins D, et al. Good research practices for comparative effectiveness research: defining, reporting and interpreting nonrandomized studies of treatment effects using secondary data sources. Value Health
5.Institute for Clinical and Economic Review. Active Surveillance and Radical Prostatectomy for Clinically Localized, Low-Risk Prostate Cancer. Available at: http://icer-review.org/index.php/as-rp.html. Accessed August 13, 2009.
6.King JT Jr, Tsevat J, Lave JR, et al. Willingness to pay for a quality-adjusted life year: implications for societal health care resource allocation. Med Decis Making
7.Braithwaite RS, Meltzer DO, King JT Jr, et al. What does the value of modern medicine say about the $50,000 per quality-adjusted life-year decision rule? Med Care
10.Institute for Clinical and Economic Review (ICER). Final Appraisal Document: CT Colonography for Colorectal Cancer Screening. January, 2008. Available at: http://icer-review.org/index.php/ctc.html. Accessed August 21, 2009.
11.Institute for Clinical and Economic Review (ICER). Final Appraisal Document: Coronary Computed Tomographic Angiography for Detection of Coronary Artery Disease. January, 2009. Available at: http://icer-review.org/index.php/ccta.html. Accessed August 21, 2009.
12.Eisenberg JM. Globalize the evidence, localize the decision: evidence-based medicine and international diversity. Health Aff (Millwood)
13.Atkins D, Eccles M, Flottorp S, et al. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. The GRADE Working Group. BMC Health Serv Res
14.Guyatt GH, Oxman AD, Vist GE, et al; GRADE Working Group. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ
15.Guyatt GH, Oxman AD, Kunz R, et al; GRADE Working Group. GRADE: going from evidence to recommendations. BMJ
16.Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol.
17.Tunis SR, Pearson SD. Coverage options for promising technologies: Medicare's coverage with evidence development. Health Aff (Millwood)
A Comparison of the ICER Methodology With Other Major Methods for Grading Comparative Clinical Effectiveness
The most common approach among evidence review groups, both in the United States and internationally, has been to use some kind of ranking or “grading” system to describe the evidence and justify how well it supports conclusions or recommendations for specific clinical applications.13 Grades have been assigned to all levels of evidence reviews and evidence-based decision making, from grades on the quality of individual studies, to grades on the strength of a body of evidence, to grades on the strength of recommendations made on the basis of the evidence.
In designing a grading system tailored to support evidence-based coverage decisions, our efforts were informed by the approaches developed by 3 influential evidence rating programs: (1) the United States Preventive Services Task Force (USPSTF); (2) the international Grading of Recommendations Assessment, Development and Evaluation (GRADE) group; and (3) the Effective Healthcare Program of the Agency for Healthcare Research and Quality (AHRQ).
The GRADE methodology was developed to support guideline recommendations. Its approach first produces grades for the “quality of evidence” for each outcome of interest. These grades are then considered, along with the balance of harms and benefits, values and preferences of patients, and costs, to determine whether a guideline recommendation is ultimately labeled as “weak” or “strong.”14,15 The AHRQ Effective Health Care program produces evidence reports that are intended to support independent decisions by a wide spectrum of government agencies, professional societies, and other stakeholders.16 The AHRQ evidence reports follow the GRADE approach closely in grading the strength of evidence for each outcome, but purposefully leave it to decision makers to judge the balance between harms and benefits and to consider additional contextual issues in a comparison of alternative interventions.
The USPSTF is an independent group convened by AHRQ to provide recommendations on the use of clinical preventive services. Reflecting this purpose, the USPSTF, whose deliberations are supported by AHRQ evidence reviews, folds a judgment of the quality of evidence into a more prominent assessment of “certainty in net health benefit.”2 This grade is then combined with a rating of the magnitude of net health benefit suggested by the evidence to create an overall recommendation grade of A, B, C, or D; an I grade is assigned when evidence is judged to be insufficient to make any recommendation at all. Each grade is linked to standard recommendation language: an “A” grade, for example, indicates that the USPSTF “strongly recommends that clinicians routinely provide the service to eligible patients”; a “B” grade is a weaker but still positive recommendation, and so on.
The EBM Roadmap group concluded that certainty in net health benefit, and not quality of evidence, serves as a better term to describe how payers judge the evidence underlying a particular coverage decision. We also concluded that a rating of evidentiary certainty of any net benefit was not adequate for transparency in linking evidentiary assessment to a coverage decision. Therefore, in a conceptual approach highly similar to that put forth by USPSTF, the EBM Roadmap Group created a two-axis “EBM matrix” to guide and describe judgments of comparative clinical effectiveness as part of coverage decisions (Fig. 2). The matrix consists of a vertical axis on which “level of certainty” is rated in one of 3 categories; the horizontal axis contains categories of net health benefit. An overall judgment of the comparative clinical effectiveness of a health intervention reflects a joint assessment and explicit rating of these 2 components. These axes are described in greater detail below, with general guidelines for distinguishing between the different levels of certainty and magnitudes of net health benefit.
Rating the Level of Certainty
Table 1 presents a definition of levels of certainty along with a description of the types of evidence that would generally qualify as providing high, moderate, and low certainty. These descriptions are built upon considerations of a body of evidence that are embedded in different ways in the USPSTF, GRADE, and AHRQ grading methodologies. The USPSTF presents a list of 6 questions to be considered; GRADE and AHRQ list a series of criteria of quality or of strength of evidence, respectively. Incorporating many of the same general considerations, we designed a list of 7 potential limitations in a body of evidence that could reduce certainty (Table A1). We believe that evidence review groups and decision makers can be most transparent in their rating of a level of certainty by recording the presence and the relative importance of these 7 potential limitations in the body of evidence they are assessing. Clarity and transparency will be enhanced if these limitations are examined and reported separately for the evidence available on benefits and the evidence on harms.
Estimating Net Health Benefit
The horizontal axis of the EBM matrix displays gradients of the net health benefit provided by a health intervention compared with the net health benefit of the selected comparator intervention. The EBM Matrix uses the term comparative “net” health benefit because of the importance attached to explicit judgment of the balance of benefits and harms. While explicit, however, this framework does not specify the method by which net benefit should be determined. Evidence review organizations and policy decision-makers use widely varying methods to quantify harms and benefits. Some organizations construct a formal mathematical comparison of 2 health interventions by summing all harms and benefits into a common metric, most often the Quality-Adjusted Life Year, or “QALY.” Whichever intervention produces the highest number of QALYs is preferred on clinical grounds. Most clinical guideline groups and evidence review groups in the United States, however, rarely sum harms and benefits into a single number in this fashion. Instead, a common approach is to apply implicit, qualitative weightings to all the different harms and benefits. Often, therefore, the judgment of whether there is a comparative net health benefit between 2 interventions remains less than fully transparent.
To improve both consistency and transparency, the EBM Roadmap Group endorses the recommendations of GRADE, AHRQ, and the USPSTF, which encourage evidence reviewers and policy decision makers to quantify the magnitude of benefits and harms to the greatest extent possible. Empirical comparison of QALYs or other summary measures may be a useful guide to estimating net health benefit, particularly when the 2 interventions being compared have very similar profiles of types of benefits and harms. However, such analyses can be complex and require many assumptions; therefore, empirical measures that do not merge all harms and benefits together should be used, even when QALYs or another composite measure are also being compared. Examples of noncomposite measures useful for comparison of benefits and harms include number needed to treat (the number of people who would need to be treated for some defined period to prevent one adverse health event) and number needed to harm (the number of people who would need to be treated for a defined time for one adverse health event to occur).
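As an illustration of these noncomposite measures, the sketch below computes number needed to treat and number needed to harm as the reciprocal of the absolute difference in event risk between 2 groups, which is the standard definition. The function name and all event counts are hypothetical, and rounding up to a whole person is an assumed reporting convention rather than a requirement of the framework.

```python
from fractions import Fraction
import math


def number_needed(events_a, n_a, events_b, n_b):
    """Return 1 / |risk_a - risk_b|, rounded up to a whole person.

    Exact fractions are used so that the rounding step is not
    disturbed by floating-point error.
    """
    risk_a = Fraction(events_a, n_a)
    risk_b = Fraction(events_b, n_b)
    difference = abs(risk_a - risk_b)
    if difference == 0:
        raise ValueError("no risk difference; the measure is undefined")
    return math.ceil(1 / difference)


# Hypothetical benefit: the target event occurs in 100 of 1000 comparator
# patients and in 60 of 1000 treated patients over the same period.
nnt = number_needed(100, 1000, 60, 1000)

# Hypothetical harm: an adverse event occurs in 10 of 1000 comparator
# patients and in 20 of 1000 treated patients.
nnh = number_needed(10, 1000, 20, 1000)

print(nnt, nnh)  # 25 100
```

Reported side by side, these hypothetical figures would indicate that 25 people need to be treated to prevent one event, while one additional adverse event is expected for every 100 people treated. Benefits and harms remain separately visible instead of being merged into one composite number.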
Summary Ratings of Level of Certainty in a Magnitude of Net Health Benefit
As can be seen in Figure 2, the EBM matrix maps the 3 categories of certainty onto the categories of net health benefit to define a summary rating of comparative clinical effectiveness. Here the relationship between level of certainty and magnitude of net health benefit comes into sharper relief. With a high level of certainty, the point estimate of net health benefit in one category is relatively assured, and therefore each cell in the matrix on the row of high certainty has a distinct label. A technology whose evidence base provides high certainty of a moderate-to-high net health benefit is rated to have “superior” comparative clinical effectiveness. As the net health benefit diminishes, the rating of comparative clinical effectiveness shifts to “incremental,” then “comparable,” and finally “inferior.”
When the level of certainty in the net health benefit is only moderate, however, uncertainty about benefits, harms, or both is such that the precision of the estimated net health benefit is significantly reduced. This lack of precision is akin to a broader “conceptual confidence interval,” and is illustrated in the matrix by the broader summary categories of Unproven with Potential (U/P) and Insufficient (I).
The U/P category is a particularly important element of the EBM matrix. This category is intended to indicate a judgment that the available evidence can only yield moderate certainty in the comparative net health benefit at the population level, but that the best estimate is that the net health benefit is either small or substantial. Our definition of the U/P category includes the small possibility that future evidence would demonstrate that the true comparative net health benefit is inferior to other alternatives for many or all patients. Interventions that have an evidence base that would fall into this general category might be similar to implantable cardioverter-defibrillators (ICDs) in the years 2000–2003, when the Centers for Medicare and Medicaid Services determined that the evidence was not adequate to allow a clear judgment that ICDs were “reasonable and necessary” for coverage by Medicare.17 The evidence was not adequate (certainty was not high), but the judgment was that certainty of a net health benefit was good enough (moderate) to warrant coverage with evidence development. The U/P category therefore provides a conceptual basis for identifying technologies in an evidentiary “middle ground,” and thus can provide a more explicit and transparent description of judgments to support conditional coverage and reimbursement decisions beyond the “cover/no-cover” option.
The final summary category of comparative clinical effectiveness is the “I” category, which sweeps from the moderate certainty of comparable or inferior net health benefit into the entire bottom row, in which certainty in net health benefit is so low that there remains a reasonable chance that the true net health benefit is inferior; in other words, that the intervention being evaluated produces a net harm for most or all patients.
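The mapping from level of certainty and net health benefit to a summary rating, as described in the preceding paragraphs, can be sketched as a simple lookup. This is an illustrative reading of the text only, not a reproduction of Figure 2: the category keys ("substantial", "small", "comparable", "inferior"), the function name, and the exact cell boundaries are assumptions, with "substantial" standing in for the moderate-to-high net health benefit described above.

```python
# Ratings on the high-certainty row, where each cell has a distinct label.
HIGH_CERTAINTY_RATINGS = {
    "substantial": "Superior",
    "small": "Incremental",
    "comparable": "Comparable",
    "inferior": "Inferior",
}

UNPROVEN_WITH_POTENTIAL = "Unproven with Potential (U/P)"
INSUFFICIENT = "Insufficient (I)"


def summary_rating(certainty, net_benefit):
    """Return the summary rating of comparative clinical effectiveness.

    certainty: "high", "moderate", or "low"
    net_benefit: best estimate of comparative net health benefit
    """
    if net_benefit not in HIGH_CERTAINTY_RATINGS:
        raise ValueError(f"unknown net health benefit category: {net_benefit}")
    if certainty == "high":
        return HIGH_CERTAINTY_RATINGS[net_benefit]
    if certainty == "moderate":
        # Moderate certainty of some net benefit is U/P; moderate certainty
        # of comparable or inferior net benefit falls into I.
        if net_benefit in ("substantial", "small"):
            return UNPROVEN_WITH_POTENTIAL
        return INSUFFICIENT
    if certainty == "low":
        # The entire bottom row is rated I.
        return INSUFFICIENT
    raise ValueError(f"unknown level of certainty: {certainty}")


print(summary_rating("high", "small"))         # Incremental
print(summary_rating("moderate", "small"))     # Unproven with Potential (U/P)
print(summary_rating("low", "substantial"))    # Insufficient (I)
```

Under this reading, the same best estimate of a small net health benefit is rated “incremental” at high certainty, U/P at moderate certainty, and I at low certainty, which shows how the level of certainty widens or narrows the summary category.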