Promising results from recent trials of oral and topical pre-exposure antiretroviral prophylaxis (PrEP) have bolstered hopes that antiretroviral (ARV)-based methods will be a cornerstone of HIV prevention efforts in the future. Nevertheless, it is clear that in the near term, there will be no HIV prevention panacea.1,2 Instead, the emphasis in prevention research has shifted to the evaluation of combination prevention packages. In this paradigm, biomedical, behavioral, and structural interventions are implemented concurrently, such that synergies among interventions could lead to substantial effectiveness overall.3,4 For example, biomedical approaches will be rolled out in concert with behavioral programing (eg, ARV pre-exposure prophylaxis—oral or vaginal—plus risk-reduction counseling) to ensure uptake and adherence. On a structural level, a wide variety of interventions may contribute to the success of a combination prevention package, from supply chain and operational programing needed to ensure the availability and quality of commodities and services [eg, condoms, HIV testing and counseling (HTC)], to social, economic, and political interventions that address the risk environment faced by vulnerable populations.4-6 However, these kinds of combination programs pose 2 significant evaluation challenges: how best to determine the impact of such combined prevention packages and how (and whether) to measure the effectiveness of component strategies given that they are implemented in parallel with other interventions.
Numerous high-quality guidelines for large-scale impact evaluations exist,7-10 but none have explicitly addressed the methodological issues unique to HIV prevention programs delivered alone or in combination. These issues include the long causal pathway between intervention and impact, the indeterminate accuracy of intermediates on the causal pathway (eg, self-reported behaviors),11,12 the inconsistent relationship between HIV and surrogate outcomes such as sexual risk behavior, sexually transmitted infections, and pregnancy,11,13 and the relative rarity of HIV as an outcome. Together, these issues further complicate an already complex evaluation problem.
Nevertheless, evaluation of the impact of combination HIV prevention programing is an urgent global need.14,15 As opposed to “efficacy”, which measures a program's individual-level effect under highly controlled conditions, “impact” (or community-level effectiveness) is the effect of a program on a population level as measured by changes in incidence, prevalence, mortality, and/or other ultimate outcomes of interest. Most evaluations of HIV prevention interventions have failed to implement rigorous designs or adequately assess impact, and many are further weakened by reliance on insufficient surrogates for HIV incidence. The poor evidence base for HIV prevention is sustained by a persistent myth in the HIV prevention community that rigorous experimental and quasi-experimental methods to measure impact are unrealistic. Yet, in contrast, these methods have been widely adopted and are highly regarded in other areas of health and development.16-20 This article outlines the challenges of evaluating combination HIV prevention programs and proposes a framework to guide impact evaluation design.
UNIQUE CHALLENGES IN EVALUATION OF HIV PREVENTION PROGRAMS
Numerous impact evaluation publications have considered the difficulties of evaluating combination interventions. The need to measure outcomes, rather than process variables, and the preference for experimental designs whenever feasible are well established.7-10,21,22 However, applying this guidance to the evaluation of both individual and combination HIV prevention interventions poses unique methodological challenges.
Large Sample Sizes and Long Time Horizons
The relative rarity of HIV as an outcome results in lengthier, more expensive evaluations when HIV incidence is the outcome of interest.1 In the absence of a reliable incidence assay, individuals must be followed prospectively to directly measure HIV incidence, resulting in long evaluations with large sample sizes. Indeed, adequately powering a study to detect changes in HIV incidence may only be possible in a few high prevalence areas in Africa. The sample size needed to detect a difference in HIV incidence may be 10 times larger than the sample needed to detect a change in sexually transmitted infection (STI) incidence and 200 times larger than the sample required to detect a change in behavior.23 However, the overreliance on insufficient surrogates (such as self-reported behavior or STIs) for HIV infection has resulted in considerable uncertainty about program effectiveness.
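The dependence of sample size on outcome frequency can be illustrated with the standard normal-approximation formula for comparing 2 proportions. The sketch below is illustrative only: the function name `n_per_arm`, the outcome frequencies, and the 30% relative reduction are hypothetical values chosen for the example and are not taken from the studies cited above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_program, alpha=0.05, power=0.80):
    """Approximate participants per arm needed to detect a difference
    between two proportions with a two-sided z-test (individual
    randomization, no clustering, no loss to follow-up)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_program * (1 - p_program)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_program) ** 2)


# Hypothetical inputs: the program reduces each outcome by 30% relative to control.
n_hiv = n_per_arm(0.02, 0.014)      # rare outcome, eg, 2% HIV incidence over follow-up
n_behavior = n_per_arm(0.40, 0.28)  # common outcome, eg, 40% reporting a risk behavior

print("per-arm sample size, HIV incidence:", n_hiv)
print("per-arm sample size, behavior:", n_behavior)
```

For the same relative effect, the rarer outcome requires a far larger sample. The exact ratio depends entirely on the assumed frequencies and effect sizes, so this sketch should not be read as reproducing the multiples reported in the text.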
In addition to added cost, lengthy evaluations also run the danger of conflating program effects with changes in the natural course of the epidemic or the context in which the intervention is being implemented. These changes could be driven by unanticipated implementation of new interventions, increased availability of treatment, changes in the epidemic phase in which the evaluation is conducted or unforeseen secular changes.
In concentrated epidemics, combination prevention programs may target risk groups or geographical areas that vary widely in population size and incidence, adding complexity to the evaluation challenge. In addition, because sampling frames are unavailable, program implementation must follow natural clusters or geographic units, such as gay bars or prisons. Determining the appropriate sample size is particularly important in concentrated epidemics, given the heterogeneity of HIV prevalence within subgroups of risk and the fact that many of these subgroups are hard to reach.
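When implementation follows natural clusters, a common first approximation is to inflate an individually randomized sample size by the design effect, 1 + (m - 1) x ICC, where m is the average cluster size and ICC is the intracluster correlation coefficient. The sketch below assumes equal cluster sizes; the function names and the numeric inputs are hypothetical.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from sampling whole clusters of equal size."""
    return 1 + (cluster_size - 1) * icc


def inflated_sample_size(n_individual, cluster_size, icc):
    """Sample size after accounting for clustering."""
    return ceil(n_individual * design_effect(cluster_size, icc))


# Hypothetical example: 1000 participants would suffice under individual
# randomization; venues average 50 participants and the assumed ICC is 0.02.
print(inflated_sample_size(1000, 50, 0.02))
```

Because prevalence is heterogeneous across risk subgroups, the ICC in a concentrated epidemic may be considerably larger than in this example, and the required sample grows accordingly.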
Lack of Naïve Control Groups
As we approach the 30-year mark of the HIV/AIDS epidemic, an unprecedented amount of resources has been dedicated to treatment and prevention. For example, reported US President's Emergency Plan for AIDS Relief (PEPFAR) expenditures on HIV prevention, treatment, and care from FY 2006-2009 were $10.5 billion, of which $2.5 billion (23.8%) was allocated to prevention programs alone.24 In PEPFAR focus countries and other regions where extensive prevention programing has already been implemented, “naive” control groups are unlikely to exist. Thus, evaluations can only measure the marginal effect of the combined programs above and beyond existing programs. In practical terms, this means that evaluations of combined prevention packages may involve comparisons with “business-as-usual” or existing government programs as the counterfactual comparison. The need for a control group introduces a potential ethical issue where certain communities receive no more than standard practices for at least some period of time, while others receive a new program. Evaluation schemes that use stepped wedge designs or phased roll-out25,26 are optimal, as they capitalize on the logistical reality that programs typically cannot be rolled out everywhere at once while ensuring that all geographic areas eventually receive the intervention.
Poor Surrogates for HIV Infection
A good surrogate endpoint for HIV infection must be highly sensitive (ie, it is present in nearly everyone who acquires HIV infection) and highly specific (ie, it is absent in nearly everyone who remains uninfected), and it must be measurable accurately and reliably (as has been noted in the chronic disease literature27). If such an ideal surrogate existed, it could circumvent the persistent problem of lengthy evaluations to measure HIV incidence. Unfortunately, self-reported behavior, STIs, and pregnancy are all imperfect surrogates that can supplement, but not replace, direct or indirect measurement of impact on HIV incidence, infections averted, prevalence, and/or mortality.
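As an illustration, the accuracy of a candidate surrogate can be summarized from a 2 x 2 table that cross-classifies surrogate status against HIV seroconversion. The sketch below uses the standard epidemiologic definitions; the function name and the counts are hypothetical and do not come from any cited study.

```python
def surrogate_accuracy(true_pos, false_neg, false_pos, true_neg):
    """Summarize how well a binary surrogate tracks HIV seroconversion.

    true_pos:  surrogate present, seroconverted
    false_neg: surrogate absent, seroconverted
    false_pos: surrogate present, remained uninfected
    true_neg:  surrogate absent, remained uninfected
    """
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "positive_predictive_value": true_pos / (true_pos + false_pos),
    }


# Hypothetical cohort of 1000 people, 100 of whom seroconverted.
result = surrogate_accuracy(true_pos=30, false_neg=70, false_pos=200, true_neg=700)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With counts such as these, most seroconverters would be missed by the surrogate and most surrogate-positive individuals would remain uninfected, which is the pattern that makes an endpoint unsuitable as a replacement for HIV incidence.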
The use of self-reported sexual behavior as a surrogate for HIV transmission depends on how well sexual behavior can be measured and whether high-risk behavior correlates with HIV infection.12,23,28 Self-reported behavior is often not highly associated with HIV infection,11-13 an observation that may be due to social desirability bias, measurement error, sexual network characteristics (which can increase risk independently of individual behavior), as well as the possibility that people have safe sex with risky partners and risky sex with safe partners.23,28 Moreover, the causal effect of sexual risk behavior on HIV acquisition is entirely dependent on exposure to the virus (ie, regardless of risk behavior, HIV cannot be sexually acquired unless there is contact with an infected partner), which limits the validity of behavioral outcomes as a surrogate for HIV.
The use of STIs as a surrogate for HIV assumes that STIs unequivocally lie on the causal pathway, an assumption not borne out by rigorous randomized controlled trials (RCTs) evaluating the effect of STI treatment for HIV prevention, which have had mixed results, with the majority showing no effect on HIV incidence.1,29 Further, differential infectivity between HIV and both STIs and pregnancy limits their use as reliable proxies of impact. Yet until reliable HIV incidence assays are available, short-term outcomes in the form of externally verifiable biologic outcomes (eg, pregnancy in the case of HIV prevention among teenagers, incidence of bacterial STIs among sex workers) may be suitable surrogates for interim monitoring to make mid-course program corrections, with the ultimate goal of measuring impact on HIV itself as the primary outcome.
KEY CONSIDERATIONS IN IMPACT EVALUATIONS OF COMBINATION HIV PREVENTION
The following sections outline the essential features of impact evaluations of large-scale combination HIV prevention programs.
Defining the Evaluable Package
The first step in any impact evaluation is to define the evaluable package. Realistic combination prevention packages might include (1) existing programs implemented over an entire region where an evaluation would occur (eg, mass communication); (2) existing programs that would need to be enhanced for the evaluation (eg, intensifying efforts of male circumcision); (3) existing programs that are modified to increase demand and uptake (eg, adding incentives for HTC); and (4) new programs (eg, PrEP). Evaluation of combination programs should focus on the “entire” program including what “already exists” and what might be “added or enhanced”.
Determining Whether to Evaluate Each Component Versus the Entire Package
Evaluators must strike a balance between complexity and simplicity when deciding which, if any, component programs in a combination prevention package require independent assessment of impact. On one hand, disaggregating impact to allow for causal attribution to specific components should facilitate the development of efficient programs, enabling program planners to discontinue the least effective components and strengthen the most effective components. However, this is complicated by the inextricable link between certain interventions. For example, HTC is the first step in identifying persons living with HIV/AIDS for ARV therapy and HIV-negative individuals who might begin PrEP; thus, it is an essential gateway to treatment and prevention services. In addition, distinguishing the independent effects of different intervention components can only be done at the cost of increased sample sizes, resources, and time. On the other hand, if the entire package is ineffective, examining the effect of component parts largely becomes moot, assuming program components are not antagonistic. Therefore, evaluation of the complete package should take priority in terms of resources and design.
Choosing a Study Design With a Valid Counterfactual
A significant but essential challenge in any impact evaluation is identifying a valid “counterfactual”—the hypothesized scenario of what would have happened if the program had not been implemented. Experimental methods, quasi-experimental methods, and mathematical modeling each offer advantages and disadvantages in the identification of a counterfactual for a combination prevention program.
Experimental evaluations, which randomly allocate the “treatment” and “comparison” conditions, are considered the gold standard of causal evidence. They typically have the most internal validity and the least selection bias as both observed and “unobserved” confounders are balanced between study arms.30 However, there has been much debate in the health promotion field about the feasibility of using randomized designs to assess the population-level impact of social or behavioral interventions and complex programs.8,21,31-33 Regrettably, much of this debate has been caused by a confusion between RCTs and randomized forms of implementation and evaluation. Unlike RCTs, which evaluate the efficacy of biomedical interventions (eg, vaccines) under highly controlled settings for regulatory purposes and thus often lack external validity, randomization can indeed be conducted in real-world settings (with limited added cost) to examine multifaceted programs that include structural and social interventions. The growing body of literature on successful health-related randomized impact evaluations highlights the feasibility and value of experimental approaches to combination interventions.17,18,20,34-38
Randomized impact evaluations often randomize entire communities to treatment and comparison groups, and many use stepped wedge or phase-in designs that allocate locations to receive the new prevention package based on time.25,26 These designs capitalize on the logistical reality that programs typically cannot be rolled out everywhere at once, while ensuring equity by eventually delivering the intervention to all geographic areas. Communities that do not initially receive the program serve as the comparison group until program implementation. An example is the evaluation of the “Progresa” program in Mexico, a conditional cash transfer program for poor households based on school attendance and regular health care visits.17,39 Using an experimental stepped wedge design, the program was sequentially rolled out in 506 communities. The rigorous evaluation and beneficial results ensured that the antipoverty program continued within Mexico even after a change in administration, and it has become one of the most studied social programs in the world (since renamed Oportunidades). Moreover, it had a major impact on global policy, as evidenced by the numerous conditional cash transfer programs that were later adopted as antipoverty programs in more than 2 dozen countries, a testament to the importance of rigorous evaluation.40
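The allocation logic of a randomized stepped wedge design can be sketched in a few lines. In this minimal illustration, communities are shuffled and dealt into roll-out steps, and each community contributes comparison time until its step is reached. The function names, the number of communities, and the number of steps are hypothetical and do not describe the Progresa roll-out.

```python
import random


def stepped_wedge_schedule(communities, n_steps, seed=0):
    """Randomly assign each community to one roll-out step (1..n_steps)."""
    rng = random.Random(seed)
    order = list(communities)
    rng.shuffle(order)
    schedule = {step: [] for step in range(1, n_steps + 1)}
    for position, community in enumerate(order):
        schedule[position % n_steps + 1].append(community)
    return schedule


def exposure_matrix(schedule, n_steps):
    """Return, per community, a 0/1 list over periods 0..n_steps.

    Period 0 is a baseline in which no community has the program;
    a community is exposed (1) from its assigned step onward.
    """
    matrix = {}
    for step, group in schedule.items():
        for community in group:
            matrix[community] = [int(period >= step) for period in range(n_steps + 1)]
    return matrix


communities = [f"community_{i}" for i in range(1, 13)]  # hypothetical units
schedule = stepped_wedge_schedule(communities, n_steps=4, seed=1)
matrix = exposure_matrix(schedule, n_steps=4)
for community in sorted(matrix):
    print(community, matrix[community])
```

Every row begins with 0 and ends with 1, reflecting the two properties emphasized above: all communities serve as comparisons at first, and all eventually receive the program.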
In the absence of random allocation of a program to individuals or communities, the impact evaluation toolkit includes quasi-experimental methods that generate a valid counterfactual without random allocation through the use of statistical methods or modeling, which permit causal attribution of outcomes to the program. Such methods partially mitigate selection and confounding biases by using statistical methods to control for “observed” characteristics that may differ between the intervention and comparison groups (either in the design or analysis phase).22 However, quasi-experimental methods remain less robust than experimental methods because “unobserved” factors may still bias the measures of effect. In addition, some quasi-experimental techniques require large datasets. Given the lack of existing data for most at-risk groups, straightforward randomized approaches may be more suitable in evaluating programs that target these populations.
Nevertheless, rigorous quasi-experimental impact evaluations have been conducted for large combination programs and offer a suitable alternative when randomization is not possible because of insufficient political will or logistical constraints (eg, the program has already been rolled out). A recent evaluation of a pre-existing combination hygiene and sanitation intervention in Southern India used propensity score matching to construct a plausible counterfactual. The study benefited from the availability of preintervention national census data and preintervention household surveys on water sources and hygiene behavior, which contained detailed information on major confounders in the treatment and potential control villages. The evaluators were then able to construct a 25-village matched sample that was well balanced on all observable confounders. They found that the program had no impact on ultimate outcomes of interest (prevalence of child diarrhea and child height).41
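The matching step of a propensity score analysis can be sketched as follows. This is a generic illustration, not the procedure used in the study cited above: it assumes that propensity scores have already been estimated from preintervention covariates, and it uses greedy 1:1 nearest-neighbor matching without replacement within a caliper. All names, scores, and the caliper width are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def match_on_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, controls: dicts mapping unit id -> estimated propensity score.
    Treated units with no control within the caliper are left unmatched.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


def standardized_difference(x_treated, x_control):
    """Balance diagnostic for one covariate (difference in means / pooled SD)."""
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical villages with precomputed propensity scores.
treated = {"T1": 0.30, "T2": 0.62, "T3": 0.95}
controls = {"C1": 0.28, "C2": 0.60, "C3": 0.33, "C4": 0.10}
pairs = match_on_propensity(treated, controls)
print(pairs)
```

In this example the treated village with a score of 0.95 has no comparable control and is dropped, which illustrates why such methods depend on rich preintervention data and adequate overlap between program and comparison areas. Matching balances only observed characteristics, so the standardized differences should be checked for every measured confounder after matching.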
Mathematical modeling can be used to construct a counterfactual in specific cases when data from large, population-based, repeated cross-sectional serosurveys are available. Briefly, data from 2 consecutive household serological surveys measuring HIV prevalence and ARV therapy coverage are compared to yield the population level epidemiological change that can be disaggregated into incidence, mortality from AIDS, and other mortality.42,43 These parameters can then be combined with additional indicators (eg, antenatal care surveillance data, risk behavior) to model a counterfactual, that is, the expected epidemic trajectory without the program. To assess whether any changes can be attributed to the program, observed prevalence postimplementation is then compared with the expected prevalence estimated by the model. As with all methods, this technique has important prerequisites, such as the availability of cross-sectional data before program implementation. In addition, the accuracy of the model depends on the modeling structure and assumptions and the quality and time delay of information collected before the intervention.
Examining Outcomes With Valid Means of Measurement
The success of combination HIV prevention packages must be measured in terms of impact on HIV infection, and a central problem for evaluators is therefore the absence of a reliable incidence assay. There are 2 primary ways to measure changes in HIV incidence. Prospective (or retrospective) cohort studies nested within larger study designs (such as a stepped wedge) allow for direct comparisons of incidence between program and nonprogram areas or communities exposed and not exposed to the program. However, participant retention can be a formidable challenge requiring significant resources. Significant loss to follow-up and out-migration often result in a cohort that evolves to become distinct from the underlying source population it was selected to represent (cohort bias). In contrast, data from serial cross-sectional household serosurveys can be used to “indirectly” measure HIV incidence by parsing the epidemiologic change in prevalence into incidence, mortality from AIDS, and other mortality.42,43 This technique requires the availability of large representative datasets and advanced modeling; however, results obtained with this method are at best highly suggestive compared with direct incidence measures.
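The accounting behind the indirect approach can be illustrated with a deliberately simplified sketch. It is not the method of the cited studies: it assumes a closed population between 2 surveys (no migration and no new entrants), known survival probabilities for infected and uninfected people over the interval, and that people infected during the interval survive it at the same rate as uninfected people. All function names and input values are hypothetical.

```python
def projected_prevalence(prev_start, cum_incidence, surv_infected, surv_uninfected):
    """Prevalence at the second survey implied by a given cumulative incidence."""
    alive_prevalent = prev_start * surv_infected
    alive_initially_uninfected = (1 - prev_start) * surv_uninfected
    alive_incident = alive_initially_uninfected * cum_incidence
    return (alive_prevalent + alive_incident) / (alive_prevalent + alive_initially_uninfected)


def cumulative_incidence(prev_start, prev_end, surv_infected, surv_uninfected):
    """Back-calculate cumulative incidence between two prevalence surveys.

    Inverts projected_prevalence under the same simplifying assumptions.
    """
    alive_prevalent = prev_start * surv_infected
    alive_initially_uninfected = (1 - prev_start) * surv_uninfected
    alive_total = alive_prevalent + alive_initially_uninfected
    new_infections = prev_end * alive_total - alive_prevalent
    return new_infections / alive_initially_uninfected


# Hypothetical surveys: prevalence is 15% at both time points, yet infected
# people are assumed to have lower survival (0.85) than uninfected people (0.97).
estimate = cumulative_incidence(0.15, 0.15, surv_infected=0.85, surv_uninfected=0.97)
print(f"implied cumulative incidence: {estimate:.3f}")
```

The example shows why prevalence alone is a poor measure of impact: stable prevalence combined with excess mortality among infected people implies ongoing new infections. It also shows how sensitive the estimate is to the assumed survival inputs, which is one reason results from such models are suggestive rather than definitive.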
Combination Evaluation for Combination Prevention
The combination of multiple “types” of design elements may facilitate the most robust evaluation of combination prevention interventions. For example, control of program implementation order (either random or nonrandom stepped wedge designs) allows for comparisons of program effectiveness between program and nonprogram areas. In addition, a nested prospective cohort study within a phased design allows for direct assessment of HIV incidence in a subsample of the population. Serial cross-sectional surveys punctuating the beginning, middle, and end of the evaluation can rapidly assess changes in behavior and population seroprevalence and can also be used to assess whether the characteristics of the population-based cohort remained representative of the general population over the course of program implementation. In addition, mathematical modeling with the data from cross-sectional surveys can be used to indirectly assess incidence, which can be compared with the direct incidence measure from the cohort.
Designing the Study to Allow for Mid-Course Corrections
We do not have the luxury of waiting several years to learn whether a program works and cannot settle for “black box” evaluations that provide no indication as to what worked until completion 5 or more years after the inception of the program (another hold-over from biomedical studies governed by regulatory requirements). Thus, although direct measurement of HIV infection outcomes is essential, all evaluation plans must be dynamic, providing ongoing feedback with implementation milestones (eg, program coverage, proximal outcomes), interim assessments, and mid-course corrections based on intermediate outcomes (including trends in impact). In particular, trends in the incidence of curable STIs, pregnancy, knowledge of prevention methods, or self-reported measures of sexual risk behaviors can be used to ensure that programs are on track to maximize the likelihood of being able to assess impact and to make preliminary results public knowledge. The success of these study designs relies on the tight integration of interim program assessments with feedback loops.
Flexible or adaptive group sequential study designs also permit mid-study design modifications that are based on a series of “preplanned” interim assessments to determine impact on intermediate outcomes.44-46 This information can then be used to adjust sample size, modify or jettison ineffective components of the intervention, or even abandon the entire program without compromising the ultimate statistical assessment of results. Adaptive designs can have significant ethical advantages in both ensuring that more individuals receive the optimized program and providing earlier preliminary evidence of effect.
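A simple example of a preplanned interim rule is the Haybittle-Peto boundary, under which a study is stopped early only if an interim test statistic is extreme (commonly |z| of 3 or more), leaving the final analysis at approximately the conventional threshold. The sketch below is offered as one familiar group sequential rule, not as the adaptive method described in the cited references; the function name and the z-values are hypothetical.

```python
def sequential_decision(z_statistics, interim_threshold=3.0, final_threshold=1.96):
    """Apply a Haybittle-Peto style rule to z-statistics listed in analysis order.

    The last element is treated as the final analysis; all earlier
    elements are preplanned interim looks. Returns (decision, look number).
    """
    last = len(z_statistics) - 1
    for look, z in enumerate(z_statistics):
        threshold = final_threshold if look == last else interim_threshold
        if abs(z) >= threshold:
            decision = "significant_at_final" if look == last else "stop_early"
            return decision, look + 1
    return "not_significant", len(z_statistics)


# Hypothetical z-statistics from two interim looks and one final analysis.
print(sequential_decision([1.2, 3.4, 2.0]))  # crosses the interim boundary at look 2
print(sequential_decision([1.2, 2.5, 2.1]))  # interim looks do not cross; final does
print(sequential_decision([0.5, 1.0, 1.5]))  # never crosses
```

The stringent interim threshold is what allows repeated looks at accumulating data with little inflation of the overall type I error. Designs that also modify sample size or drop components at interim looks require more elaborate methods than this rule.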
An editorial in the Lancet called for evaluation to be the top priority in global health, noting that “a lack of knowledge about whether aid works undermines everybody's confidence in global health initiatives, and threatens the great progress so far made in mobilizing resources and political will for health programs.”14 The evaluation challenge for HIV prevention is that we must now begin to assess large, complex, and heterogeneous prevention programs, the components of which often have uncertain efficacy. To facilitate this process, Ambassador Eric Goosby, US Global AIDS Coordinator, has reaffirmed PEPFAR's commitment to rigorous evaluation of combination prevention programs under the Public Health Evaluation umbrella. In the coming months, OGAC will be drafting requests for applications for combination prevention strategies, with support for up to 5 years. In addition, a Combination Prevention Secretariat is being created as a joint collaboration of the Gates Foundation, PEPFAR, Joint United Nations Programme on HIV/AIDS, and the World Bank. The Secretariat will coordinate peer review and implementation of all combination prevention studies (both US government and non-US government supported).
Clearly, evaluation design is not a one-size-fits-all endeavor. Combination programs should be evaluated with the most rigorous methodological designs possible. Evaluators must determine the most appropriate method based on their level of control over program implementation, the resources, feasibility, and political will for a prospective or randomized design and the availability and quality of existing data. However, given the level of resources committed to HIV prevention efforts at the expense of treatment, it is critical that we continue to build a credible knowledge base on what works for HIV prevention.
The authors would like to acknowledge the contributions of participants at the Evaluation of Complex, Combination Prevention Programs meeting.
1. Padian NS, McCoy SI, Balkus JE, et al. Weighing the Gold in the Gold Standard: Challenges in HIV Prevention Research. AIDS
2. Padian NS, Buve A, Balkus J, et al. Biomedical interventions to prevent HIV infection: evidence, challenges, and way forward. Lancet
3. Piot P, Bartos M, Larson H, et al. Coming to terms with complexity: a call to action for HIV
4. Merson M, Padian N, Coates TJ, et al. Combination HIV
5. Karim QA, Karim SS, Frohlich JA, et al. Effectiveness and safety of tenofovir gel, an antiretroviral microbicide, for the prevention of HIV infection in women. Science
6. National Institutes of Health. Daily Dose of HIV Drug Reduces Risk of HIV Infection. NIH's NIAID-sponsored study finds pre-exposure prophylaxis concept effective among men who have sex with men. Press Release 2010; Available at: http://www.niaid.nih.gov/news/newsreleases/2010/Pages/iPrEx.aspx. Accessed November 30, 2010.
7. Network of Networks for Impact Evaluation (NONIE). Impact Evaluations and Development. Washington, DC; NONIE; 2009.
8. Victora CG, Black RE, Boerma JT, et al. Measuring impact in the Millennium Development Goal era and beyond: a new approach to large-scale effectiveness evaluations. Lancet
9. Baker JL. Evaluating the Impact of Development Projects on Poverty: A Handbook for Practitioners. Washington, DC: The World Bank; 2000.
10. Craig P, Dieppe P, Macintyre S, et al. Developing and evaluating complex interventions: the new Medical Research Council guidance. BMJ
11. Minnis AM, Steiner MJ, Gallo MF, et al. Biomarker validation of reports of recent sexual activity: results of a randomized controlled study in Zimbabwe. Am J Epidemiol
12. Peterman TA, Lin LS, Newman DR, et al. Does measured behavior reflect STD risk? An analysis of data from a randomized controlled behavioral intervention study. Project RESPECT Study Group. Sex Transm Dis
13. Lagarde E, Auvert B, Chege J, et al. Condom use and its association with HIV/sexually transmitted diseases in four urban communities of sub-Saharan Africa. AIDS. 2001;15(suppl 4):S71-S78.
14. The Lancet. Evaluation: the top priority for global health. Lancet
15. Kurth AE, Celum C, Baeten JM, et al. Combination HIV prevention: significance, challenges, and opportunities. Curr HIV/AIDS Rep
16. Duflo E, Kremer M. Use of randomization in the evaluation of development effectiveness. Conference on Evaluation and Development Effectiveness; July 15-16, 2003; Washington, DC.
17. Parker SW, Teruel GM. Randomization and Social Program Evaluation: The Case of Progresa. Ann Am Acad Polit Soc Sci
18. Rivera JA, Sotres-Alvarez D, Habicht JP, et al. Impact of the Mexican program for education, health, and nutrition (Progresa) on rates of growth and anemia in infants and young children: a randomized effectiveness study. JAMA
19. Hall AJ, Inskip HM, Loik F, et al. The Gambia Hepatitis Intervention Study. Cancer Res
20. Viviani S, Carrieri P, Bah E, et al. 20 years into the Gambia Hepatitis Intervention Study: assessment of initial hypotheses and prospects for evaluation of protective effectiveness against liver cancer. Cancer Epidemiol Biomarkers Prev
21. Victora CG, Habicht JP, Bryce J. Evidence-based public health: moving beyond randomized trials. Am J Public Health
22. Habicht JP, Victora CG, Vaughan JP. Evaluation designs for adequacy, plausibility and probability of public health programme performance and impact. Int J Epidemiol
23. Peterman TA. Can we measure STD risk behavior or STD as surrogates for HIV risk? Presented at International Congress of Sexually Transmitted Infections (ISSTDR/IUSTI); 2001; Bundesgesundheitsblatt-Gesundheitsforschung-Gesundheitsschutz: Berlin, Germany.
25. Brown CA, Lilford RJ. The stepped wedge trial design: a systematic review. BMC Med Res Methodol
26. Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials
27. Committee on Qualification of Biomarkers and Surrogate Endpoints in Chronic Disease. Evaluation of Biomarkers and Surrogate Endpoints in Chronic Disease. Washington, DC: Institute of Medicine; 2010.
28. Aral SO, Peterman TA. A stratified approach to untangling the behavioral/biomedical outcomes conundrum. Sex Transm Dis
29. Gray RH, Wawer MJ. Reassessing the hypothesis on STI control for HIV
30. Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 1998.
31. Stephenson J, Imrie J. Why do we need randomised controlled trials to assess behavioural interventions? BMJ
32. Rosen L, Manor O, Engelhard D, et al. In defense of the randomized controlled trial for health promotion research. Am J Public Health
33. Bonell C, Hargreaves J, Strange V, et al. Should structural interventions be evaluated using RCTs? The case of HIV prevention. Soc Sci Med
34. Arifeen SE, Hoque DM, Akter T, et al. Effect of the integrated management of childhood illness strategy on childhood mortality and nutrition in a rural area in Bangladesh: a cluster randomised trial. Lancet
35. Grosskurth H, Mosha F, Todd J, et al. Impact of improved treatment of sexually transmitted diseases on HIV infection in rural Tanzania: randomised controlled trial. Lancet
36. Pronyk PM, Hargreaves JR, Kim JC, et al. Effect of a structural intervention for the prevention of intimate-partner violence and HIV in rural South Africa: a cluster randomised trial. Lancet
37. The Gambia Hepatitis Intervention Study. The Gambia Hepatitis Study Group. Cancer Res
38. Arifeen SE, Hoque E, Akter T, et al. Effect of the integrated management of childhood illness strategy on childhood mortality and nutrition in a rural area in Bangladesh: a cluster randomised trial. Lancet
39. Skoufias E. PROGRESA and Its Impacts on the Welfare of Rural Households in Mexico. Washington, DC: International Food Policy Research Institute; 2005.
41. Arnold BF, Khush RS, Ramaswamy P, et al. Causal inference methods to study nonrandomized, preexisting development interventions. Proc Natl Acad Sci U S A
42. Hallett TB, Zaba B, Todd J, et al. Estimating incidence from prevalence in generalised HIV epidemics: methods and validation. PLoS Med
43. Rehle TM, Hallett TB, Shisana O, et al. A decline in new HIV infections in South Africa: estimating HIV incidence from three national HIV surveys in 2002, 2005 and 2008. PLoS One
44. Chow S, Chang M. Adaptive Design Methods in Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2006.
45. Bauer P, Brannath W. The advantages and disadvantages of adaptive designs for clinical trials. Drug Discov Today
46. van der Laan MJ. The Construction and Analysis of Adaptive Group Sequential Designs. Berkeley, CA: U.C. Berkeley Division of Biostatistics Working Paper Series, University of California; 2008.