AIDS

AIDS:
8 September 2000 - Volume 14 - Issue 13 - pp 1961-1972
Clinical

Considerations in choosing a primary endpoint that measures durability of virological suppression in an antiretroviral trial

Gilbert, Peter B; Ribaudo, Heather J; Greenberg, Lisa; Yu, George; Bosch, Ronald J; Tierney, Camlin; Kuritzkes, Daniel R

Author Information

From the (a) Center for Biostatistics in AIDS Research and Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA, USA; (b) Agouron Pharmaceuticals, 11095 Torreyana Road, San Diego, CA, USA; and (c) University of Colorado Health Science Center, 4200 East Ninth Avenue, Denver, CO, USA.

Received: 1 June 2000; accepted: 8 June 2000.

Sponsorship: This research was supported by grants 5-U01-AI38855, 5-U01-AI28076, and 5-U01-AI32770 from the National Institutes of Health, NIAID.

Requests for reprints to: Dr Peter Gilbert, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Ave, Boston, MA 02115, USA. Tel: +1 617 432 2899; fax: +1 617 739 1781; e-mail pgilbert@hsph.harvard.edu

Abstract

Objectives: At present, many clinical trials of anti-HIV-1 therapies compare treatments by a primary endpoint that measures the durability of suppression of HIV-1 replication. Several durability endpoints are compared.

Design: Endpoints are compared by their implicit assumptions regarding surrogacy for clinical outcomes, sample size requirements, and accommodations for inter-patient differences in baseline plasma HIV-1-RNA levels and in initial treatment response.

Methods: Virological failure is defined by the non-suppression of virus levels at a pre-specified follow-up time T (early virological failure), or by relapse. A binary virological failure endpoint is compared with three time-to-virological failure endpoints: time from (i) randomization that assigns early failures a failure time of T weeks; (ii) randomization that extends the early failure time T for slowly responding subjects; and (iii) virological response that assigns non-responders a failure time of 0 weeks. Endpoint differences are illustrated with Agouron's trial 511.

Results: In comparing high with low-dose nelfinavir (NFV) regimens in Agouron 511, the difference in Kaplan-Meier estimates of the proportion not failing by 24 weeks is 16.7% (P = 0.048), 6.5% (P = 0.29) and 22.9% (P = 0.0030) for endpoints (i), (ii) and (iii), respectively. The results differ because NFV suppresses virus more quickly at the higher dose, and the endpoints weigh this treatment difference differently. This illustrates that careful consideration needs to be given to choosing a primary endpoint that will detect treatment differences of interest.

Conclusions: A time from randomization endpoint is usually recommended because of its advantages in flexibility and sample size, especially at interim analyses, and for its interpretation for patient management.

Introduction

As a result of the growing evidence that the suppression of HIV-1 replication by antiretroviral drugs reduces the risk of progression to AIDS and death [1-12], many current and planned clinical trials of antiretroviral regimens use some measure of durability of HIV-1 suppression as the primary efficacy endpoint [13]. Typically, in antiretroviral trials, a primary endpoint for 'durability' has been determined as whether a subject has HIV-1 plasma RNA levels below a fixed suppression threshold (such as the detection or quantification limit of the HIV-1-RNA assay) at a pre-specified follow-up time. More recently, antiretroviral trials have employed alternative measures of durability based on the time-to-virological failure, i.e. the time until plasma HIV-1 RNA again becomes detectable.

There are a number of ways to define a time-to-virological failure endpoint, which differ in their interpretation and in their implicit surrogacy assumptions regarding the association between the pattern of virological response and clinical benefit. For example, if one assumes that the magnitude of the clinical benefit of treatment (i.e. reduction in disease progression or morbidity) depends on how rapidly viral suppression is achieved, then the primary endpoint might be defined so as to weight a rapid virological response heavily. Alternatively, if one assumes that the clinical benefit of treatment depends solely on maintaining complete virological suppression but is independent of the speed with which suppression is achieved, then the endpoint could be defined to minimize or ignore the time taken to achieve virological suppression and to focus exclusively on the time of virological failure. Here, we describe several possible definitions of 'time-to-virological failure' endpoints, with the aim of elucidating the surrogacy assumptions (regarding the speed of virological response, baseline viral load and other factors) that underlie the appropriate use of each endpoint. In addition, we compare these endpoints with a binary endpoint that considers the lack of suppression through a specified timepoint as virological failure, but does not distinguish at all between different ways to reach the endpoint.

In this article, our focus is on the criteria for choosing a single primary efficacy endpoint by which two or more randomized treatments may be compared. Using only one endpoint to address the primary scientific hypothesis is critically important for ensuring a valid test of treatment effects, appropriate statistical monitoring, and the interpretability of study results for regulatory decision-making [14]. We focus on endpoints that assess the durability of suppression using a subject's entire longitudinal viral load profile; cross-sectional endpoints that measure virological suppression status at a single timepoint are not considered. Cross-sectional endpoints may be useful in secondary analyses (M.D. Hughes, 1999, in preparation), but problems in their interpretation argue against their use in primary analyses. To illustrate this, in the Adult AIDS Clinical Trials Group (AACTG) trial 347 [15], all subjects assigned amprenavir monotherapy failed virologically early on and were offered a potent rescue therapy, which falsely made monotherapy look superior to the amprenavir triple therapy arm by the 24 week cross-sectional analysis.

Ideally, the primary endpoint should exhibit statistical validity in the inference of treatment differences, clear interpretability, practical utility for patient management, and surrogacy, in that tangible treatment benefit to the patient is mediated through the endpoint [16-19]. Endpoints that potentially compromise the statistical validity conferred by randomization are not considered as viable candidates. For example, some trials compare treatments by the total time that virus levels are suppressed in the subgroup of subjects with virological suppression to undetectable levels. Although this is an interesting secondary analysis (Dupont's Study 006 [20]), it is not appropriate for a primary analysis, because population selection occurs subsequent to randomization, which exposes the treatment evaluation to selection bias.

Methods

We compare virological failure endpoint definitions for an antiretroviral trial with two treatment arms, A and B, in which plasma HIV-1-RNA levels are measured at regularly scheduled intervals after randomization, and results are reported back to subjects in real time. The trial pre-defines a 'virological suppression threshold' HIV-1-RNA copy number, below which HIV-1 replication is considered to be suppressed. The chosen suppression threshold will be arbitrary unless data are available that suggest a biological meaning. Although further predictive data are needed, it appears at present that a relatively low quantification limit, e.g. 50 copies/ml, may be a biologically reasonable suppression threshold [21-27]. However, because inter-assay variability increases at low HIV-1-RNA copy numbers [28,29], less stringent definitions may be appropriate. For example, current AACTG guidelines recommend a suppression threshold of 200 copies/ml for studies using the Roche Ultrasensitive HIV-1 Monitor assay (Roche Molecular Systems, Pleasanton, CA, USA). In addition, when the complete suppression of virus is not expected (e.g. in advanced disease and highly antiretroviral-experienced patients), less stringent definitions may be reasonable.

The trial also needs to pre-define a 'virological failure threshold' HIV-1-RNA copy number above which the treatment is judged to fail to maintain virological suppression. We assume all participants have virus levels above the suppression threshold at baseline and that both drug regimens are capable of driving a proportion of subjects' virus levels below this threshold. Study subjects whose virus levels go below the suppression threshold are deemed 'responders'. A responder whose virus level subsequently rises above the failure threshold is deemed to experience virological 'relapse'.

Accounting for an unsuccessful initial response to treatment

Because not all subjects can be expected to achieve virological suppression, any virological failure endpoint must pre-specify a window period between randomization and week T, such that individuals who do not respond by week T are considered to have failed virologically. Guidelines support the view that the failure of a regimen to suppress virus within a certain timeframe indicates that the regimen will not subsequently suppress virus [30]; such non-response signals an increased risk for clinical outcomes and may imply that the treatment should be switched. This suggests that T can be reasonably defined as the time by which non-response is believed to indicate the futility of the studied regimens. Practically, T may be chosen as the duration of therapy by which most responders (e.g. > 95%) are expected to have achieved virological suppression.

Some subjects may show evidence of failing treatment significantly before time T, for example if the initial virological response is weak or absent, or if the initial fall in HIV-1 RNA quickly rebounds. To manage such patients appropriately, the trial should allow subjects to change treatments before time T [31]. For example, in ongoing AACTG trial 388, which compares three indinavir-containing regimens, subjects showing a consistent increase in HIV-1-RNA levels above baseline or 1 log10 above their nadir during the first T = 24 weeks after randomization are considered 'early failures'. These subjects then have the option to switch treatment, which does not impact the primary endpoint. In a completed trial, AACTG 347, counting a confirmed 1 log10 increase in HIV-1-RNA levels above nadir as virological failure contributed to the decision to stop the inferior monotherapy amprenavir arm 19 weeks after the first subject was randomly assigned [15]. Early failure criteria need to be well-defined to avoid the misclassification as failures of subjects whose HIV-1-RNA elevations are a direct result of random assay variability, illness, an intercurrent vaccination, or a short hiatus from taking therapy.
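As a rough illustration of a well-defined early failure criterion, the sketch below flags a confirmed rise of at least 1 log10 above the running nadir. The function name, the input layout, and the reading of 'confirmed' as two consecutive measurements meeting the criterion are assumptions made for illustration; they are not taken from the AACTG protocols.

```python
from typing import Optional, Sequence


def confirmed_rebound_week(weeks: Sequence[float],
                           log10_rna: Sequence[float],
                           rise: float = 1.0) -> Optional[float]:
    """Return the first week at which HIV-1 RNA is >= `rise` log10 above the
    nadir of all earlier measurements AND the next measurement confirms it.

    Assumption: 'confirmed' means two consecutive qualifying measurements.
    Returns None if no confirmed rebound is observed.
    """
    nadir = log10_rna[0]
    for i in range(1, len(log10_rna) - 1):
        nadir = min(nadir, log10_rna[i - 1])   # nadir over measurements before visit i
        if (log10_rna[i] - nadir >= rise
                and log10_rna[i + 1] - nadir >= rise):
            return weeks[i]
    return None


# A single unconfirmed elevation (e.g. assay variability) is not counted.
blip = confirmed_rebound_week([0, 2, 4, 8, 12], [5.0, 3.0, 2.5, 3.8, 2.6])
rebound = confirmed_rebound_week([0, 2, 4, 8, 12], [5.0, 3.0, 2.5, 3.8, 3.9])
```

Requiring confirmation is one simple way to avoid misclassifying isolated elevations as failures; a real protocol would also specify the allowed interval between the two measurements.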

Primary endpoint definitions

Two distinguishing features of a time-to-virological failure endpoint are the time origin and the rule for assigning failure times to subjects who fail early. Typically, the time origin is the time of randomization, but some studies use the time of virological response. For a 'time from randomization' endpoint, one rule that has been used assigns all early failures within T weeks a failure time of T weeks. The rationale for this rule is the lack of assurance that the orderings of early virological failure events reliably predict orderings in clinical progression rates. This issue is related to the question of surrogacy of the endpoints and is addressed further in later discussion. This endpoint is used as the primary endpoint in AACTG trial 388, in which all subjects without HIV-1 RNA of less than 200 copies/ml by week T = 24 are considered to have failed at 24 weeks.

Another rule that has been used assigns all subjects who do not respond by week T a failure time of 0 weeks. The purpose of this endpoint is to give the failure time a durability interpretation as the total time that virus levels are suppressed. Time from randomization and time from response versions of this endpoint have been used in some studies, including Dupont's efavirenz 006 trial [20] and Agouron's nelfinavir (NFV) 511 trial (with T = 24 weeks) [21], respectively.

To avoid bias in treatment comparisons, the censoring times of subjects who drop out should correspond to the minimum failure time assigned by an endpoint. Subjects censored before week 24 in AACTG 388 should thus be assigned a censoring time of week 24, and subjects censored before 24 weeks in Agouron 511 should be assigned a censoring time of 0 weeks. To illustrate how bias can result under other rules, suppose in Agouron 511 non-responding drop-outs were censored at the time of last contact. Standard time-to-failure statistical methods would consider these subjects to have been at risk for the period of observation, but would consider uncensored non-responders followed during the same period to never have been at risk, which could bias inferences. This illustrates a broader conceptual problem with endpoints that assign failure times of 0; subjects must be followed prospectively to ascertain responder status, and non-responders are retrospectively declared to have never been at risk of failure.
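The two assignment rules and their censoring conventions can be sketched in code. This is a minimal illustration under stated assumptions: the `Profile` summary, the function names, and the treatment of every subject who drops out before week T without an observed failure as censored at the endpoint's minimum failure time are our reading of the text, not a specification from any trial protocol.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Profile:
    """Hypothetical summary of one subject (all times in weeks from randomization)."""
    response_week: Optional[float]  # first week below the suppression threshold, None if never
    relapse_week: Optional[float]   # week of relapse after response, None if none observed
    last_week: float                # last week of virological follow-up


def _responded_by(p: Profile, T: float) -> bool:
    return p.response_week is not None and p.response_week <= T


def from_randomization(p: Profile, T: float) -> Tuple[float, bool]:
    """Time from randomization; early failures are assigned T. Returns (weeks, failed)."""
    if p.last_week < T and p.relapse_week is None:
        return (T, False)                      # drop-out before T: censor at T
    if not _responded_by(p, T):
        return (T, True)                       # non-response by week T
    if p.relapse_week is not None:
        return (max(p.relapse_week, T), True)  # relapse before T also counts as failing at T
    return (p.last_week, False)                # still suppressed at last follow-up


def from_response(p: Profile, T: float) -> Tuple[float, bool]:
    """Time from response; non-responders are assigned 0. Returns (weeks, failed)."""
    if p.last_week < T and p.relapse_week is None:
        return (0.0, False)                    # drop-out before T: censor at 0
    if not _responded_by(p, T):
        return (0.0, True)                     # non-response by week T
    if p.relapse_week is not None:
        return (p.relapse_week - p.response_week, True)
    return (p.last_week - p.response_week, False)


# Two responders who relapse at week 40 but respond at weeks 8 and 16.
early = Profile(response_week=8, relapse_week=40, last_week=48)
late = Profile(response_week=16, relapse_week=40, last_week=48)
print(from_randomization(early, T=16), from_randomization(late, T=16))
print(from_response(early, T=16), from_response(late, T=16))
```

In this example the time from randomization rule gives both subjects the same failure time, whereas the time from response rule credits the earlier responder with eight more weeks of suppression.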

Participants in antiretroviral trials may exhibit wide variations in the rapidity of treatment response, which may depend on pre-treatment plasma HIV-1-RNA levels, drug absorption, or adherence. Subjects with high baseline virus levels may not achieve maximal suppression until 12-24 weeks of potent therapy because of a slower 'second phase' decline after the initial drop in plasma HIV-1-RNA levels [30,32]. Within the primary endpoint definition, these issues are accommodated through appropriate choice of the early failure time T. Choosing T large will allow subjects sufficient time to suppress before considering them to have failed, but this may extend the period of inefficient treatment for subjects who are truly non-responders. An alternative to increasing T is to add a second non-response period between T and T + t weeks. The resulting 'double-window' endpoint allows subjects who have not responded by week T, but who exhibit declining virus levels, an extra t weeks to respond before being considered to have failed.

The final endpoint we consider, a binary endpoint, considers only whether a patient has experienced failure (due to early virological failure by week T or by virological relapse) at any time before the final efficacy analysis. The binary endpoint has a clear interpretation, measuring whether the suppression of virus has been sustained throughout a specified time period.

Table 1 lists some primary endpoints for durability that have been used in antiretroviral trials.

Table 1

Results

Comparison of time-to-virological failure endpoints

A central criterion for evaluating the appropriateness of a virological endpoint is the correctness of its implicit assumptions regarding surrogacy for clinical endpoints in its ranking of different plasma HIV-1-RNA response profiles. For a given time-to-failure definition, longer times imply a superior treatment benefit to the patient. Therefore, each endpoint prescribes an ordering for what is considered good and bad for a patient. The ordering varies considerably among endpoint definitions. We illustrate these differences with 12 hypothetical virological response profiles in a 48 week study (Fig. 1(a-e)).

Fig. 1

Patients A and B respond at weeks 8 and 16, respectively, each maintaining suppression at 48 weeks. The time from randomization endpoints make no distinction between these observations, whereas the time from response endpoint orders these observations: the one who was suppressed sooner is considered to have fared better. Patients C and D also respond at weeks 8 and 16, respectively, each rebounding at week 40. Again, the time from randomization endpoints do not order these observations, whereas the time from response endpoint considers the profile with earlier suppression to be superior. Patients E and F experience early virological failure in different ways, by a quick response followed by a rapid relapse and by non-response, respectively. The time from randomization endpoint that assigns non-responders a failure time of 16 weeks considers the treatment effects equal, whereas the time from randomization and time from response endpoints that assign non-responders a failure time of 0 weeks consider response E superior. A comparison of patients G and H illustrates that the time from response endpoint can consider an earlier virological rebound to represent a superior treatment benefit than a later virological rebound.

As these examples illustrate, the principal difference between the time from randomization endpoint that assigns early failures a failure time of T weeks and the time from randomization/response endpoints that assign non-responders a failure time of 0 weeks is that the latter endpoints make stronger ordering assumptions about the kinds of virological response patterns that constitute superior clinical efficacy. The endpoint disparity is largest for subjects who fail early, with the latter endpoints ordering different kinds of early response and the former endpoint considering all kinds of early virological failure to be equally bad. Furthermore, the double-window endpoint can give markedly different orderings of good and bad treatment responses than the other endpoints (Fig. 1(e)).

Binary endpoint versus time-to-failure endpoints

In that the binary endpoint does not attempt to impose an ordering of failures over time, it has the advantage of making the simplest and least presumptive surrogacy assumption: all failures throughout the follow-up period are assumed equal.

For a binary endpoint, the proportion of subjects who fail by the analysis time is compared across the treatment groups. These proportions should be estimated by the Kaplan-Meier method, because it correctly accommodates the differential follow-up of subjects and censoring by study withdrawal. For this reason, and because it has optimal statistical efficiency, a Z-test based on the difference in Kaplan-Meier estimates is recommended for testing for a treatment difference. A simpler test that compares proportions without considering individual follow-up times (e.g. a chi-squared test) should not be used, because it can give biased results [33].
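The following sketch shows one way such a Z-test could be computed. The use of Greenwood's formula for the variance of each Kaplan-Meier estimate and the two-sided normal P value are assumptions on our part (the text does not name a variance estimator), and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def km_survival(times: Sequence[float], failed: Sequence[bool],
                t: float) -> Tuple[float, float]:
    """Kaplan-Meier estimate of P(not failed by t) and its Greenwood variance."""
    data = sorted(zip(times, failed))
    at_risk = len(data)
    surv, gw_sum, i = 1.0, 0.0, 0
    while i < len(data) and data[i][0] <= t:
        u = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == u:   # group tied times
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            if at_risk > deaths:
                gw_sum += deaths / (at_risk * (at_risk - deaths))
        at_risk -= removed                          # failures and censorings leave the risk set
    return surv, surv * surv * gw_sum


def km_z_test(times_a, failed_a, times_b, failed_b, t) -> Tuple[float, float, float]:
    """Difference in Kaplan-Meier estimates at time t, Z statistic, two-sided P value."""
    s_a, v_a = km_survival(times_a, failed_a, t)
    s_b, v_b = km_survival(times_b, failed_b, t)
    se = math.sqrt(v_a + v_b)
    if se == 0.0:
        raise ValueError("no variability in either estimate at time t")
    z = (s_a - s_b) / se
    return s_a - s_b, z, math.erfc(abs(z) / math.sqrt(2.0))
```

With no censoring the Kaplan-Meier estimate reduces to the simple proportion not failed and Greenwood's variance to the binomial variance, so the two approaches differ only when follow-up is incomplete.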

Generally, a larger sample size is needed for a binary endpoint than for a time-to-failure endpoint. Assuming proportional hazards for the treatments, the standard test for a time-to-failure endpoint, the log-rank test, is always at least as efficient as the most efficient test for comparing proportions [34]. The efficiency gain of the log-rank test is greater when the accrual period is relatively long compared with the mean time to virological failure. For example, assuming no withdrawal, uniform accrual over 6 months, and 1 year additional follow-up, the sample size savings are 2, 5, and 15% when the pooled failure rate by the analysis time at 1.5 years is 30, 50, and 70%, respectively [35]. For a 1 year accrual period with 6 months additional follow-up, the sample size savings increase to 5, 12, and 25%, respectively. If the rate of study withdrawal is not extremely high (< 50%), then the sample size savings increase further [35]. In addition, the sample size gain will be larger at interim analyses than at final analyses, in proportion to the fraction of subjects who have less follow-up time than the specified interim analysis time.
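For readers unfamiliar with the log-rank test, a minimal two-sample version is sketched below. It is a plain textbook formulation written for illustration, not the software used in the analyses reported here; the function name and input layout are assumptions.

```python
import math
from typing import Sequence, Tuple


def logrank_test(times_a: Sequence[float], failed_a: Sequence[bool],
                 times_b: Sequence[float], failed_b: Sequence[bool]
                 ) -> Tuple[float, float]:
    """Two-sample log-rank test. Returns (chi-squared statistic on 1 df, P value)."""
    pooled = ([(t, f, 0) for t, f in zip(times_a, failed_a)]
              + [(t, f, 1) for t, f in zip(times_b, failed_b)])
    event_times = sorted({t for t, f, _ in pooled if f})
    observed_minus_expected = 0.0
    variance = 0.0
    for u in event_times:
        n_a = sum(1 for t, _, g in pooled if t >= u and g == 0)   # at risk in arm A
        n_b = sum(1 for t, _, g in pooled if t >= u and g == 1)   # at risk in arm B
        d_a = sum(1 for t, f, g in pooled if t == u and f and g == 0)
        d_b = sum(1 for t, f, g in pooled if t == u and f and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("no informative failure times")
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Unlike a comparison of proportions at a single analysis time, this statistic accumulates evidence at every observed failure time, which is the source of the efficiency gain described above.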

The above relative sample size calculations do not consider the grouping of failure times by protocol specified measurement times, which implies that failure times are interval censored. If HIV-1 RNA is infrequently measured, then the log-rank test is no longer appropriate, and a modified test that explicitly accounts for the interval censoring may be preferred [36].

A time-to-failure endpoint also has advantages for evaluating covariate effects and for flexibility in extending the study by prolonging the follow-up period. Perhaps most importantly, its interpretation corresponds more closely to clinical practice than that of a binary endpoint, because physicians monitor viral load levels in patients over time, and consider switching therapy when suppression is lost or when the virus is not responding.

Example: Agouron's trial 511 of nelfinavir, zidovudine, and lamivudine

Agouron's trial 511, which compared zidovudine (ZDV) plus lamivudine (3TC) to ZDV/3TC plus 500 mg NFV to ZDV/3TC plus 750 mg NFV in 297 antiretroviral-naive subjects, used the time from response endpoint that assigns non-responders within 24 weeks a failure time of 0 weeks as the primary efficacy endpoint. Virological response was defined as two consecutive plasma HIV-1-RNA measurements at or below 400 copies/ml, the quantification limit of the Roche Amplicor Standard assay. The response time was defined as the clinic visit date of the first such measurement, and the time of virological failure was defined as the first date of two consecutive HIV-1-RNA measurements above 400 copies/ml. An analysis of data from the first 24 weeks of follow-up showed conclusively that NFV at either dose significantly improved the durability of suppression over ZDV/3TC alone (log-rank test P values < 0.001). Consequently, after the interim analysis of the 24 week data, the ZDV/3TC arm was discontinued. There was a 24 week blinded extension period for continuing the comparison of the two NFV-containing regimens. Subjects on the ZDV/3TC arm were offered randomization to the NFV arms.
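The response and failure definitions used in trial 511 can be expressed as a short routine. This is a sketch only: the function name and the visit-week input format are assumptions, and the handling of missed visits, which a real analysis plan would need to specify, is ignored.

```python
from typing import Optional, Sequence, Tuple


def response_and_failure(weeks: Sequence[float], copies: Sequence[float],
                         threshold: float = 400.0
                         ) -> Tuple[Optional[float], Optional[float]]:
    """Return (response_week, failure_week) for one subject.

    Response: first of two consecutive measurements at or below `threshold`.
    Failure:  after response, first of two consecutive measurements above it.
    Either element is None if the event was not observed.
    """
    response_idx = None
    for i in range(len(copies) - 1):
        if copies[i] <= threshold and copies[i + 1] <= threshold:
            response_idx = i
            break
    if response_idx is None:
        return None, None
    for j in range(response_idx + 2, len(copies) - 1):
        if copies[j] > threshold and copies[j + 1] > threshold:
            return weeks[response_idx], weeks[j]
    return weeks[response_idx], None


visits = [0, 4, 8, 12, 16, 20, 24]
# A single value above 400 copies/ml at week 16 is not confirmed, so no failure.
print(response_and_failure(visits, [50000, 3000, 350, 200, 900, 300, 250]))
# Two consecutive values above 400 copies/ml from week 16 count as failure.
print(response_and_failure(visits, [50000, 3000, 350, 200, 900, 1500, 250]))
```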

To our knowledge, this was the first anti-HIV-1 efficacy trial to use a time-to-virological failure primary endpoint. We re-analysed the data with five virological endpoints, all of which use a suppression threshold and a failure threshold equal to 400 copies/ml. In addition to the primary endpoint, the studied endpoints are: time from randomization that assigns non-responders within 8 weeks a failure time of 8 weeks, time from randomization that assigns non-responders within 16 weeks a failure time of 16 weeks, a binary endpoint measured at the study termination week 24 or 48, and a double-window endpoint with two 8 week windows. For the double-window endpoint, subjects whose HIV-1-RNA levels were above 400 copies/ml at week 8 but had dropped at least 1.2 log10 from baseline were allowed to remain on study drug until week 16, whereas all other subjects with viral load greater than 400 copies/ml at week 8 were counted as virological failures. These endpoints are chosen to compare a time from response endpoint to time from randomization endpoints, to assess the impact of the choice of the early failure window period for the time from randomization endpoint (8 versus 16 weeks), to compare a binary endpoint to time-to-failure endpoints, and to compare a double-window endpoint with the other endpoints. We compared treatments in all randomized subjects and in sub-groups defined by baseline HIV-1-RNA strata. Sub-group analyses were conducted to exemplify how the study population may imply the need for a different choice of endpoint.
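The week 8 decision rule of the double-window endpoint can be sketched as follows. The function name and the three status labels are hypothetical; the thresholds (400 copies/ml and a 1.2 log10 drop) are those stated above.

```python
import math


def week8_status(baseline_copies: float, week8_copies: float,
                 suppression_threshold: float = 400.0,
                 min_log10_drop: float = 1.2) -> str:
    """Classify a subject at the end of the first 8 week window.

    'responder': at or below the suppression threshold at week 8
    'extended' : above threshold, but dropped >= min_log10_drop from baseline,
                 so the subject gets the second window (until week 16)
    'failed'   : above threshold without a sufficient drop
    """
    if week8_copies <= suppression_threshold:
        return "responder"
    drop = math.log10(baseline_copies) - math.log10(week8_copies)
    return "extended" if drop >= min_log10_drop else "failed"


print(week8_status(100000, 300))     # suppressed by week 8
print(week8_status(100000, 5000))    # about a 1.3 log10 drop: second window
print(week8_status(100000, 20000))   # about a 0.7 log10 drop: failure at week 8
```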

For each endpoint, Table 2 displays treatment effects as the difference in Kaplan-Meier estimates of the proportion not failed by week 24 or 48. Notice that the treatment effect sizes are equal for the time from response and binary endpoints, and the respective log-rank and binary Z-test P-values are comparable for all analyses. This shows that Agouron's original conclusions [21] are not changed if a binary endpoint is used.

Table 2

Both NFV arms perform highly significantly better than the ZDV/3TC arm (P < 0.001) according to each of the endpoints (Fig. 2). An instructive exception is found in the comparison of the 500 mg NFV arm with the ZDV/3TC arm in the high baseline RNA stratum. The treatment difference is 10.0% (P = 0.039) for the time from randomization with the T = 8 endpoint, whereas it ranges between 33.9 and 50.6% (P < 0.001) for the other endpoints (Table 2). This is because all nine ZDV/3TC responders responded within 8 weeks (median 3.9 weeks), whereas 36% of 500 mg NFV responders responded between weeks 8 and 16. The time from randomization with the T = 8 endpoint thus placed these Kaplan-Meier curves closer together at week 8 than they were placed by the other endpoints. The clear superiority of the 500 mg NFV arm over ZDV/3TC implies that the endpoint with the 8 week window gives a misleading impression, suggesting that an 8 week window period is too short.

Fig. 2

The sensitivity of treatment comparisons to the window size is further illustrated by the comparison of the two NFV arms through 24 weeks in the stratum with low baseline RNA. In this population, the time from randomization with T = 16 and double-window endpoints yield non-significant treatment differences (P > 0.05), whereas the other three endpoints yield significant differences (P < 0.05). The reason for this is that a large proportion of responders on the 500 mg NFV arm failed shortly after 8 weeks, and the time from randomization endpoint with T = 16 and the double-window endpoint assigned these subjects a relatively long failure time, 16 weeks. A choice of T = 8 would have given significant results (data not shown).

Fig. 3 shows Kaplan-Meier curves for the two NFV-containing arms extending to 48 weeks. Among responders, subjects on the 500 mg NFV arm took a median 7.6 weeks to respond compared with 6.1 weeks for subjects on the 750 mg NFV arm. Reflecting the quicker response to higher dose NFV, between weeks 4 and 24 the time from response curves diverge to a large degree, the time from randomization curves with T = 8 diverge to a lesser degree, and the time from randomization curves with T = 16 are nearly parallel. Also, the amount of separation of the curves at any timepoint decreases across these endpoints, e.g. at week 16 the difference decreases from 24.3 to 19.0 to 11.5%. This illustrates the influence of the differential speed of virological response on the time from response endpoint.

Fig. 3

Fig. 3 illustrates that the double-window endpoint can give dramatically different results compared with the other time-to-failure endpoints. The log-rank test for comparing NFV-containing arms is not significant for the double-window endpoint (P = 0.29), but is highly significant for the time from response, time from randomization with T = 8, and binary endpoints (P ≤ 0.003), and marginally significant for the time from randomization with the T = 16 endpoint (P = 0.048). To see why, notice that the Kaplan-Meier curves are close together before week 16 for the double-window endpoint but have a similar substantial separation for the other endpoints. The curves for the double-window endpoint are close up to week 8 because most of the subjects on the 500 mg NFV arm who took more than 8 weeks to respond were not declared to have failed at week 8 because they had a drop of at least 1.2 log10 from baseline.

In conclusion, the various endpoints gave comparable results for NFV versus non-NFV arms, but appreciable differences were found for the low versus the high-dose NFV arms. The binary endpoint, time from response endpoint, and time from randomization endpoint with T = 8 gave similar results that differed from both the time from randomization endpoint with T = 16 and the double-window endpoint. A quicker virological response for the high-dose group apparently caused these differences, as the time from response and the time from randomization with T = 8 endpoints credited this to the high-dose treatment's favor, whereas the latter two endpoints did not.

For the dose comparison, which endpoint gives the best clinical description of treatment differences? Not the double-window endpoint: its conclusions were inconsistent with those of the other endpoints because of a clinically unimportant factor, namely that the proportion of non-responders by week 16 who had at least a 1.2 log10 drop from baseline at week 8 differed between the arms. Among the other endpoint choices, one's beliefs about surrogacy for clinical outcomes determine the preferred endpoint.

We propose that the time from randomization endpoint with T = 16 makes the most plausible surrogacy assumptions. It is more appropriate than the time from response endpoint and the time from randomization endpoint with T = 8 if one assumes that the timing of virological relapse is more predictive of the risk of clinical progression than the speed of virological response, and that orderings in virological relapse times between weeks 8 and 16 are not predictive of the time to clinical events. Furthermore, this endpoint is a more reliable surrogate endpoint than the binary endpoint if orderings in virological failure times between 16 and 48 weeks are somewhat predictive of orderings in clinical event failure times. In this case, the binary endpoint needlessly ignores information that some responding subjects relapse sooner than others.

Of course, until ample data from various trials with long-term follow-up are available that delineate relationships between treatment effects on virological endpoints and on clinical endpoints, there will be considerable uncertainty about the most clinically relevant endpoint. Therefore, study teams must understand the clinical beliefs that underlie the appropriate use of various candidate virological endpoints, and carefully select the endpoint that is believed most reliably to address the clinical objective of the trial. The Agouron 511 example highlights the importance of this, as the decision to license NFV at the 750 mg dose was based on the analysis of a particular endpoint (the time from response), and the lower dose could conceivably have been licensed if a different primary endpoint had been used.

Discussion

At present, many antiretroviral trials compare treatments primarily by the durability of virological suppression. Given the many ways that this can be defined and the differences in implicit assumptions that the endpoints make in ranking various virological response profiles, careful attention must be paid to the choice of primary endpoint at the design phase of the trial. Theoretical considerations and the application of several endpoint definitions to data from Agouron's Study 511 showed that the interpretation and results of treatment comparisons are sensitive to the adopted endpoint.

The observed virological response to treatment consists of two parts, the pattern of HIV-1-RNA values between randomization and virological response, and the time between virological response and relapse. To obtain a full description of virological treatment differences, it is important to analyse both aspects separately and jointly. The initial response can be analysed with secondary endpoints including the time-to-virological response, the rate of HIV-1-RNA change, the area-under-the-curve minus baseline [37,38], or decay rate parameters in viral dynamics models [39], each potentially informative about initial drug potency. The second component, the durability of treatment response, can be assessed with the time from response endpoint considered here. Alternatively, the durability of response can be assessed with a time from randomization endpoint, as analysed by a weighted log-rank statistic that downweights treatment differences during the initial response period.
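As an example of one such secondary measure, the sketch below computes an area-under-the-curve minus baseline summary. The text does not give a formula, so the specific definition used here (trapezoidal area under the log10 HIV-1-RNA curve, divided by the length of follow-up, minus the baseline log10 value) is an assumption, as is the function name.

```python
from typing import Sequence


def auc_minus_baseline(weeks: Sequence[float], log10_rna: Sequence[float]) -> float:
    """Time-averaged area under the log10 RNA curve minus the baseline value.

    Assumes the first measurement is the baseline and that `weeks` is increasing.
    Negative values indicate an average decline from baseline.
    """
    if len(weeks) < 2 or len(weeks) != len(log10_rna):
        raise ValueError("need at least two paired measurements")
    area = sum((weeks[i + 1] - weeks[i]) * (log10_rna[i] + log10_rna[i + 1]) / 2.0
               for i in range(len(weeks) - 1))            # trapezoidal rule
    return area / (weeks[-1] - weeks[0]) - log10_rna[0]


# Baseline 5.0 log10, falling to 3.0 log10 by week 4 and staying there to week 8.
print(auc_minus_baseline([0, 4, 8], [5.0, 3.0, 3.0]))
```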

The appropriate choice of virological primary endpoint also depends on the duration of follow-up, the investigational treatment regimens, and the patient population. For studies with relatively long follow-up such as AACTG trials 384 and 388 with 2 or more years, it is compelling to use a time-to-failure endpoint for its clinical interpretability. For studies with short follow-up (e.g. 24 or 48 weeks), the treatment regimens and patient population inform the choice of endpoint. If potent antiretroviral regimens are studied in antiretroviral-naive or mildly experienced patients, only modest efficiency gains can be realized by using a time-dependent endpoint over a binary endpoint for the final analysis, although the gains may be substantial for interim analyses. In heavily experienced populations or in salvage trials for previous virological failures, a large proportion of subjects may be expected to fail during a short follow-up period, and distinguishing between differences of several weeks of suppression via a time-to-failure endpoint, with improved power, may be a primary goal in comparing treatments. In addition, for antiretroviral-experienced populations, only a small fraction of subjects may have virus levels suppressed below the suppression threshold, in which case primary endpoints defined by clinical progression or by the amount of change in virus levels from baseline merit consideration.

Furthermore, in constructing a virological primary endpoint, it is necessary to decide how to handle non-virological events such as treatment discontinuation or change, AIDS-defining illnesses or death. One approach, taken by AACTG 388, is to use a purely virological endpoint, which only considers increases in plasma HIV-1-RNA levels above the failure threshold as failure events. According to the intent-to-treat principle, trial participants who switch treatments are analysed according to their originally randomized treatment, and are followed beyond the time of treatment discontinuation for virological endpoints. The opposite approach, taken by AACTG 384, is to use a composite endpoint that considers treatment discontinuation or switch, an AIDS-defining event, death, or study withdrawal as failure events in addition to virological failure. This endpoint deems a regimen successful if it adequately suppresses viral replication and the patient can tolerate it; otherwise it fails. This endpoint reflects the clinical belief that the need to switch regimens, regardless of whether it is precipitated by virological failure or intolerability, is a better surrogate for clinical progression than virological failure alone. The plausibility of this surrogate is supported by evidence that the need for treatment changes may imply greater problems with drug resistance (accumulation of more 'mutational cost') and toxicities, which diminishes the pool of viable future treatment options. A drawback of the composite endpoint is that it may preclude evaluations of efficacy and safety separately, thereby not permitting risk-benefit assessments. Generally, the choice of a virological versus a composite endpoint is driven by the primary scientific question, the patient population, and clinical beliefs about the mechanism of antiretroviral efficacy.
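The contrast between the two approaches can be made concrete with a small sketch. The record layout (`Subject`) and both function names are hypothetical, and the code is not the analysis plan of AACTG 384 or AACTG 388. As a simplifying assumption, a subject with no qualifying event is treated as censored at the last HIV-1-RNA follow-up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Subject:
    # Times are weeks from randomization; None means the event was not observed.
    virological_failure: Optional[float]
    treatment_change: Optional[float]  # discontinuation or switch
    clinical_event: Optional[float]    # AIDS-defining event or death
    withdrawal: Optional[float]        # study withdrawal
    last_follow_up: float              # last week with HIV-1-RNA follow-up


def virological_endpoint(s: Subject) -> Tuple[float, bool]:
    """Return (time, failed), counting only virological failure as an event."""
    if s.virological_failure is not None:
        return s.virological_failure, True
    return s.last_follow_up, False


def composite_endpoint(s: Subject) -> Tuple[float, bool]:
    """Return (time, failed), counting the earliest of any listed event."""
    times = [t for t in (s.virological_failure, s.treatment_change,
                         s.clinical_event, s.withdrawal) if t is not None]
    if times:
        return min(times), True
    return s.last_follow_up, False


# Hypothetical subject who switches regimen at week 12 for intolerance and
# remains suppressed through week 48 on the new regimen.
switcher = Subject(None, 12.0, None, None, 48.0)
purely_virological = virological_endpoint(switcher)
composite = composite_endpoint(switcher)
```

Under the purely virological definition this subject is a success censored at week 48, whereas under the composite definition the same subject fails at week 12. The example shows why the two endpoints can rank regimens differently when tolerability differs between arms.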

An endpoint must also adopt a rule for handling loss to follow-up. The most common rules count lost subjects as censored or as failures. These rules implicitly assume that loss to follow-up is non-informative or completely informative of the risk of virological failure, respectively. Because reasons for loss vary widely from drug toxicities to reasons independent of treatment, each assumption is probably imperfect. Therefore, it is important to conduct sensitivity analyses to determine how much the conclusions differ under different assumptions about loss to follow-up [40]. Incorporating sequential HIV-1-RNA measurements into a rule for handling loss to follow-up may improve the plausibility of the rule's implicit assumptions [20].
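A simple form of such a sensitivity analysis is to estimate the proportion remaining failure-free under each of the two common rules and compare the results. The sketch below uses a plain Kaplan-Meier estimator rather than the semiparametric methods of reference 40; the function names, status labels, and data are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def km_survival(data: List[Tuple[float, bool]], t: float) -> float:
    """Kaplan-Meier estimate of the probability of no failure by time t.

    data holds (time, failed) pairs; failed=False means censored.
    """
    surv = 1.0
    event_times = sorted({time for time, failed in data if failed and time <= t})
    for et in event_times:
        at_risk = sum(1 for time, _ in data if time >= et)
        events = sum(1 for time, failed in data if failed and time == et)
        surv *= 1.0 - events / at_risk
    return surv


def apply_loss_rule(records: List[Tuple[float, str]],
                    lost_as_failure: bool) -> List[Tuple[float, bool]]:
    """Map statuses 'failure', 'lost', 'complete' to (time, failed) pairs."""
    out = []
    for time, status in records:
        failed = status == "failure" or (status == "lost" and lost_as_failure)
        out.append((time, failed))
    return out


# Hypothetical arm of six subjects followed for 24 weeks.
records = [(8, "failure"), (12, "lost"), (16, "failure"),
           (20, "lost"), (24, "complete"), (24, "complete")]
censored_rule = km_survival(apply_loss_rule(records, False), 24)
failure_rule = km_survival(apply_loss_rule(records, True), 24)
```

In this toy arm the estimated failure-free proportion at week 24 is 0.625 when lost subjects are censored and about 0.33 when they are counted as failures. A gap of this size between the two rules would indicate that the conclusions depend heavily on the assumption made about loss to follow-up.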

Conclusion

We propose that a purely virological time-to-failure endpoint is usually preferred as the primary endpoint, and recommend that it be used unless there are compelling arguments for a different endpoint. Because of its disadvantages at interim analyses, and its potential efficiency loss at final analyses, a binary endpoint should only be used when there is strong evidence or a belief that orderings of virological failure times within the follow-up period are not predictive of orderings in clinical failure times. Within the class of time-to-virological failure endpoints, we recommend a time from randomization endpoint that assigns subjects with early virological failure within T weeks a failure time of T weeks, where T is chosen large enough so that over 95% of responders are expected to respond by time T. This endpoint is preferable to the time from response endpoint under the belief that treatment differences in the time of virological rebound are more important clinically than treatment differences in the initial response rate. In the absence of clinical predictive data, this assumption is the most plausible, as the time of virological rebound is more proximal to the time of clinical outcomes. The recommended endpoint is also preferable to the time from randomization endpoint that assigns non-responders a failure time of 0 weeks, because it is unappealing, practically and statistically, retrospectively to declare subjects to have failed at the time of randomization.
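The difference between the recommended time from randomization endpoint and the time from response endpoint can be sketched in code. The version below rests on simplifying assumptions that are not part of the definitions in this paper: response is taken as the first single measurement below the suppression threshold at or before week T, relapse as the first later measurement at or above the threshold, and no confirmatory measurement is required. All function names, the threshold of 400 copies/ml, and the data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def _response_and_relapse(weeks: Sequence[float], rna: Sequence[float],
                          threshold: float, T: float
                          ) -> Tuple[Optional[float], Optional[float]]:
    """Return (week of response, week of relapse); None where not observed."""
    response = next((w for w, v in zip(weeks, rna)
                     if w <= T and v < threshold), None)
    if response is None:
        return None, None
    relapse = next((w for w, v in zip(weeks, rna)
                    if w > response and v >= threshold), None)
    return response, relapse


def time_from_randomization(weeks, rna, threshold, T) -> Tuple[float, bool]:
    """Early failures (by week T) are assigned a failure time of T weeks."""
    response, relapse = _response_and_relapse(weeks, rna, threshold, T)
    if response is None:
        return float(T), True
    if relapse is not None:
        return float(max(relapse, T)), True
    return float(weeks[-1]), False  # censored at the last visit


def time_from_response(weeks, rna, threshold, T) -> Tuple[float, bool]:
    """Non-responders are assigned a failure time of 0 weeks."""
    response, relapse = _response_and_relapse(weeks, rna, threshold, T)
    if response is None:
        return 0.0, True
    if relapse is not None:
        return float(relapse - response), True
    return float(weeks[-1] - response), False


weeks = [0, 4, 8, 16, 24, 32, 40, 48]
slow = [50000, 8000, 900, 300, 200, 200, 600, 700]     # responds at week 16
fast = [50000, 300, 200, 200, 200, 200, 600, 700]      # responds at week 4
never = [50000, 9000, 5000, 3000, 2000, 2500, 4000, 6000]

results = {
    name: (time_from_randomization(weeks, rna, 400, 24),
           time_from_response(weeks, rna, 400, 24))
    for name, rna in (("slow", slow), ("fast", fast), ("never", never))
}
```

Both hypothetical responders rebound at week 40, so the time from randomization endpoint assigns them the same failure time. The time from response endpoint assigns the slow responder 24 weeks and the fast responder 36 weeks, that is, it credits the slow responder with a shorter duration even though the rebound occurs at the same calendar time.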

Acknowledgements

The authors would like to thank members of the HIV Disease Research Agenda Committee of the Adult AIDS Clinical Trials Group and physicians at the University of California San Francisco for participating in a primary virological endpoint survey. The authors are also grateful to Victor DeGruttola and two anonymous referees for instructive review of the manuscript.

References

1. Mellors JW, Munoz A, Giorgi JV, et al. Plasma viral load and CD4+ lymphocytes as prognostic markers of HIV-1 infection. Ann Intern Med 1997, 126: 946-954.

2. Phillips AN, Eron JJ, Bartlett JA, et al. HIV-1 RNA levels and the development of clinical disease. AIDS 1996, 10: 859-865.

3. Hammer SM. Advances in antiretroviral therapy and viral load monitoring. AIDS 1996, 10 (Suppl. 3): S1-S11.

4. Hughes MD, Johnson VA, Hirsch MS, et al. Monitoring plasma HIV-1 RNA levels in addition to CD4+ lymphocyte count improves assessment of antiretroviral therapeutic response. Ann Intern Med 1997, 126: 929-938.

5. Hammer SM, Squires KE, Hughes MD, et al. A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. N Engl J Med 1997, 337: 725-733.

6. Gulick R, Mellors J, Havlir D, et al. Treatment with indinavir, zidovudine, and lamivudine in adults with human immunodeficiency virus infection and prior antiretroviral therapy. N Engl J Med 1997, 337: 734-739.

7. Collier AC, Coombs RW, Schoenfeld DA, et al. Treatment of human immunodeficiency virus infection with saquinavir, zidovudine, and zalcitabine: AIDS Clinical Trials Group. N Engl J Med 1996, 334: 1011-1017.

8. Mocroft A, Vella S, Benfield TL, et al. Changing patterns of mortality across Europe in patients infected with HIV-1. Lancet 1998, 352: 1725-1730.

9. Palella FJ Jr, Delaney KM, Moorman AC, et al. Declining morbidity and mortality among patients with advanced human immunodeficiency virus infection. N Engl J Med 1998, 338: 853-860.

10. Vittinghoff E, Scheer S, O'Malley P, Colfax G, Holmberg SD, Buchbinder SP. Combination antiretroviral therapy and recent declines in AIDS incidence and mortality. J Infect Dis 1999, 179: 717-720.

11. Detels R, Munoz A, McFarlane G, et al. Effectiveness of potent antiretroviral therapy on time to AIDS and death in men with known HIV infection duration. JAMA 1998, 280: 1497-1503.

12. Hoyert DL, Kochanek KD, Murphy SL. Deaths: final data for 1997. Natl Vital Stat Rep 1999, 47: 1-104.

13. Saag MS, Holodniy M, Kuritzkes DR, et al. HIV viral load markers in clinical practice. Nat Med 1996, 2: 625-629.

14. Moye LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium. Stat Med 2000, 19: 767-779.

15. Murphy RL, Gulick RM, De Gruttola V, et al. Treatment with amprenavir with or without zidovudine plus lamivudine in adults with human immunodeficiency virus infection. J Infect Dis 1999, 179: 808-816.

16. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989, 8: 431-440.

17. Fleming TR. Evaluation of active control trials in AIDS. J Acquir Immune Defic Syndr 1990, 3: S82-S87.

18. Fleming TR. Evaluating therapeutic interventions (with Discussion and Rejoinder). Stat Sci 1992, 7: 428-456.

19. Fleming TR. Surrogate markers in AIDS and cancer trials. Stat Med 1994, 13: 1423-1435.

20. Staszewski S, Morales-Ramirez JO, Godofsky EW, et al. Longer time-to-treatment failure and durability of response with efavirenz + ZDV + 3TC: first analysis of full 1266 patient cohort from study 006. 39th Interscience Conference on Antimicrobial Agents and Chemotherapy. San Francisco, January 1999 [Abstract 507].

21. Yu G, Clendenin NJ, Quart BD, Chang Y, Greenberg L. Statistical methodology for maximally HIV-RNA suppressive treatments in HIV infected subjects. Drug Inform J 2000, in press.

22. Schuurman R, Nijhuis M, van Leeuwen R, et al. Rapid changes in human immunodeficiency virus type 1 RNA load and appearance of drug-resistant virus populations in persons treated with lamivudine (3TC). J Infect Dis 1995, 171: 1411-1419.

23. Gunthard HF, Wong JK, Ignacio CC, Guatelli JC, Riggs NL, Havlir DV, Richman DD. Human immunodeficiency virus replication and genotypic resistance in blood and lymph nodes after a year of potent antiretroviral therapy. J Virol 1998, 72: 2422-2428.

24. Emini EA, Holder DJ, Schleif WA, et al. Evidence for prevention of new HIV-1 infection cycles in patients treated with indinavir plus zidovudine plus lamivudine. Program and Abstracts of the International Workshop on HIV Resistance, Treatment Strategies, and Eradication. St. Petersburg, June 1997 [Abstract 128].

25. Imamichi H, Zhang Y-M, Lane HC, Falloon J, Salzman NP. Continued evolution of HIV-1 during combination therapy despite levels of HIV-1 < 500 copies/ml. Program and Abstracts of the International Workshop on HIV Resistance, Treatment Strategies, and Eradication. St. Petersburg, June 1997 [Abstract 63].

26. Wong JK, Hezareh M, Gunthard HF, Havlir DV, Ignacio CC, Spina CA, Richman DD. Recovery of replication-competent HIV despite prolonged suppression of viremia. Science 1997, 278: 1291-1295.

27. Finzi D, Hermankova M, Pierson T, et al. Identification of a reservoir for HIV-1 in patients on highly active antiretroviral therapy. Science 1997, 278: 1295-1300.

28. Brambilla D, Granger S, Bremer J. Variation in HIV RNA assays at low RNA concentration. Program and Abstracts of the Seventh Conference on Retroviruses and Opportunistic Infections. San Francisco, January 2000 [Abstract 774].

29. Sun R, Ku J, Jayakar H, et al. Ultrasensitive reverse transcription-PCR assay for quantitation of human immunodeficiency virus type 1 RNA in plasma. J Clin Microbiol 1998, 36: 2964-2969.

30. Carpenter CC, Fischl MA, Hammer SM, et al. Antiretroviral therapy for HIV infection in 1998: updated recommendations of the International AIDS Society - USA Panel. JAMA 1998, 280: 78-86.

31. De Gruttola V, Hughes M, Gilbert P, Phillips A. Trial design in the era of highly effective antiviral drug combinations for HIV infection. AIDS 1998, 12: S149-S156.

32. Ho DD. Can HIV be eradicated? Program and Abstracts of the Fourth Conference on Retroviruses and Opportunistic Infections. Washington DC; January 1997 [Abstract S1].

33. Spritzler JG, Tsiatis AA. A statistical method giving early and unbiased conclusions from clinical trials. Drug Inform J 1994, 28: 805-816.

34. Cuzick J. The efficiency of the proportions test and the logrank test for censored survival data. Biometrics 1982, 38: 1033-1039.

35. Gail MH. Applicability of sample size calculations based on a comparison of proportions for use with the logrank test. Cont Clin Trials 1985, 6: 112-119.

36. Finkelstein DM. A proportional hazards model for interval-censored failure time data. Biometrics 1986, 42: 845-854.

37. Dawson JD, Lagakos SW. Analyzing laboratory marker changes in AIDS clinical trials. J Acquir Immune Defic Syndr 1991, 4: 667-676.

38. Dawson JD. Comparing treatment groups on the basis of slopes, areas-under-the-curve, and other summary measures. Drug Inform J 1994, 28: 723-732.

39. Wu H, Ding AA, De Gruttola V. Estimation of HIV dynamic parameters. Stat Med 1998, 17: 2463-2485.

40. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J Am Stat Assoc 1999, 94: 1096-1146.

Keywords:

Antiretroviral therapy; clinical trials; statistics; surrogate markers; viral load

© 2000 Lippincott Williams & Wilkins, Inc.